I need to merge several datasets (including the lists mentioned in the title) in order to estimate the gap in IUCN species assessments in Italy. My main dataset is a 20-year-old checklist of all fauna species present in Italy, and I would like to merge it, by species and subspecies name, with national, European and global Red Lists, as well as policy assessments and priority lists. Because it is this old, my reference checklist is likely to contain many outdated species names, as well as subspecies names that will appear differently in the Red List datasets.
I have tried using merge tools in Excel and fuzzyjoin in RStudio; however, these are not as efficient as I would like them to be, especially for species whose names have changed drastically over the years.
A species list this long (between 60,000 and 80,000 species across the main checklist and the others) cannot realistically be checked manually for merging errors, so I am asking whether there are machine learning methods in Python (or any other method not listed above) that can merge these datasets with a high level of accuracy.
Thank you!
P.S. I have already considered subdividing the datasets into species and subspecies and then merging them separately, at least to reduce the error rate. Do you recommend this option?
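To make the question concrete, here is a minimal sketch of the split-then-match idea in Python, using only `difflib` from the standard library; all species names and the cutoff value are hypothetical placeholders, not my actual data:

```python
import difflib

# Hypothetical checklist and Red List name columns.
checklist = ["Canis lupus italicus", "Podarcis sicula", "Ursus arctos marsicanus"]
red_list = ["Canis lupus", "Podarcis siculus", "Ursus arctos"]

# Separate trinomials (subspecies) from binomials (species).
species = [n for n in checklist if len(n.split()) == 2]
subspecies = [n for n in checklist if len(n.split()) == 3]

# Fuzzy-match species names directly against the Red List names.
for name in species:
    match = difflib.get_close_matches(name, red_list, n=1, cutoff=0.8)
    print(name, "->", match[0] if match else "NO MATCH")

# For subspecies, fall back to the parent binomial when the trinomial is absent.
for name in subspecies:
    parent = " ".join(name.split()[:2])
    match = difflib.get_close_matches(parent, red_list, n=1, cutoff=0.8)
    print(name, "->", match[0] if match else "NO MATCH")
```

This handles small spelling drift (e.g. gender endings), but as noted above it fails on drastic renames, which is the part I am hoping a smarter method can solve.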
asked Jan 20 at 17:59 by Ema95

1 Answer
I'm unaware of an ML approach to problems like this, but the taxize library in R might be able to help; here is the GitHub repo for the library. Specifically, there is the gnr_resolve() function, which might be a good fit for your use case. See the documentation for that function here.
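The underlying idea that gnr_resolve() implements, mapping outdated names to currently accepted ones against reference name databases, can be sketched offline in Python with a hand-built synonym table; the entries below are illustrative placeholders, not an authoritative synonymy:

```python
# Minimal offline sketch of name resolution via a synonym lookup table.
# (gnr_resolve() queries online name databases instead of a local dict.)
synonyms = {
    # outdated checklist name -> currently accepted name (hypothetical entries)
    "Lacerta sicula": "Podarcis siculus",
    "Triturus alpestris": "Ichthyosaura alpestris",
}

def resolve(name):
    """Return the accepted name for a possibly outdated one."""
    return synonyms.get(name, name)

print(resolve("Lacerta sicula"))   # resolves to the accepted name
print(resolve("Canis lupus"))      # unknown names pass through unchanged
```

Resolving names to a common accepted backbone first, then doing an exact merge, tends to be more reliable than fuzzy-matching raw strings, because drastic renames are lookups rather than string-similarity problems.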
Another library that might be of use is taxadb, which says it can resolve taxonomic names to identifiers, though I've never used it. Repo here.
You could also loop through the species, tallying how many missing entries you have, to get an idea of the error rate in your dataset by taxon.
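That missing-entry tally could look something like this in Python; the merged records and taxon group names are hypothetical:

```python
from collections import Counter

# Hypothetical merge result: (checklist name, taxon group, matched Red List name or None).
merged = [
    ("Canis lupus", "Mammalia", "Canis lupus"),
    ("Podarcis sicula", "Reptilia", None),        # no match found
    ("Ursus arctos", "Mammalia", "Ursus arctos"),
    ("Lacerta agilis", "Reptilia", None),
]

# Count unmatched entries per taxon to see where the merge struggles most.
missing = Counter(taxon for _, taxon, match in merged if match is None)
for taxon, n in missing.items():
    print(f"{taxon}: {n} missing")
```

A per-taxon breakdown like this also tells you which groups to prioritize for manual spot checks.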