I need to merge several datasets (including the lists mentioned in the title) in order to estimate the gap in IUCN species assessments in Italy. My main dataset is a 20-year-old checklist of all fauna species present in Italy, and I would like to merge it, by species and subspecies name, with national, European and global Red Lists, as well as policy assessments and priority lists. Because it is this old, my reference checklist is likely to contain many outdated species names, as well as subspecies names that will appear differently in the Red List datasets.
I have tried using merge tools in Excel and fuzzyjoin in RStudio; however, these are not as efficient as I would like them to be, especially for species whose names have changed drastically over the years.
A species list this long (between 60,000 and 80,000 species across the main checklist and the others) cannot realistically be checked manually for merging errors, so I am asking whether there are machine learning methods in Python (or any other method not listed above) that can merge these datasets with a high level of accuracy.
Thank you!
P.S. I have already considered subdividing the datasets into species and subspecies and then merging them separately, at least to reduce the error rate. Do you recommend this option?
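To make the question concrete, here is a minimal sketch of the split-then-match idea in Python, using only `difflib` from the standard library; all species names and the cutoff value are hypothetical placeholders, not my actual data:

```python
import difflib

# Hypothetical checklist and Red List name columns.
checklist = ["Canis lupus italicus", "Podarcis sicula", "Ursus arctos marsicanus"]
red_list = ["Canis lupus", "Podarcis siculus", "Ursus arctos"]

# Separate trinomials (subspecies) from binomials (species).
species = [n for n in checklist if len(n.split()) == 2]
subspecies = [n for n in checklist if len(n.split()) == 3]

# Fuzzy-match species names directly against the Red List names.
for name in species:
    match = difflib.get_close_matches(name, red_list, n=1, cutoff=0.8)
    print(name, "->", match[0] if match else "NO MATCH")

# For subspecies, fall back to the parent binomial when the trinomial is absent.
for name in subspecies:
    parent = " ".join(name.split()[:2])
    match = difflib.get_close_matches(parent, red_list, n=1, cutoff=0.8)
    print(name, "->", match[0] if match else "NO MATCH")
```

This handles small spelling drift (e.g. gender endings), but as noted above it fails on drastic renames, which is the part I am hoping a smarter method can solve.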
asked Jan 20 at 17:59 by Ema95

1 Answer
I'm unaware of an ML approach to problems like this, but the taxize library in R might be able to help; here is the GitHub repo for the library. Specifically, there is the gnr_resolve() function, which might be a good fit for your use case. See the documentation for that function here.
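The underlying idea that gnr_resolve() implements, mapping outdated names to currently accepted ones against reference name databases, can be sketched offline in Python with a hand-built synonym table; the entries below are illustrative placeholders, not an authoritative synonymy:

```python
# Minimal offline sketch of name resolution via a synonym lookup table.
# (gnr_resolve() queries online name databases instead of a local dict.)
synonyms = {
    # outdated checklist name -> currently accepted name (hypothetical entries)
    "Lacerta sicula": "Podarcis siculus",
    "Triturus alpestris": "Ichthyosaura alpestris",
}

def resolve(name):
    """Return the accepted name for a possibly outdated one."""
    return synonyms.get(name, name)

print(resolve("Lacerta sicula"))   # resolves to the accepted name
print(resolve("Canis lupus"))      # unknown names pass through unchanged
```

Resolving names to a common accepted backbone first, then doing an exact merge, tends to be more reliable than fuzzy-matching raw strings, because drastic renames are lookups rather than string-similarity problems.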
Another library that might be of use is taxadb, which says it can resolve taxonomic names to identifiers, though I've never used it. Repo here.
You could also loop through the species, tallying how many missing entries you have, to get an idea of the error rate in your dataset by taxon.
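That missing-entry tally could look something like this in Python; the merged records and taxon group names are hypothetical:

```python
from collections import Counter

# Hypothetical merge result: (checklist name, taxon group, matched Red List name or None).
merged = [
    ("Canis lupus", "Mammalia", "Canis lupus"),
    ("Podarcis sicula", "Reptilia", None),        # no match found
    ("Ursus arctos", "Mammalia", "Ursus arctos"),
    ("Lacerta agilis", "Reptilia", None),
]

# Count unmatched entries per taxon to see where the merge struggles most.
missing = Counter(taxon for _, taxon, match in merged if match is None)
for taxon, n in missing.items():
    print(f"{taxon}: {n} missing")
```

A per-taxon breakdown like this also tells you which groups to prioritize for manual spot checks.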