
How to merge a dataset of 80,000 species with Red List data from several assessments when species names do not coincide - Stack


I need to merge several datasets (including the lists mentioned in the title) in order to estimate the gap in IUCN species assessments in Italy. My main dataset is a 20-year-old checklist of all fauna species present in Italy, and I would like to merge it by species and subspecies name with national, European and global Red Lists, as well as policy assessments and priority lists. Because it is this old, my reference checklist is likely to contain many outdated species names, as well as subspecies names that differ from those in the Red List datasets.

I have tried the merge tools in Excel and fuzzyjoin in RStudio, but these are not as effective as I would like, especially for species whose names have changed drastically over the years.

Such a long species list (between 60,000 and 80,000 species across the main checklist and some others) is too large to check manually for merging errors, so I am asking whether there are machine learning methods in Python (or any other method not listed above) that can merge these datasets with a high level of accuracy.

Thank you!

P.S. I have already considered splitting the datasets into species and subspecies and then merging each part separately, at least to reduce the error rate. Would you recommend this option?

Asked Jan 20 at 17:59 by Ema95

1 Answer

I'm not aware of an ML approach to this kind of problem, but the taxize library in R might be able to help. The library is hosted on GitHub. Specifically, its gnr_resolve() function might be a good fit for your use case; see that function's documentation for details.

Another library that might be of use is taxdb, which says it can be used to resolve taxonomic names to identifiers, although I have never used it myself. It also has a public repo.
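Since the question asks about Python, here is a rough sketch of the general idea of resolving names to identifiers first and merging on the identifiers rather than on the names. It is not the API of taxize or taxdb. The lookup table, the IDs, the helper functions `normalize`, `resolve` and `merge_on_id`, and the species names are all invented for illustration. In practice the table would have to come from the output of a name-resolution tool.

```python
# Hypothetical lookup: every known name (accepted or synonym) -> one stable taxon ID.
# The names and IDs below are made up; a real table would come from a name resolver.
NAME_TO_ID = {
    "exemplum novum": "ID1",   # invented accepted name
    "exemplum vetus": "ID1",   # invented outdated synonym of the same taxon
    "alterum minor": "ID2",
}


def normalize(name):
    """Lowercase and collapse whitespace so trivial differences do not block a match."""
    return " ".join(name.lower().split())


def resolve(name, name_to_id):
    """Return the taxon ID for a name, or None if the name is not in the table."""
    return name_to_id.get(normalize(name))


def merge_on_id(checklist, red_list, name_to_id):
    """Attach a Red List category to each checklist name via the shared taxon ID."""
    red_by_id = {}
    for name, category in red_list:
        taxon_id = resolve(name, name_to_id)
        if taxon_id is not None:
            red_by_id[taxon_id] = category

    merged = []
    for name in checklist:
        taxon_id = resolve(name, name_to_id)
        # Unresolved names get (name, None, None) so they can be reviewed later.
        merged.append((name, taxon_id, red_by_id.get(taxon_id)))
    return merged


checklist = ["Exemplum vetus", "Alterum  minor", "Ignotum dubium"]
red_list = [("Exemplum novum", "VU"), ("Alterum minor", "LC")]

merged = merge_on_id(checklist, red_list, NAME_TO_ID)
for row in merged:
    print(row)
```

The outdated name and the current name end up on the same row because they share an ID, even though the strings have nothing in common that a fuzzy match could rely on. Names that cannot be resolved stay in the output with empty fields instead of being silently dropped.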

You could also loop through the species and report how many have no matching entry, so that you get an idea of the error rate in your dataset by taxon.
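That per-taxon count could look something like the following in Python. This is only a sketch under assumptions: it supposes each checklist row carries a taxonomic group (a class, for example) alongside the name, and that you already have the set of names that found a match after merging. The function `unmatched_by_group` and the sample names are made up.

```python
from collections import Counter


def unmatched_by_group(checklist, matched_names):
    """Return {group: (unmatched count, total count)} for the checklist rows."""
    totals = Counter()
    missing = Counter()
    for group, name in checklist:
        totals[group] += 1
        if name not in matched_names:
            missing[group] += 1
    return {group: (missing[group], totals[group]) for group in totals}


# Invented sample rows: (taxonomic group, species name).
checklist_rows = [
    ("Aves", "Avis prima"),
    ("Aves", "Avis secunda"),
    ("Insecta", "Insectum primum"),
    ("Insecta", "Insectum secundum"),
    ("Insecta", "Insectum tertium"),
]
# Names that found a partner in the Red List data after merging (assumed given).
matched = {"Avis prima", "Insectum primum"}

report = unmatched_by_group(checklist_rows, matched)
for group, (n_missing, n_total) in sorted(report.items()):
    print(f"{group}: {n_missing}/{n_total} unmatched ({n_missing / n_total:.0%})")
```

Groups with a high unmatched share are where outdated names are probably concentrated, so that is where manual checking is likely to pay off most.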
