r - cleaner way to match observations using IDs in different columns

I am using mark-recapture data, with each individual having a PIT tag (Tag) and a genetic sample ID given at the first observation (ID), which is typically used as the ID for the individual (Unified.ID). However, sometimes an animal will lose a PIT tag, and we have to retag it, but if we know its old tag number we record that as Alt_tag.

This sample data is four observations from the same individual, so I am trying to get them all to have the same Unified.ID by matching the Alt_tag in row 4 to the Tag in rows 1:3, so that all 4 rows end up with "CAL163037" in the Unified.ID column.

sample.dat<-tibble(ID=c("CAL163037",NA,NA,NA),
           Tag = c("132800","132800","132800","981930"),
           Date = c("2016-08-23","2017-06-09","2017-06-22","2017-08-23"),
           Alt_tag = c(NA,NA,NA,"132800"),
           Unified.ID = c("CAL163037","CAL163037","CAL163037","981930"))

# A tibble: 4 × 5
  ID        Tag    Date       Alt_tag Unified.ID
  <chr>     <chr>  <chr>      <chr>   <chr>     
1 CAL163037 132800 2016-08-23 NA      CAL163037 
2 NA        132800 2017-06-09 NA      CAL163037 
3 NA        132800 2017-06-22 NA      CAL163037 
4 NA        981930 2017-08-23 132800  981930

I think I sort of have something close with the code below, but I feel like there is a more elegant way to do this, preferably with some kind of dplyr mutate, so it is easier to check that it is doing what I want it to (I can't get this if_else function working inside of mutate()). My full dataset is pretty big, so I want to be confident that it is working properly across the whole thing. Any ideas of how to clean this up?

> if_else(is.na(sample.dat$Alt_tag)==FALSE,(sample.dat[
which(sample.dat$Tag%in%sample.dat$Alt_tag),"Unified.ID"]%>%distinct()),NA)

# A tibble: 4 × 1
# Groups:   Unified.ID [1]
  Unified.ID
  <fct>     
1 CAL163037 
2 CAL163037 
3 CAL163037 
4 CAL163037

sample.dat<-tibble(ID=c("CAL163037",NA,NA,NA),
           Tag = c("132800","132800","132800","981930"),
           Date = c("2016-08-23","2017-06-09","2017-06-22","2017-08-23"),
           Alt_tag = c(NA,NA,NA,"132800"),
           Unified.ID = c("CAL163037","CAL163037","CAL163037","981930"))

# A tibble: 4 × 5
  ID        Tag    Date       Alt_tag Unified.ID
  <chr>     <chr>  <chr>      <chr>   <chr>     
1 CAL163037 132800 2016-08-23 NA      CAL163037 
2 NA        132800 2017-06-09 NA      CAL163037 
3 NA        132800 2017-06-22 NA      CAL163037 
4 NA        981930 2017-08-23 132800  981930

> if_else(is.na(sample.dat$Alt_tag)==FALSE,(sample.dat[
which(sample.dat$Tag%in%sample.dat$Alt_tag),"Unified.ID"]%>%distinct()),NA)

# A tibble: 4 × 1
# Groups:   Unified.ID [1]
  Unified.ID
  <fct>     
1 CAL163037 
2 CAL163037 
3 CAL163037 
4 CAL163037

Share Improve this question asked Feb 17 at 12:22 ghaines 836 bronze badges

1 What is "clean"? Base R? – Friede Commented Feb 17 at 13:19
1 One way to clean up your code is to replace is.na(sample.dat$Alt_tag)==FALSE with !is.na(sample.dat$Alt_tag) – jpsmith Commented Feb 17 at 14:30
@Friede what I mean is that the above solution returns a vector which has to replace the original vector, and it also has a lot of nested functions that make it difficult to read (%in% within which() within an index before distinct() within if_else()), it's very roundabout and inefficient. – ghaines Commented Feb 17 at 15:07

Add a comment |

2 Answers 2

Sorted by: Reset to default 2

Probably you can try this?

sample.dat %>%
    mutate(Unified.Tag = coalesce(Alt_tag, Tag)) %>%
    mutate(Unified.ID = unique(na.omit(ID)), .by = Unified.Tag) %>%
    select(-Unified.Tag)

which gives

# A tibble: 4 × 5
  ID        Tag    Date       Alt_tag Unified.ID
  <chr>     <chr>  <chr>      <chr>   <chr>
1 CAL163037 132800 2016-08-23 NA      CAL163037
2 NA        132800 2017-06-09 NA      CAL163037
3 NA        132800 2017-06-22 NA      CAL163037
4 NA        981930 2017-08-23 132800  CAL163037

Check if the Alt_tag matches a Tag. If it does use the row number as returned by match to index into Unified.ID; otherwise, match will return NA so use the current row_number() to so index. Note that coalesce returns its first non-NA argument.

library(dplyr)

sample.dat %>%
  mutate(Unified.ID = Unified.ID[coalesce(match(Alt_tag, Tag), row_number())])

giving

# A tibble: 4 × 5
  ID        Tag    Date       Alt_tag Unified.ID
  <chr>     <chr>  <chr>      <chr>   <chr>     
1 CAL163037 132800 2016-08-23 <NA>    CAL163037 
2 <NA>      132800 2017-06-09 <NA>    CAL163037 
3 <NA>      132800 2017-06-22 <NA>    CAL163037 
4 <NA>      981930 2017-08-23 132800  CAL163037

Note that with the data in the question it runs quite fast:

library(microbenchmark)\

microbenchmark(
  A = sample.dat %>%
    mutate(Unified.ID = Unified.ID[coalesce(match(Alt_tag, Tag), row_number())]),
  B = sample.dat %>%
    mutate(Unified.Tag = coalesce(Alt_tag, Tag)) %>%
    mutate(Unified.ID = unique(na.omit(ID)), .by = Unified.Tag) %>%
    select(-Unified.Tag)
)

## Unit: milliseconds
##  expr    min      lq     mean  median       uq     max neval cld
##     A 2.9443 3.12035 3.418708 3.22425  3.79555  4.5942   100  a 
##     B 8.8230 9.19820 9.997262 9.48225 10.99610 13.7240   100   b

科技改变生活-雨落星辰 - 所有的伟大,都源于一个勇敢的开始

r - cleaner way to match observations using IDs in different columns - Stack Overflow

2 Answers 2

与本文相关的文章

评论列表(0)