r - merge based on id and dates with missing data

I'm attempting to merge datasets based on drug and dates.

The left (df1) is a long claim dataset where people (id) can appear multiple times with different claim ids, drugs (drug), and dates (claim_date).

The right (df2) is a long dataset of starting dates (price_date) for effective drug prices (price). The price is effective until there is a new date.

I want to keep all records on the left (df1). Some drugs like 40 do not have pricing data in df2. Other drug have claims that occur before the first effective price. These should all result in NA when merged.

The real dataset has a few hundred thousand claims and maybe 40k pricing rows, so my tidyverse approaches were slow and not correct. I don't know data.table well, but I kept getting incorrect matches and unexpected NAs for the left hand side.

library(tidyverse)

# claims
df1 <- tibble(id = c(1, 1, 2, 3, 3, 3, 4),
              claim = c("a", "b", "c", "d", "e", "f", "g"),
              drug = c(10, 10, 20, 30, 31, 32, 40),
              claim_date = ymd("2024-01-01", "2024-02-01",
                               "2024-01-01",
                               "2024-01-01", "2024-02-01", "2024-02-10",
                               "2024-03-13"))

# price
# no price data for drug 4
df2 <- tibble(drug = c(rep(10, 3), rep(20, 3), rep(30, 3),
                       rep(31, 3), rep(32, 3)),
              price = c(1, 2, 3,     #10
                        4, 5, 6,     #20
                        7, 8, 9,     #30
                        10, 11, 12,  #31
                        13, 14, 15   #32
                        ),
              price_date = ymd("2024-01-01", "2024-02-01", "2024-03-01", #10
                               "2024-01-10", "2024-02-11", "2024-03-12", #20
                               "2024-01-20", "2024-02-21", "2024-03-22", #30
                               "2024-01-20", "2024-02-21", "2024-03-22", #31
                               "2024-01-20", "2024-02-21", "2024-03-22") #32
              )

result <- tibble(id = c(1, 1, 2, 3, 3, 3, 4),
                 claim = c("a", "b", "c", "d", "e", "f", "g"),
                 drug = c(10, 10, 20, 30, 31, 32, 40),
                 claim_date = ymd("2024-01-01", "2024-02-01", #10
                                  "2024-01-01", #20
                                  "2024-01-01", #30
                                  "2024-02-01", #31
                                  "2024-02-10", #32
                                  "2024-03-13"  #40
                                  ),
                 price = c(1, 2,     # 10
                           NA,       # 20; claim before first price date
                           NA,       # 30
                           10,       # 31
                           13,       # 32
                           NA        # 40; no drug price info
                           ))

df1
# A tibble: 7 × 4
     id claim  drug claim_date
  <dbl> <chr> <dbl> <date>    
1     1 a        10 2024-01-01
2     1 b        10 2024-02-01
3     2 c        20 2024-01-01
4     3 d        30 2024-01-01
5     3 e        31 2024-02-01
6     3 f        32 2024-02-10
7     4 g        40 2024-03-13

# A tibble: 15 × 3
    drug price price_date
   <dbl> <dbl> <date>    
 1    10     1 2024-01-01
 2    10     2 2024-02-01
 3    10     3 2024-03-01
 4    20     4 2024-01-10
 5    20     5 2024-02-11
 6    20     6 2024-03-12
 7    30     7 2024-01-20
 8    30     8 2024-02-21
 9    30     9 2024-03-22
10    31    10 2024-01-20
11    31    11 2024-02-21
12    31    12 2024-03-22
13    32    13 2024-01-20
14    32    14 2024-02-21
15    32    15 2024-03-22

> result
# A tibble: 7 × 5
     id claim  drug claim_date price
  <dbl> <chr> <dbl> <date>     <dbl>
1     1 a        10 2024-01-01     1
2     1 b        10 2024-02-01     2
3     2 c        20 2024-01-01    NA
4     3 d        30 2024-01-01    NA
5     3 e        31 2024-02-01    10
6     3 f        32 2024-02-10    13
7     4 g        40 2024-03-13    NA

I'm attempting to merge datasets based on drug and dates.

The left (df1) is a long claim dataset where people (id) can appear multiple times with different claim ids, drugs (drug), and dates (claim_date).

The right (df2) is a long dataset of starting dates (price_date) for effective drug prices (price). The price is effective until there is a new date.

library(tidyverse)

# claims
df1 <- tibble(id = c(1, 1, 2, 3, 3, 3, 4),
              claim = c("a", "b", "c", "d", "e", "f", "g"),
              drug = c(10, 10, 20, 30, 31, 32, 40),
              claim_date = ymd("2024-01-01", "2024-02-01",
                               "2024-01-01",
                               "2024-01-01", "2024-02-01", "2024-02-10",
                               "2024-03-13"))

# price
# no price data for drug 4
df2 <- tibble(drug = c(rep(10, 3), rep(20, 3), rep(30, 3),
                       rep(31, 3), rep(32, 3)),
              price = c(1, 2, 3,     #10
                        4, 5, 6,     #20
                        7, 8, 9,     #30
                        10, 11, 12,  #31
                        13, 14, 15   #32
                        ),
              price_date = ymd("2024-01-01", "2024-02-01", "2024-03-01", #10
                               "2024-01-10", "2024-02-11", "2024-03-12", #20
                               "2024-01-20", "2024-02-21", "2024-03-22", #30
                               "2024-01-20", "2024-02-21", "2024-03-22", #31
                               "2024-01-20", "2024-02-21", "2024-03-22") #32
              )

result <- tibble(id = c(1, 1, 2, 3, 3, 3, 4),
                 claim = c("a", "b", "c", "d", "e", "f", "g"),
                 drug = c(10, 10, 20, 30, 31, 32, 40),
                 claim_date = ymd("2024-01-01", "2024-02-01", #10
                                  "2024-01-01", #20
                                  "2024-01-01", #30
                                  "2024-02-01", #31
                                  "2024-02-10", #32
                                  "2024-03-13"  #40
                                  ),
                 price = c(1, 2,     # 10
                           NA,       # 20; claim before first price date
                           NA,       # 30
                           10,       # 31
                           13,       # 32
                           NA        # 40; no drug price info
                           ))

df1
# A tibble: 7 × 4
     id claim  drug claim_date
  <dbl> <chr> <dbl> <date>    
1     1 a        10 2024-01-01
2     1 b        10 2024-02-01
3     2 c        20 2024-01-01
4     3 d        30 2024-01-01
5     3 e        31 2024-02-01
6     3 f        32 2024-02-10
7     4 g        40 2024-03-13

# A tibble: 15 × 3
    drug price price_date
   <dbl> <dbl> <date>    
 1    10     1 2024-01-01
 2    10     2 2024-02-01
 3    10     3 2024-03-01
 4    20     4 2024-01-10
 5    20     5 2024-02-11
 6    20     6 2024-03-12
 7    30     7 2024-01-20
 8    30     8 2024-02-21
 9    30     9 2024-03-22
10    31    10 2024-01-20
11    31    11 2024-02-21
12    31    12 2024-03-22
13    32    13 2024-01-20
14    32    14 2024-02-21
15    32    15 2024-03-22

> result
# A tibble: 7 × 5
     id claim  drug claim_date price
  <dbl> <chr> <dbl> <date>     <dbl>
1     1 a        10 2024-01-01     1
2     1 b        10 2024-02-01     2
3     2 c        20 2024-01-01    NA
4     3 d        30 2024-01-01    NA
5     3 e        31 2024-02-01    10
6     3 f        32 2024-02-10    13
7     4 g        40 2024-03-13    NA

Share Improve this question edited Nov 20, 2024 at 11:58 asked Nov 20, 2024 at 11:10 Eric Green 7,73511 gold badges63 silver badges112 bronze badges

Sorry, do not understand. Would have thought it ought to be close to df1 |> left_join(df2, by = join_by(drug, claim_date >= price_date)) |> select(-price_date), but I do not get the logic of the desired result, i.e. which row ob b should be deleted after the join. – Friede Commented Nov 20, 2024 at 12:16
claim b is for drug 10. the claim date of 2024-02-01 would have a price of 2 since there is an effective date for drug 10 that also starts on 2024-02-01. the old price of 1 expires when there is a new price, so in your result we'd drop row 2 that shows a price of 1 – Eric Green Commented Nov 20, 2024 at 12:24
Claims are unique? – Friede Commented Nov 20, 2024 at 12:46
1 yes, claims are unique – Eric Green Commented Nov 20, 2024 at 12:49

Add a comment |

2 Answers 2

Sorted by: Reset to default 5

Create an end_date for the prices as well as a start date, then you can join on both conditions to find the correct price:

df2 = df2 |> 
  mutate(end_date = lead(price_date, default = ymd("3000-01-01")), .by = drug)

df1 |> left_join(df2, by = join_by(
  drug, 
  claim_date >= price_date, 
  claim_date < end_date
  )) |> 
  select(-price_date, -end_date)
# # A tibble: 7 × 5
#      id claim  drug claim_date price
#   <dbl> <chr> <dbl> <date>     <dbl>
# 1     1 a        10 2024-01-01     1
# 2     1 b        10 2024-02-01     2
# 3     2 c        20 2024-01-01    NA
# 4     3 d        30 2024-01-01    NA
# 5     3 e        31 2024-02-01    10
# 6     3 f        32 2024-02-10    13
# 7     4 g        40 2024-03-13    NA

If df2 is ordered, perhaps you can use a temporary variable i:

library(dplyr)
df1 |> 
  left_join(mutate(df2, i=row_number()), by=join_by(drug, claim_date>=price_date)) |> 
  slice_max(i, n=1, by=claim) |>
  select(-price_date, -i)

giving

# A tibble: 7 × 5
     id claim  drug claim_date price
  <dbl> <chr> <dbl> <date>     <dbl>
1     1 a        10 2024-01-01     1
2     1 b        10 2024-02-01     2
3     2 c        20 2024-01-01    NA
4     3 d        30 2024-01-01    NA
5     3 e        31 2024-02-01    10
6     3 f        32 2024-02-10    13
7     4 g        40 2024-03-13    NA

科技改变生活-雨落星辰 - 所有的伟大,都源于一个勇敢的开始

r - merge based on id and dates with missing data - Stack Overflow

2 Answers 2

与本文相关的文章

评论列表(0)