
r - merge based on id and dates with missing data - Stack Overflow


I'm attempting to merge datasets based on drug and dates.

The left (df1) is a long claim dataset where people (id) can appear multiple times with different claim ids, drugs (drug), and dates (claim_date).

The right (df2) is a long dataset of starting dates (price_date) for effective drug prices (price). The price is effective until there is a new date.

I want to keep all records on the left (df1). Some drugs, like 40, have no pricing data in df2. Other drugs have claims that occur before the first effective price. Both cases should result in NA when merged.

The real dataset has a few hundred thousand claims and maybe 40k pricing rows. My tidyverse approaches were slow and incorrect. I don't know data.table well, and I kept getting incorrect matches and unexpected NAs for the left-hand side.

library(tidyverse)  # tidyverse >= 2.0.0 attaches lubridate, which provides ymd()

# claims
df1 <- tibble(id = c(1, 1, 2, 3, 3, 3, 4),
              claim = c("a", "b", "c", "d", "e", "f", "g"),
              drug = c(10, 10, 20, 30, 31, 32, 40),
              claim_date = ymd("2024-01-01", "2024-02-01",
                               "2024-01-01",
                               "2024-01-01", "2024-02-01", "2024-02-10",
                               "2024-03-13"))

# price
# no price data for drug 40
df2 <- tibble(drug = c(rep(10, 3), rep(20, 3), rep(30, 3),
                       rep(31, 3), rep(32, 3)),
              price = c(1, 2, 3,     #10
                        4, 5, 6,     #20
                        7, 8, 9,     #30
                        10, 11, 12,  #31
                        13, 14, 15   #32
                        ),
              price_date = ymd("2024-01-01", "2024-02-01", "2024-03-01", #10
                               "2024-01-10", "2024-02-11", "2024-03-12", #20
                               "2024-01-20", "2024-02-21", "2024-03-22", #30
                               "2024-01-20", "2024-02-21", "2024-03-22", #31
                               "2024-01-20", "2024-02-21", "2024-03-22") #32
              )

result <- tibble(id = c(1, 1, 2, 3, 3, 3, 4),
                 claim = c("a", "b", "c", "d", "e", "f", "g"),
                 drug = c(10, 10, 20, 30, 31, 32, 40),
                 claim_date = ymd("2024-01-01", "2024-02-01", #10
                                  "2024-01-01", #20
                                  "2024-01-01", #30
                                  "2024-02-01", #31
                                  "2024-02-10", #32
                                  "2024-03-13"  #40
                                  ),
                 price = c(1, 2,     # 10
                           NA,       # 20; claim before first price date
                           NA,       # 30
                           10,       # 31
                           13,       # 32
                           NA        # 40; no drug price info
                           ))
> df1
# A tibble: 7 × 4
     id claim  drug claim_date
  <dbl> <chr> <dbl> <date>    
1     1 a        10 2024-01-01
2     1 b        10 2024-02-01
3     2 c        20 2024-01-01
4     3 d        30 2024-01-01
5     3 e        31 2024-02-01
6     3 f        32 2024-02-10
7     4 g        40 2024-03-13

> df2
# A tibble: 15 × 3
    drug price price_date
   <dbl> <dbl> <date>    
 1    10     1 2024-01-01
 2    10     2 2024-02-01
 3    10     3 2024-03-01
 4    20     4 2024-01-10
 5    20     5 2024-02-11
 6    20     6 2024-03-12
 7    30     7 2024-01-20
 8    30     8 2024-02-21
 9    30     9 2024-03-22
10    31    10 2024-01-20
11    31    11 2024-02-21
12    31    12 2024-03-22
13    32    13 2024-01-20
14    32    14 2024-02-21
15    32    15 2024-03-22

> result
# A tibble: 7 × 5
     id claim  drug claim_date price
  <dbl> <chr> <dbl> <date>     <dbl>
1     1 a        10 2024-01-01     1
2     1 b        10 2024-02-01     2
3     2 c        20 2024-01-01    NA
4     3 d        30 2024-01-01    NA
5     3 e        31 2024-02-01    10
6     3 f        32 2024-02-10    13
7     4 g        40 2024-03-13    NA

asked Nov 20, 2024 at 11:10 by Eric Green (edited Nov 20, 2024 at 11:58)
  • Sorry, I do not understand. I would have thought it ought to be close to df1 |> left_join(df2, by = join_by(drug, claim_date >= price_date)) |> select(-price_date), but I do not get the logic of the desired result, i.e. which row of b should be deleted after the join. – Friede Commented Nov 20, 2024 at 12:16
  • Claim b is for drug 10. The claim date of 2024-02-01 would have a price of 2, since there is an effective date for drug 10 that also starts on 2024-02-01. The old price of 1 expires when there is a new price, so in your result we'd drop row 2, which shows a price of 1. – Eric Green Commented Nov 20, 2024 at 12:24
  • Claims are unique? – Friede Commented Nov 20, 2024 at 12:46
  • Yes, claims are unique. – Eric Green Commented Nov 20, 2024 at 12:49

2 Answers


Create an end_date for the prices as well as a start date, then you can join on both conditions to find the correct price:

df2 = df2 |> 
  mutate(end_date = lead(price_date, default = ymd("3000-01-01")), .by = drug)

df1 |> left_join(df2, by = join_by(
  drug, 
  claim_date >= price_date, 
  claim_date < end_date
  )) |> 
  select(-price_date, -end_date)
# # A tibble: 7 × 5
#      id claim  drug claim_date price
#   <dbl> <chr> <dbl> <date>     <dbl>
# 1     1 a        10 2024-01-01     1
# 2     1 b        10 2024-02-01     2
# 3     2 c        20 2024-01-01    NA
# 4     3 d        30 2024-01-01    NA
# 5     3 e        31 2024-02-01    10
# 6     3 f        32 2024-02-10    13
# 7     4 g        40 2024-03-13    NA
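
A possible variation, not part of the answer above: dplyr 1.1.0 and later provide a closest() helper inside join_by() that keeps only the nearest qualifying row, which would avoid building end_date. The sketch below assumes dplyr >= 1.1.0 and rebuilds the question's data under the illustrative names claims and prices, using base as.Date() instead of ymd().

```r
library(dplyr)

# Copies of df1 and df2 from the question (illustrative names).
claims <- tibble(
  id = c(1, 1, 2, 3, 3, 3, 4),
  claim = c("a", "b", "c", "d", "e", "f", "g"),
  drug = c(10, 10, 20, 30, 31, 32, 40),
  claim_date = as.Date(c("2024-01-01", "2024-02-01", "2024-01-01",
                         "2024-01-01", "2024-02-01", "2024-02-10",
                         "2024-03-13"))
)
prices <- tibble(
  drug = rep(c(10, 20, 30, 31, 32), each = 3),
  price = as.numeric(1:15),
  price_date = as.Date(c("2024-01-01", "2024-02-01", "2024-03-01",  # 10
                         "2024-01-10", "2024-02-11", "2024-03-12",  # 20
                         "2024-01-20", "2024-02-21", "2024-03-22",  # 30
                         "2024-01-20", "2024-02-21", "2024-03-22",  # 31
                         "2024-01-20", "2024-02-21", "2024-03-22")) # 32
)

# closest() keeps, for each claim, only the price row with the latest
# price_date that is still on or before claim_date. left_join() keeps
# claims with no such row and fills price with NA.
out <- claims |>
  left_join(prices, by = join_by(drug, closest(claim_date >= price_date))) |>
  select(-price_date)

out$price
# 1  2 NA NA 10 13 NA
```

Because closest() resolves to at most one price row per claim here, no second filtering step is needed after the join.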

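If df2 is already sorted by price_date within each drug, perhaps you can use a temporary row index i and keep the latest match per claim:

library(dplyr)
df1 |> 
  # tag each price row with its position; relies on df2 being sorted by drug and price_date
  left_join(mutate(df2, i = row_number()), by = join_by(drug, claim_date >= price_date)) |> 
  # of all earlier prices matched to a claim, keep the latest one (largest i)
  slice_max(i, n = 1, by = claim) |>
  select(-price_date, -i)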

giving

# A tibble: 7 × 5
     id claim  drug claim_date price
  <dbl> <chr> <dbl> <date>     <dbl>
1     1 a        10 2024-01-01     1
2     1 b        10 2024-02-01     2
3     2 c        20 2024-01-01    NA
4     3 d        30 2024-01-01    NA
5     3 e        31 2024-02-01    10
6     3 f        32 2024-02-10    13
7     4 g        40 2024-03-13    NA
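
Since the question mentions data.table, a rolling join may also be worth trying on the full data. The following is a rough sketch rather than a benchmarked solution; claims, prices, and the helper column join_date are illustrative names that do not appear in the thread. With roll = TRUE the last price is carried forward, and a claim dated before a drug's first price_date, or for a drug with no price rows, should come back with NA.

```r
library(data.table)

# Copies of df1 and df2 from the question as data.tables (illustrative names).
claims <- data.table(
  id = c(1, 1, 2, 3, 3, 3, 4),
  claim = c("a", "b", "c", "d", "e", "f", "g"),
  drug = c(10, 10, 20, 30, 31, 32, 40),
  claim_date = as.Date(c("2024-01-01", "2024-02-01", "2024-01-01",
                         "2024-01-01", "2024-02-01", "2024-02-10",
                         "2024-03-13"))
)
prices <- data.table(
  drug = rep(c(10, 20, 30, 31, 32), each = 3),
  price = as.numeric(1:15),
  price_date = as.Date(c("2024-01-01", "2024-02-01", "2024-03-01",  # 10
                         "2024-01-10", "2024-02-11", "2024-03-12",  # 20
                         "2024-01-20", "2024-02-21", "2024-03-22",  # 30
                         "2024-01-20", "2024-02-21", "2024-03-22",  # 31
                         "2024-01-20", "2024-02-21", "2024-03-22")) # 32
)

# Join on a copy of each date column so that claim_date and price_date
# both survive the join unchanged (hypothetical helper column).
claims[, join_date := claim_date]
prices[, join_date := price_date]

# prices[claims, ...] returns one row per claim, in claim order.
# roll = TRUE matches each claim to the latest price row on or before it.
matched <- prices[claims, on = .(drug, join_date), roll = TRUE]
result <- matched[, .(id, claim, drug, claim_date, price)]

result$price
# 1  2 NA NA 10 13 NA
```

Joining on the copied join_date column is a deliberate choice: in a data.table join the join column in the output takes its values from the claims side, which can make price_date look wrong if it is used directly as the join key.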