How to detect overlapping date ranges by ID in R using data.table?

I have a dataset in R using data.table where each ID represents a client, and each client has multiple contracts with a start date PERIODSTART and an end date PERIODEND.

I need to detect overlapping periods for the same ID, meaning that if a client has two or more contracts that overlap, I want to extract those cases.

What I need is to calculate the exposure (time as a percentage of a year) for each contract period, grouped by ID and PRODUCTNUMBER. My problem is overlapping dates: I don't want to count the exposure several times for the same period.

library(data.table)

data = data.table(ID = c(rep("customer_1000", 7), rep("customer_78", 4), rep("customer_1047", 4)),
                  PERIODSTART = as.Date(c("1991-02-03", "1991-11-06", "1993-02-03", "1993-03-11", "1996-02-03", "1996-11-28", "1996-11-29",
                                          "2021-02-17", "2021-06-10", "2021-12-03", "2021-06-28",
                                          "2021-02-17", "2021-02-17", "2021-05-02", "2021-06-28")),
                  PERIODEND = as.Date(c("1991-11-06", "1992-02-03", "1993-03-11", "1994-02-03", "1996-11-28", "1996-11-29", "1997-02-03",
                                          "2021-09-30", "2021-11-09", "2021-12-20", "2021-10-14",
                                          "2021-09-30", "2021-08-01", "2021-06-17", "2021-10-14")),
                  PRODUCTNUMBER = c(rep("product_1", 7), 
                                    "product_74", "product_88", "product_76", "product_25",
                                    "product_1", "product_2", "product_3", "product_4")
                  )

data[, year := year(PERIODSTART)]

The calculation I want:

Asked Mar 19 at 18:32 by nimliug, edited Mar 19 at 19:44

Have you looked at existing questions like stackoverflow/questions/58152986/… or stackoverflow/questions/45558642/…? foverlaps or IRanges will probably help. – MrFlick Commented Mar 19 at 18:38
This isn't really helping because what I need is to calculate the exposure (time as a percentage over a year) for each contract period, grouped by ID and PRODUCTNUMBER. I'm not sure how to approach this. – nimliug Commented Mar 19 at 19:03
That's not clear from the question. Can you provide the desired output for the sample input so possible solutions can be tested and verified? – MrFlick Commented Mar 19 at 19:05
you're right, I have edited my post. Hope it's clearer now with the illustrations – nimliug Commented Mar 19 at 19:46
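
For the detection part on its own, the foverlaps suggestion from the comments could look roughly like the sketch below. The table `dt`, the helper column `row`, and the sample dates are made up for illustration; only `foverlaps()`, `setkey()` and the column names PERIODSTART / PERIODEND come from data.table and the question. Intervals are treated as closed, so two contracts that merely touch on the same day are reported as overlapping.

```r
library(data.table)

# Hypothetical sample: customer "a" has two overlapping contracts, "b" has none
dt <- data.table(
  ID          = c("a", "a", "a", "b", "b"),
  PERIODSTART = as.Date(c("2021-02-17", "2021-06-10", "2021-12-03",
                          "2021-01-01", "2021-07-01")),
  PERIODEND   = as.Date(c("2021-09-30", "2021-11-09", "2021-12-20",
                          "2021-03-01", "2021-08-01"))
)
dt[, row := .I]                          # row id, used to drop self-matches

# foverlaps() needs the key to end with the start and end columns
setkey(dt, ID, PERIODSTART, PERIODEND)

# Self overlap join within ID; every row matches itself and each pair appears
# twice, so keep only one orientation of each pair
ov <- foverlaps(dt, dt, type = "any")[row < i.row]
print(ov)
```

Each row of `ov` is one overlapping pair of contracts for the same ID; the columns prefixed with `i.` belong to the second contract of the pair.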

1 Answer

Haven't coded {data.table} in a while. This rusty code might get you started:

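library(data.table)
# Step 1: merge overlapping contracts into continuous spans; a new group
# starts whenever a row (in PERIODSTART order) begins after the previous row ended.
data[order(PERIODSTART), .(start=min(PERIODSTART), stop=max(PERIODEND)),
     by=.(ID, group=cumsum(c(1, tail(PERIODSTART, -1) > head(PERIODEND, -1))))][
       # Step 2: split each merged span at calendar-year boundaries.
       , {
         a = year(start)
         b = year(stop)
         y = seq(a, b)
         .(
           start = fifelse(y==a, start, as.Date(paste0(y, '-01-01'))), 
           stop = fifelse(y==b, stop, as.Date(paste0(y, '-12-31'))), 
           year = y
         )
       }, 
       by=.(ID, group)][
         # Step 3: exposure as a fraction of a 365.25-day year.
         , .(ID, year, start, stop, expi = round(as.integer(stop-start)/365.25, 2))]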

The first step of the chain is a well-known idiom; you will find it in several places on SO.
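
One thing to watch with that idiom as written: the group counter runs over the whole table in PERIODSTART order and compares each start only with the end of the row just before it. A possibly more robust variant, sketched here under my own assumptions (the table `dt`, the IDs "x" / "y" and the names `running_end` and `merged` are illustrative), computes the groups within each ID and compares each start against the latest end seen so far. The dates for "x" mirror customer_1047, where a short contract sits inside a longer one.

```r
library(data.table)

# Hypothetical sample; "x" has a short contract nested inside a longer one
dt <- data.table(
  ID          = c("x", "x", "x", "x", "y"),
  PERIODSTART = as.Date(c("2021-02-17", "2021-02-17", "2021-05-02",
                          "2021-06-28", "2021-06-01")),
  PERIODEND   = as.Date(c("2021-09-30", "2021-08-01", "2021-06-17",
                          "2021-10-14", "2021-06-20"))
)

setorder(dt, ID, PERIODSTART)

dt[, group := {
  s <- as.numeric(PERIODSTART)
  # cummax() is not defined for Date objects, hence the numeric conversion
  running_end <- cummax(as.numeric(PERIODEND))
  # a new group starts when a contract begins after every earlier one has ended
  cumsum(c(TRUE, s[-1] > head(running_end, -1)))
}, by = ID]

# One row per continuous span of coverage
merged <- dt[, .(start = min(PERIODSTART), stop = max(PERIODEND)),
             by = .(ID, group)]
print(merged)
```

Comparing against only the immediately preceding end would split "x" at 2021-06-28, because the row before it ends on 2021-06-17 even though an earlier contract runs until 2021-09-30.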

              ID  year      start       stop  expi
          <char> <int>     <Date>     <Date> <num>
1: customer_1000  1991 1991-02-03 1991-12-31  0.91
2: customer_1000  1992 1992-01-01 1992-02-03  0.09
3: customer_1000  1993 1993-02-03 1993-12-31  0.91
4: customer_1000  1994 1994-01-01 1994-02-03  0.09
5: customer_1000  1996 1996-02-03 1996-12-31  0.91
6: customer_1000  1997 1997-01-01 1997-02-03  0.09
7:   customer_78  2021 2021-02-17 2021-11-09  0.73
8: customer_1047  2021 2021-02-17 2021-10-14  0.65
9:   customer_78  2021 2021-12-03 2021-12-20  0.05

You might want to aggregate expi by year and ID in an additional step, ignoring start and stop. Note: dividing by 365.25 is only an approximation, so you might want to add an accurate leap-year routine.
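
A minimal sketch of both follow-ups, assuming the per-year pieces have the shape of the output above. The table `pieces`, the helper `days_in_year()` and the result name `exposure` are hypothetical; the day count keeps the answer's `stop - start` convention, which does not count the end day.

```r
library(data.table)

# Hypothetical per-year pieces shaped like the output above
pieces <- data.table(
  ID    = c("customer_78", "customer_78", "customer_1000"),
  year  = c(2021L, 2021L, 1996L),
  start = as.Date(c("2021-02-17", "2021-12-03", "1996-02-03")),
  stop  = as.Date(c("2021-11-09", "2021-12-20", "1996-12-31"))
)

# Gregorian leap-year rule: divisible by 4, except centuries not divisible by 400
days_in_year <- function(y) {
  ifelse((y %% 4 == 0 & y %% 100 != 0) | y %% 400 == 0, 366L, 365L)
}

# Exposure of each piece as a fraction of its own calendar year
pieces[, expi := as.integer(stop - start) / days_in_year(year)]

# Total exposure per customer and year; round only after summing
exposure <- pieces[, .(expi = round(sum(expi), 2)), by = .(ID, year)]
print(exposure)
```

Rounding after the sum avoids accumulating rounding error when a customer has several separate spans in the same year.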
