
How to detect overlapping date ranges by ID in R using data.table? - Stack Overflow


I have a dataset in R using data.table where each ID represents a client, and each client has multiple contracts with a start date PERIODSTART and an end date PERIODEND.

I need to detect overlapping periods for the same ID, meaning that if a client has two or more contracts that overlap, I want to extract those cases.

What I need is to calculate the exposure (time as a fraction of a year) for each contract period, grouped by ID and PRODUCTNUMBER. My problem is overlapping dates: I don't want to count the same exposure several times.

data = data.table(ID = c(rep("customer_1000", 7), rep("customer_78", 4), rep("customer_1047", 4)),
                  PERIODSTART = as.Date(c("1991-02-03", "1991-11-06", "1993-02-03", "1993-03-11", "1996-02-03", "1996-11-28", "1996-11-29",
                                          "2021-02-17", "2021-06-10", "2021-12-03", "2021-06-28",
                                          "2021-02-17", "2021-02-17", "2021-05-02", "2021-06-28")),
                  PERIODEND = as.Date(c("1991-11-06", "1992-02-03", "1993-03-11", "1994-02-03", "1996-11-28", "1996-11-29", "1997-02-03",
                                          "2021-09-30", "2021-11-09", "2021-12-20", "2021-10-14",
                                          "2021-09-30", "2021-08-01", "2021-06-17", "2021-10-14")),
                  PRODUCTNUMBER = c(rep("product_1", 7), 
                                    "product_74", "product_88", "product_76", "product_25",
                                    "product_1", "product_2", "product_3", "product_4")
                  )

data[, year := year(PERIODSTART)]

The calculation I want: [illustrations not preserved]







Asked Mar 19 at 18:32 by nimliug (edited Mar 19 at 19:44). 4 comments:
  • Have you looked at existing questions like stackoverflow/questions/58152986/… or stackoverflow/questions/45558642/…? foverlaps or IRanges will probably help. – MrFlick Commented Mar 19 at 18:38
  • This isn't really helping because what I need is to calculate the exposure (time as a percentage over a year) for each contract period, grouped by ID and PRODUCTNUMBER. I'm not sure how to approach this. – nimliug Commented Mar 19 at 19:03
  • That's not clear from the question. Can you provide the desired output for the sample input so possible solutions can be tested and verified? – MrFlick Commented Mar 19 at 19:05
  • You're right, I have edited my post. I hope it's clearer now with the illustrations. – nimliug Commented Mar 19 at 19:46

1 Answer

Haven't coded {data.table} in a while. This rusty code might get you started:

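library(data.table)

# Step 1: merge overlapping contracts into continuous blocks.
# A new block starts when a contract begins after the previous row's end date.
data[order(PERIODSTART), .(start=min(PERIODSTART), stop=max(PERIODEND)),
     by=.(ID, group=cumsum(c(1, tail(PERIODSTART, -1) > head(PERIODEND, -1))))][
       # Step 2: split each block at calendar-year boundaries.
       , {
         a = year(start)
         b = year(stop)
         y = seq(a, b)
         .(
           # first year keeps the block start, later years begin on Jan 1
           start = fifelse(y==a, start, as.Date(paste0(y, '-01-01'))), 
           # last year keeps the block end, earlier years end on Dec 31
           stop = fifelse(y==b, stop, as.Date(paste0(y, '-12-31'))), 
           year = y
         )
       }, 
       by=.(ID, group)][
         # Step 3: exposure as a fraction of a year (365.25 days is approximate).
         , .(ID, year, start, stop, expi = round(as.integer(stop-start)/365.25, 2))]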

The first chain is a well-known idiom; you will find it in several places on SO.

              ID  year      start       stop  expi
          <char> <int>     <Date>     <Date> <num>
1: customer_1000  1991 1991-02-03 1991-12-31  0.91
2: customer_1000  1992 1992-01-01 1992-02-03  0.09
3: customer_1000  1993 1993-02-03 1993-12-31  0.91
4: customer_1000  1994 1994-01-01 1994-02-03  0.09
5: customer_1000  1996 1996-02-03 1996-12-31  0.91
6: customer_1000  1997 1997-01-01 1997-02-03  0.09
7:   customer_78  2021 2021-02-17 2021-11-09  0.73
8: customer_1047  2021 2021-02-17 2021-10-14  0.65
9:   customer_78  2021 2021-12-03 2021-12-20  0.05
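
A possible refinement of the grouping step, offered as a sketch and not as the answer's method: as far as I can tell, the group expression in the first chain is evaluated on the whole ordered table and compares each start only with the previous row's end. A short contract nested inside a longer one, or rows of different clients interleaved by date, could then split or join blocks unexpectedly. The variant below builds the blocks within each ID and compares each start against the running maximum of the end dates. The table toy and its values are made up for illustration.

```r
library(data.table)

# Hypothetical toy data: client c1 has a long contract that contains
# a shorter one, followed by a third contract that starts after the
# second ends but still inside the first.
toy <- data.table(
  ID          = c("c1", "c1", "c1", "c2"),
  PERIODSTART = as.Date(c("2021-01-01", "2021-02-01", "2021-04-01", "2021-03-01")),
  PERIODEND   = as.Date(c("2021-06-30", "2021-03-01", "2021-05-01", "2021-03-15"))
)

setorder(toy, ID, PERIODSTART)

# Within each ID, a new block starts only when a contract begins after the
# latest end date seen so far (running maximum of the end dates).
toy[, group := {
  s <- as.integer(PERIODSTART)
  e <- cummax(as.integer(PERIODEND))
  cumsum(c(TRUE, s[-1L] > e[-.N]))
}, by = ID]

# Collapse each block to a single start/stop pair.
merged <- toy[, .(start = min(PERIODSTART), stop = max(PERIODEND)),
              by = .(ID, group)]
print(merged)
```

In this toy case all three c1 contracts collapse into one block running from 2021-01-01 to 2021-06-30, whereas a comparison against the previous row alone would start a second block at 2021-04-01.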

You might want to aggregate expi by year and ID in an additional step, ignoring start and stop. Note: you might also want to add an accurate leap-year routine, since dividing by 365.25 is only an approximation.
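
One way the aggregation and the leap-year adjustment might look, as a minimal sketch under stated assumptions: days_in_year is a hypothetical helper (not part of data.table), the table pieces is invented data shaped like the output above, and the stop - start day count follows the answer's convention.

```r
library(data.table)

# Hypothetical helper: length of each calendar year in days (365 or 366),
# taken from the distance between consecutive January 1sts.
days_in_year <- function(y) {
  as.integer(as.Date(paste0(y + 1L, "-01-01")) - as.Date(paste0(y, "-01-01")))
}

# Hypothetical per-year pieces with the same columns as the answer's output.
pieces <- data.table(
  ID    = c("a", "a", "b"),
  year  = c(2020L, 2020L, 2021L),
  start = as.Date(c("2020-01-01", "2020-07-01", "2021-01-01")),
  stop  = as.Date(c("2020-03-31", "2020-12-31", "2021-12-31"))
)

# Exposure per piece: days covered divided by the actual length of that year.
# Assumption: same day count as the answer (stop - start, end day not added).
pieces[, expi := as.integer(stop - start) / days_in_year(year)]

# One row per ID and year; rounding happens after summing.
yearly <- pieces[, .(expi = round(sum(expi), 2)), by = .(ID, year)]
print(yearly)
```

Rounding after the sum avoids accumulating rounding error across several pieces in the same year.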
