I want to split a survival dataset at event times. In the simple case of one row of data per person, that means converting the data to counting process form (multiple rows per person), where each person's observation time is split at ALL subjects' event times (up to however long that person's observation time is).
I can do this easily with one row per person, where each person has a recorded observation time and a status of either having had the event (1) or being censored (0).
But I would also like to be able to do this with recurrent events data. In this case, each person potentially has multiple rows of data recording multiple events (the last time may be an event or be censored).
Using survSplit() seems to expand the data by row, not by ID (as I naively thought initially). Is there a way to make the expanded dataset produced by survSplit() split time only within an individual, not within every event experienced by that individual?
Some example code below:
library(survival)
library(dplyr)
dat <- structure(list(ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 3L),
age = c(43L, 43L, 43L, 43L, 43L, 43L, 43L, 41L, 41L, 41L, 41L),
treat = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L),
levels = c("old", "new"), class = "factor"),
time0 = c(0L, 6L, 9L, 56L, 0L, 42L, 87L, 0L, 15L, 17L, 36L),
time1 = c(6L, 9L, 56L, 88L, 42L, 87L, 91L, 15L, 17L, 36L, 112L),
status = c(1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0),
event = c(1L, 2L, 3L, 4L, 1L, 2L, 3L, 1L, 2L, 3L, 4L)),
datalabel = "Chapter 9 Exercises", time.stamp = " 7 Dec 1999 08:53", formats = c("%9.0g", "%9.0g", "%9.0g", "%9.0g", "%9.0g", "%19.0g", "%9.0g"),
types = c(105L, 98L, 98L, 105L, 105L, 98L, 98L),
val.labels = c("", "", "oldnew", "", "", "censor", ""),
var.labels = c("Subject Identification", "Age", "Treatment Assignment", "Time of Last Episode", "Time of Current Episode or censoring", "Indicator for Soreness Episode or censoring", "Soreness Episode Number"),
row.names = c("3", "4", "1", "2", "5", "7", "6", "8", "9", "10", "11"),
version = 6L, label.table = list(oldnew = structure(0:1, names = c("new", "old")),
censor = structure(0:1, names = c("censored", "experienced"))), class = "data.frame")
# Split at event times
event_times <- sort(unique(with(dat, time1[status == 1])))
# Create new df in CP form with splits at every event time
dat2 <- survSplit(Surv(time1, status) ~., dat, cut = event_times)
# This is NOT what I want as it expands by row (event) not ID.
Instead, below is a screenshot of the expanded dataset for the first 3 subjects as I would like to recreate. I have done this manually in Excel.
There does not seem to be a way to do this in survSplit(), unless I have missed something?
2 Answers
You can split the data on ID, apply the survSplit function to each piece, and then combine the results with map_dfr from purrr.
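library(purrr)
library(survival)
# Split each subject's rows separately, then row-bind the pieces.
# The cut points are all distinct time1 values, censoring times included.
map_dfr(split(dat, dat$ID), \(x) {
  survSplit(Surv(time0, time1, status) ~ .,
            data = x,
            cut = unique(dat$time1))
})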
ID age treat event time0 time1 status
1 1 43 new 1 0 6 1
2 1 43 new 2 6 9 1
3 1 43 new 3 9 15 0
4 1 43 new 3 15 17 0
5 1 43 new 3 17 36 0
6 1 43 new 3 36 42 0
7 1 43 new 3 42 56 1
8 1 43 new 4 56 87 0
9 1 43 new 4 87 88 1
10 2 43 new 1 0 6 0
11 2 43 new 1 6 9 0
12 2 43 new 1 9 15 0
13 2 43 new 1 15 17 0
14 2 43 new 1 17 36 0
15 2 43 new 1 36 42 1
16 2 43 new 2 42 56 0
17 2 43 new 2 56 87 1
18 2 43 new 3 87 88 0
19 2 43 new 3 88 91 0
20 3 41 new 1 0 6 0
21 3 41 new 1 6 9 0
22 3 41 new 1 9 15 1
23 3 41 new 2 15 17 1
24 3 41 new 3 17 36 1
25 3 41 new 4 36 42 0
26 3 41 new 4 42 56 0
27 3 41 new 4 56 87 0
28 3 41 new 4 87 88 0
29 3 41 new 4 88 91 0
30 3 41 new 4 91 112 0
Update: the last but one row was missing (row 29); I refined the code:
We write a small function split_single_row() that
- takes a single-row data frame data_row
- receives the global cut points cutpoints, of which only those strictly inside the row's interval (time0, time1) produce a split
- calls survSplit() on just that row
Finally, we apply the row-wise splitting for each subject with group_modify(), where each .x is the subset of rows for one subject (possibly multiple intervals). Using map_dfr(), we split each row and combine the results within that group.
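As a small illustration of the cut-point behaviour described above, the following sketch splits one made-up interval (9, 56] at a list of cut points, some of which fall outside or on the boundary of the interval. The data frame one_row and the cut values are assumptions chosen for illustration, not part of the original data.

```r
library(survival)

# Hypothetical single interval: follow-up from 9 to 56, ending in an event
one_row <- data.frame(ID = 1, time0 = 9, time1 = 56, status = 1)

# Cut points below, on, inside and above the interval
cuts <- c(6, 9, 15, 42, 56, 88)

pieces <- survSplit(Surv(time0, time1, status) ~ ., data = one_row, cut = cuts)

# Only 15 and 42 lie strictly inside (9, 56), so three pieces are expected;
# the event stays attached to the last piece
stopifnot(nrow(pieces) == 3)
stopifnot(all(pieces$status == c(0, 0, 1)))
pieces
```

Because cut points outside the interval have no effect, the same global vector of cut points can be passed to every row without filtering it first.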
library(dplyr)
library(survival)
library(purrr)
# Split one interval (a single-row data frame) at the supplied cut points
split_single_row <- function(data_row, cutpoints) {
  survival::survSplit(
    formula = Surv(time0, time1, status) ~ .,
    data = data_row,
    cut = cutpoints,
    start = "time0",
    end = "time1",
    event = "status"
  )
}
# Group by subject so that each .x holds one subject's rows
dat %>%
  group_by(ID) %>%
  group_modify(~ {
    map_dfr(seq_len(nrow(.x)), function(i) {
      split_single_row(.x[i, ], cutpoints = unique(dat$time1))
    })
  }) %>%
  ungroup() %>%
  as.data.frame()
ID age treat event time0 time1 status
1 1 43 new 1 0 6 1
2 1 43 new 2 6 9 1
3 1 43 new 3 9 15 0
4 1 43 new 3 15 17 0
5 1 43 new 3 17 36 0
6 1 43 new 3 36 42 0
7 1 43 new 3 42 56 1
8 1 43 new 4 56 87 0
9 1 43 new 4 87 88 1
10 2 43 new 1 0 6 0
11 2 43 new 1 6 9 0
12 2 43 new 1 9 15 0
13 2 43 new 1 15 17 0
14 2 43 new 1 17 36 0
15 2 43 new 1 36 42 1
16 2 43 new 2 42 56 0
17 2 43 new 2 56 87 1
18 2 43 new 3 87 88 0
19 2 43 new 3 88 91 0
20 3 41 new 1 0 6 0
21 3 41 new 1 6 9 0
22 3 41 new 1 9 15 1
23 3 41 new 2 15 17 1
24 3 41 new 3 17 36 1
25 3 41 new 4 36 42 0
26 3 41 new 4 42 56 0
27 3 41 new 4 56 87 0
28 3 41 new 4 87 88 0
29 3 41 new 4 88 91 0
30 3 41 new 4 91 112 0
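Since survSplit() handles each (time0, time1] row on its own when the formula has a start-stop Surv() on the left-hand side, it may be possible to skip the per-ID and per-row loops and call it once on the whole data. The sketch below assumes this, uses a trimmed copy of the question's data (dat_small is a name introduced here), and cuts at event times only, as the question asks, so it will not reproduce the split at the censoring time 91 shown above.

```r
library(survival)

# Trimmed copy of the question's data: ID, interval bounds, event indicator
dat_small <- data.frame(
  ID     = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3),
  time0  = c(0, 6, 9, 56, 0, 42, 87, 0, 15, 17, 36),
  time1  = c(6, 9, 56, 88, 42, 87, 91, 15, 17, 36, 112),
  status = c(1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0)
)

# Cut only at observed event times (censoring times are excluded)
event_times <- sort(unique(dat_small$time1[dat_small$status == 1]))

# One call on the full data: every row's interval is split independently
dat_split <- survSplit(Surv(time0, time1, status) ~ .,
                       data = dat_small,
                       cut = event_times)

# Sanity check: splitting must neither create nor lose events
stopifnot(sum(dat_split$status) == sum(dat_small$status))
dat_split
```

The key difference from the call in the question is the Surv(time0, time1, status) left-hand side. With Surv(time1, status), every row is treated as starting at time 0, which is why the expansion looked wrong.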
survSplit() works on single row/person data. There is an event for subject 1 at time1 = 88, so subjects 2 and 3 both have their time split at that point as well. Obviously with a lot more subjects there are potentially a lot more events and consequently times to split at. – LucaS Commented Feb 17 at 0:26
If dat2 is not what you want, then you must have some idea of what the actual desired output would look like. You should be able to construct an example manually, using a toy dataset of two or three individuals, with at least one without any events, at least one with two events, and at least one with only one event. – langtang Commented Feb 17 at 1:37