最新消息:雨落星辰是一个专注网站SEO优化、网站SEO诊断、搜索引擎研究、网络营销推广、网站策划运营及站长类的自媒体原创博客

r - Creating a variable indicating last report of the day for each id but some last reports are after midnight - Stack Overflow

programmeradmin5浏览0评论

I have a dataframe like this:

A tibble: 30 x 4
       id index   day time    
    <dbl> <int> <int> <chr>   
 1 238686     1     1 11:53:33
 2 238686     2     1 17:45:27
 3 238686     3     1 21:12:36
 4 238686     4     2 00:32:36
 5 238686     5     2 11:07:08
 6 238686     6     2 14:43:41
 7 238686     7     2 20:50:29
 8 238686     8     2 23:22:33
 9 238686     9     3 12:05:53
10 238686    10     3 14:48:50

id refers to a participant (each participant provides several reports per day for 10 days), index refers to the consequtive number of report per participant throughout the whole study, day refers to day of study for each participant, and time refers to the time of day of each report.

I'd need a new variable indicating the last report for each day for each participant. I got a very helpful reply from here to do this using

ex_data <- ex_data |>
     mutate(last=as.integer(max(index) == index), .by = c(id, day))

This worked great; however, I hadn't realized that some participants had provided the last report of the day after midnight on some days. I managed to mark those occasions using

ex_data$time2<-as.hms(ex_data$time)

ex_data <- ex_data |>
mutate(nighttime = if_else(time2 >= parse_hms("00:00:00") & time2 < parse_hms("03:00:01"), 1, 0))

But I can't come up with a way to create a variable that would indicate whether a report is "not last report of the day" or "last report of the day" for each participant that would include last reports of the day given before and after midnight.

Here's the data (it's a snippet of the actual data):

ex_data<-
structure(list(id = c(238686, 238686, 238686, 238686, 238686, 
238686, 238686, 238686, 238686, 238686, 238686, 238686, 238686, 
238686, 238686, 238686, 238686, 238686, 238686, 238686, 238686, 
238686, 238686, 238686, 238686, 238686, 238686, 238686, 238686, 
238686, 238686, 238686, 238686, 238686, 238686, 238686, 238686, 
239297, 239297, 239297), index = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 
8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L, 19L, 20L, 
21L, 22L, 23L, 24L, 25L, 26L, 27L, 28L, 29L, 30L, 31L, 32L, 33L, 
34L, 35L, 36L, 37L, 1L, 2L, 3L), day = c(1L, 1L, 1L, 2L, 2L, 
2L, 2L, 2L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 6L, 6L, 6L, 
6L, 7L, 7L, 7L, 7L, 7L, 7L, 8L, 8L, 9L, 9L, 9L, 9L, 9L, 10L, 
10L, 1L, 1L, 1L), time = c("11:53:33", "17:45:27", "21:12:36", 
"00:32:36", "11:07:08", "14:43:41", "20:50:29", "23:22:33", "12:05:53", 
"14:48:50", "21:15:33", "12:09:46", "14:27:06", "18:01:24", "20:56:40", 
"23:17:18", "11:19:02", "17:32:18", "00:05:52", "11:45:11", "18:10:10", 
"20:08:09", "00:30:00", "11:17:36", "14:29:43", "18:12:06", "20:54:48", 
"23:20:32", "11:16:00", "17:26:45", "11:45:13", "14:30:26", "18:31:49", 
"20:31:42", "23:47:41", "14:16:07", "23:55:13", "11:16:34", "12:01:56", 
"14:33:38")), row.names = c(NA, -40L), class = c("tbl_df", "tbl", 
"data.frame"))

And what I'd like is

   id     index   day time       last  
  238686     1     1  11:53:33   0
  238686     2     1  17:45:27   0
  238686     3     1  21:12:36   0
  238686     4     2  00:32:36   1
  238686     5     2  11:07:08   0
  238686     6     2  14:43:41   0
  238686     7     2  20:50:29   0
  238686     8     2  23:22:33   1
  238686     9     3  12:05:53   0

I have a dataframe like this:

A tibble: 30 x 4
       id index   day time    
    <dbl> <int> <int> <chr>   
 1 238686     1     1 11:53:33
 2 238686     2     1 17:45:27
 3 238686     3     1 21:12:36
 4 238686     4     2 00:32:36
 5 238686     5     2 11:07:08
 6 238686     6     2 14:43:41
 7 238686     7     2 20:50:29
 8 238686     8     2 23:22:33
 9 238686     9     3 12:05:53
10 238686    10     3 14:48:50

id refers to a participant (each participant provides several reports per day for 10 days), index refers to the consequtive number of report per participant throughout the whole study, day refers to day of study for each participant, and time refers to the time of day of each report.

I'd need a new variable indicating the last report for each day for each participant. I got a very helpful reply from here to do this using

ex_data <- ex_data |>
     mutate(last=as.integer(max(index) == index), .by = c(id, day))

This worked great; however, I hadn't realized that some participants had provided the last report of the day after midnight on some days. I managed to mark those occasions using

ex_data$time2<-as.hms(ex_data$time)

ex_data <- ex_data |>
mutate(nighttime = if_else(time2 >= parse_hms("00:00:00") & time2 < parse_hms("03:00:01"), 1, 0))

But I can't come up with a way to create a variable that would indicate whether a report is "not last report of the day" or "last report of the day" for each participant that would include last reports of the day given before and after midnight.

Here's the data (it's a snippet of the actual data):

ex_data<-
structure(list(id = c(238686, 238686, 238686, 238686, 238686, 
238686, 238686, 238686, 238686, 238686, 238686, 238686, 238686, 
238686, 238686, 238686, 238686, 238686, 238686, 238686, 238686, 
238686, 238686, 238686, 238686, 238686, 238686, 238686, 238686, 
238686, 238686, 238686, 238686, 238686, 238686, 238686, 238686, 
239297, 239297, 239297), index = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 
8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L, 19L, 20L, 
21L, 22L, 23L, 24L, 25L, 26L, 27L, 28L, 29L, 30L, 31L, 32L, 33L, 
34L, 35L, 36L, 37L, 1L, 2L, 3L), day = c(1L, 1L, 1L, 2L, 2L, 
2L, 2L, 2L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 6L, 6L, 6L, 
6L, 7L, 7L, 7L, 7L, 7L, 7L, 8L, 8L, 9L, 9L, 9L, 9L, 9L, 10L, 
10L, 1L, 1L, 1L), time = c("11:53:33", "17:45:27", "21:12:36", 
"00:32:36", "11:07:08", "14:43:41", "20:50:29", "23:22:33", "12:05:53", 
"14:48:50", "21:15:33", "12:09:46", "14:27:06", "18:01:24", "20:56:40", 
"23:17:18", "11:19:02", "17:32:18", "00:05:52", "11:45:11", "18:10:10", 
"20:08:09", "00:30:00", "11:17:36", "14:29:43", "18:12:06", "20:54:48", 
"23:20:32", "11:16:00", "17:26:45", "11:45:13", "14:30:26", "18:31:49", 
"20:31:42", "23:47:41", "14:16:07", "23:55:13", "11:16:34", "12:01:56", 
"14:33:38")), row.names = c(NA, -40L), class = c("tbl_df", "tbl", 
"data.frame"))

And what I'd like is

   id     index   day time       last  
  238686     1     1  11:53:33   0
  238686     2     1  17:45:27   0
  238686     3     1  21:12:36   0
  238686     4     2  00:32:36   1
  238686     5     2  11:07:08   0
  238686     6     2  14:43:41   0
  238686     7     2  20:50:29   0
  238686     8     2  23:22:33   1
  238686     9     3  12:05:53   0
Share Improve this question asked Mar 26 at 13:58 SointuSointu 2091 silver badge7 bronze badges 5
  • @Friede I think that's too heavy-handed for what OP wants, always assuming the first row of the next day is associated with the previous. But if you look at rows 16 and 17 of ex_data, I think that is not what OP wants. – Gregor Thomas Commented Mar 26 at 16:49
  • Ugh, rows 16 and 17? I just looked at expected output having nine rows.. – Friede Commented Mar 26 at 16:50
  • 2 @Sointu, could you propose a cut-off time at which reports should be assigned to the previous day? Maybe 01:00:00? Maybe 04:00:00? Is that what the 03:00:01 is in your attempt? – Gregor Thomas Commented Mar 26 at 16:50
  • @GregorThomas yes, exactly! Now that I checked there are actually no reports between ~1am and next day's first report, 03:00:00 was put in as the latest time I'd want to include a report. – Sointu Commented Mar 27 at 8:11
  • Sorry for the late reactions, it was late in my time zone when the site went down yesterday. Thank you so much @GregorThomas, your code works perfectly and I'm able to decipher from it how it does it. Thank you again!! – Sointu Commented Mar 27 at 8:22
Add a comment  | 

1 Answer 1

Reset to default 4

So, we need to treat times before some cut-off time as if they were on the previous day. Create a new modified_day column that does this.

day_cutoff = "03:00:01"
ex_data |>
  mutate(day_modified = ifelse(time < day_cutoff, day - 1, day)) |>
  mutate(last = as.integer(row_number() == n()), .by = c(id, day_modified)) |>
  print(n = 20)
# # A tibble: 40 × 6
#        id index   day time     day_modified  last
#     <dbl> <int> <int> <chr>           <dbl> <int>
#  1 238686     1     1 11:53:33            1     0
#  2 238686     2     1 17:45:27            1     0
#  3 238686     3     1 21:12:36            1     0
#  4 238686     4     2 00:32:36            1     1
#  5 238686     5     2 11:07:08            2     0
#  6 238686     6     2 14:43:41            2     0
#  7 238686     7     2 20:50:29            2     0
#  8 238686     8     2 23:22:33            2     1
#  9 238686     9     3 12:05:53            3     0
# 10 238686    10     3 14:48:50            3     0
# 11 238686    11     3 21:15:33            3     1
# 12 238686    12     4 12:09:46            4     0
# 13 238686    13     4 14:27:06            4     0
# 14 238686    14     4 18:01:24            4     0
# 15 238686    15     4 20:56:40            4     0
# 16 238686    16     4 23:17:18            4     1
# 17 238686    17     5 11:19:02            5     0
# 18 238686    18     5 17:32:18            5     0
# 19 238686    19     6 00:05:52            5     1
# 20 238686    20     6 11:45:11            6     0
# # ℹ 20 more rows
# # ℹ Use `print(n = ...)` to see more rows

与本文相关的文章

发布评论

评论列表(0)

  1. 暂无评论