
r - Is there another way to make a time ceiling other than Lubridate? - Stack Overflow


I'm processing data that I need to group into half-hour time periods, which I've been doing by creating a time ceiling so that the timestamps are rounded up to the next half hour. I've been using the lubridate package for this; however, since my dataset has around 200,000 observations, it takes a very long time to run.

I'm wondering if there are any functions or packages that can create a time ceiling without taking so long. I've looked around the internet and every source just says to use Lubridate.

Below is the code that I'm using to create the date ceiling. I'm not sure if providing sample data will be helpful since the trouble comes with the size of the dataset (the code itself runs perfectly well, just slowly).

library(dplyr)      # for %>% and mutate()
library(lubridate)  # for ymd_hms() and ceiling_date()
Twr2CowDist1 <- Twr2CowDist %>%
  mutate(Round = ceiling_date(ymd_hms(DateTime), "30 minutes"))

Edit: I am using rowwise() earlier in the code, so that might be the cause of the slowdown; by "slow", I mean around 15-20 minutes to run these couple of lines of code.

asked Mar 12 at 14:05 by shrimp; edited Mar 13 at 15:16
  • 1 I am not sure lubridate is really the bottleneck, but assuming that it is, you could always get the numeric value of any date in seconds, check only the last 4 digits (an hour is 3600 s, half an hour is 1800 s) and round up from there. This should be simple enough to implement in Rcpp (even for non-C++ coders) for some serious speed gains. This is a comment for a reason, however, so take this idea with a bag of salt (an illustrative sketch appears after these comments). – D.J Commented Mar 12 at 14:11
  • 1 You still might want to create some sample data seq(c(ISOdate(2000,3,20,1)), by = "15 min", length.out = 100) – Tim G Commented Mar 12 at 14:17
  • 2 Strange, if I run ceiling_date(seq(c(ISOdate(2000,3,20,1)), by = "15 min", length.out = 200000), "30 minutes") it takes around 30 ms, which I don't think is a lot. G. Grothendieck's runs in 1 ms, but both are still considered fast if you'd ask me – Tim G Commented Mar 12 at 14:29
  • 3 No, G.G's answer proved nothing about bottlenecks, just that there is something faster (by microseconds!). I ran lubridate::ceiling_date on over 500K observations, it took 0.050 seconds. Calling it a bottleneck may be statistically correct IFF it is the slowest part of the process, but I have a hard time supporting that anybody should divert a lot of attention to trying to speed that up. If that execution time is a big problem, then I suggest R is not the language of choice for whatever realtime processing one appears to need. – r2evans Commented Mar 12 at 16:22
  • 4 For performance optimization questions, it's helpful to give context about what level of performance is the goal, given that changes might entail tradeoffs in readability, maintainability, consistency, etc. "it takes a very long time to run" might mean "1 hour" (in which case you have a different problem) or "200 microseconds" (in which case optimization might be worthwhile). I wonder if Twr2CowDist might be a grouped dataframe (or even have rowwise() applied), which could be a reason this code could take a perceptibly long time to run. – Jon Spring Commented Mar 12 at 17:15
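For reference, here is a minimal, illustrative sketch of the idea from D.J's comment above, using Rcpp::cppFunction; the helper name ceiling_half_hour is made up, and no timing claim is attached to it:

library(Rcpp)

cppFunction('
NumericVector ceiling_half_hour(NumericVector secs) {
  // round each epoch-seconds value up to the next 1800-second (half-hour) boundary
  int n = secs.size();
  NumericVector out(n);
  for (int i = 0; i < n; ++i) out[i] = std::ceil(secs[i] / 1800.0) * 1800.0;
  return out;
}')

x <- as.POSIXct("2025-03-12 10:15:57")
.POSIXct(ceiling_half_hour(as.numeric(x)))   # rounds up to the 10:30:00 boundary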

2 Answers


Convert the datetime to numeric, divide by 60 * 30 seconds, take the ceiling, multiply by 60 * 30 again, and convert back to POSIXct.

x <- as.POSIXct("2025-03-12 10:15:57")  # test data
.POSIXct(60 * 30 * ceiling(as.numeric(x) / (60 * 30)))
## [1] "2025-03-12 10:30:00 EDT"

In this test it runs about 9x faster than lubridate.

library(lubridate)
library(microbenchmark)
x <- as.POSIXct("2025-03-12 10:15:57")  # test data
microbenchmark(
  R = .POSIXct(60 * 30 * ceiling(as.numeric(x) / (60 * 30))),
  lub = ceiling_date(x, "30 minutes")
)
## Unit: microseconds
##  expr     min      lq      mean  median       uq     max neval cld
##     R  10.702  11.601  15.81799  16.951  17.7010  51.201   100  a 
##   lub 147.400 149.201 161.77997 150.001 152.0015 650.701   100   b
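
Applied to the pipeline from the question, the same idea could look like the sketch below; this is only illustrative, and note that the ymd_hms() parsing itself is still a share of the cost:

library(dplyr)
library(lubridate)

Twr2CowDist1 <- Twr2CowDist %>%
  mutate(Round = .POSIXct(1800 * ceiling(as.numeric(ymd_hms(DateTime)) / 1800),
                          tz = "UTC"))   # 30 minutes = 1800 s; ymd_hms() parses to UTC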

Unless your performance expectations are high, I suspect something else is going on here, like a grouped data frame.

Here is an example where 200,000 rows of text-formatted datetimes are parsed and rounded up to the half hour in approximately 0.4 seconds on my computer. If that is the performance you are getting and it's not fast enough, then optimize. I suspect you were seeing something different when you wrote "very long time to run", though. One reason this might be occurring is that you have a grouped data frame.

library(dplyr)
library(lubridate)

set.seed(42)
Twr2CowDist <- data.frame(DateTime = (as.POSIXct("2025-01-01") + runif(2E5,0,1E8)) |> 
                            format("%Y %b %d %H:%M:%S"))
Twr2CowDist |>
    mutate(DateTime2 = ymd_hms(DateTime) |> ceiling_date("30 minutes"))

Your code runs decently fast in my example when the data is not grouped, but it gets slower and slower the more groups there are. Here I take the 200,000 rows and divide them into 100 groups and 20,000 groups, respectively. That last version takes 74 seconds, almost 200x slower than the ungrouped version. If you have used rowwise(), it would be substantially worse than that.

Since grouping is not relevant for this step, I would add an ungroup() |> line before your mutate line and add the grouping back afterwards, if your data is indeed grouped.
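
A minimal sketch of that reordering (the grouping column CowID is purely illustrative; substitute whatever your data is actually grouped by):

library(dplyr)
library(lubridate)

Twr2CowDist1 <- Twr2CowDist |>
  ungroup() |>                                                    # drop grouping / rowwise() for this step
  mutate(Round = ceiling_date(ymd_hms(DateTime), "30 minutes")) |>
  group_by(CowID)                                                 # restore the original grouping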

You might also consider running the same calculations with dtplyr, which will let you run the same code (and most dplyr operations) using data.table as the backend. This can be up to 12,000x faster, as in the "20k groups" examples below. (In this case the grouping doesn't actually do anything, so it is a hindrance with no benefit; I'm just demonstrating that data.table is better at avoiding the performance hit it causes.)

microbenchmark::microbenchmark(
  orig = Twr2CowDist |>
    mutate(DateTime2 = ymd_hms(DateTime) |> ceiling_date("30 minutes")),
  groups_100 = Twr2CowDist |>
    group_by(grp = (row_number() - 1) %/% 2000) |>    # 100 groups
    mutate(DateTime2 = ymd_hms(DateTime) |> ceiling_date("30 minutes")), 
  groups_20k = Twr2CowDist |>
    group_by(grp = (row_number() - 1) %/% 10) |>      # 20k groups
    mutate(DateTime2 = ymd_hms(DateTime) |> ceiling_date("30 minutes")),

  orig_dt = Twr2CowDist |>
    dtplyr::lazy_dt() |>
    mutate(DateTime2 = ymd_hms(DateTime) |> ceiling_date("30 minutes")),
  groups_100_dt = Twr2CowDist |>
    dtplyr::lazy_dt() |>
    group_by(grp = (row_number() - 1) %/% 2000) |>    # 100 groups
    mutate(DateTime2 = ymd_hms(DateTime) |> ceiling_date("30 minutes")), 
  groups_20k_dt = Twr2CowDist |>
    dtplyr::lazy_dt() |>
    group_by(grp = (row_number() - 1) %/% 10) |>      # 20k groups
    mutate(DateTime2 = ymd_hms(DateTime) |> ceiling_date("30 minutes")),
  times = 5)

Unit: milliseconds
          expr          min           lq         mean       median           uq          max neval cld
          orig   397.533321   430.853033   440.502552   439.294297   439.370909   495.461201     5  a 
    groups_100  2326.377332  2405.361442  2532.349297  2416.150142  2680.580061  2833.277509     5  a 
    groups_20k 70034.137024 72161.511090 74279.618267 73176.393597 76362.165271 79663.884354     5   b
       orig_dt     2.489188     2.600704     3.982829     2.814318     3.414945     8.594992     5  a 
 groups_100_dt     3.741155     3.874049     7.827267     4.151648     6.864427    20.505058     5  a 
 groups_20k_dt     4.132508     4.470350     5.935595     5.016446     7.623643     8.435029     5  a
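
Note that a dtplyr pipeline is lazy: to get a regular data frame back you collect the result at the end, e.g. with as_tibble(). A minimal sketch of how that might look on your data (not part of the benchmark above):

library(dplyr)
library(dtplyr)
library(lubridate)

Twr2CowDist1 <- Twr2CowDist |>
  lazy_dt() |>                                                      # translate dplyr verbs to data.table
  mutate(Round = ymd_hms(DateTime) |> ceiling_date("30 minutes")) |>
  as_tibble()                                                       # materialize the result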