I'm processing data that I need to group into half-hour time periods, which I've been doing by creating a time ceiling so that timestamps are rounded up to the next half hour. I've been using the lubridate package for this; however, since my dataset has around 200,000 observations, it takes a very long time to run.
I'm wondering if there are any functions or packages that can create a time ceiling without taking so long. I've looked around the internet and every source just says to use lubridate.
Below is the code that I'm using to create the date ceiling. I'm not sure if providing sample data will be helpful, since the trouble comes from the size of the dataset (the code itself runs perfectly well, just slowly).
library(dplyr)
library(lubridate)

Twr2CowDist1 <- Twr2CowDist %>%
  mutate(Round = ceiling_date(ymd_hms(DateTime), "30 minutes"))
Edit: I am using rowwise() earlier in the code, so that might be the cause of the slowdown. By "slow", I mean around 15-20 minutes to run these couple of lines of code.
Convert the datetime to numeric, divide by 60 * 30 seconds, take the ceiling, multiply by 60 * 30 again, and convert back to POSIXct.
x <- as.POSIXct("2025-03-12 10:15:57") # test data
.POSIXct(60 * 30 * ceiling(as.numeric(x) / (60 * 30)))
## [1] "2025-03-12 10:30:00 EDT"
In the benchmark below, the base-R version runs about 9x faster than lubridate.
library(microbenchmark)

x <- as.POSIXct("2025-03-12 10:15:57")  # test data
microbenchmark(
  R = .POSIXct(60 * 30 * ceiling(as.numeric(x) / (60 * 30))),
  lub = ceiling_date(x, "30 minutes")
)
## Unit: microseconds
## expr min lq mean median uq max neval cld
## R 10.702 11.601 15.81799 16.951 17.7010 51.201 100 a
## lub 147.400 149.201 161.77997 150.001 152.0015 650.701 100 b
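As a quick sanity check that the base-R arithmetic and ceiling_date() agree, you can compare them on some made-up timestamps; the == comparison looks only at the underlying seconds, so the time-zone attribute (the one thing the two results may differ in) doesn't matter here.

y <- as.POSIXct("2025-03-12 00:00:00") + runif(1000, 0, 1e6)   # made-up timestamps
all(.POSIXct(60 * 30 * ceiling(as.numeric(y) / (60 * 30))) == ceiling_date(y, "30 minutes"))
## expected: TRUE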
Unless your performance expectations are very high, I suspect something else is going on here, like a grouped data frame.
Here is an example where 200,000 rows of text-formatted datetimes are parsed and ceiling'd in approximately 0.4 seconds on my computer. If that is the performance you are getting and it's not fast enough, then optimize further. But I suspect you were seeing something different when you wrote "very long time to run", and one likely reason is a grouped data frame.
set.seed(42)
Twr2CowDist <- data.frame(
  DateTime = (as.POSIXct("2025-01-01") + runif(2E5, 0, 1E8)) |>
    format("%Y %b %d %H:%M:%S")
)

Twr2CowDist |>
  mutate(DateTime2 = ymd_hms(DateTime) |> ceiling_date("30 minutes"))
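If you want to see where that 0.4 seconds goes, you can time the parsing and the ceiling separately with system.time(). This is just a rough check; on the figures quoted in this thread (around 30 ms for the ceiling alone on 200,000 values), most of the elapsed time belongs to parsing the text.

system.time(parsed <- ymd_hms(Twr2CowDist$DateTime))        # parsing the text
system.time(rounded <- ceiling_date(parsed, "30 minutes"))  # the ceiling itself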
Your code runs decently fast in my example when the data is not grouped, but it gets slower and slower the more groups there are. Here I take the 200,000 rows and divide them into 100 groups and 20,000 groups, respectively. That last version takes 74 seconds, almost 200x slower than the ungrouped version, and if you have used rowwise(), it would be substantially worse than that.
Since grouping is not relevant for this step, I would add an ungroup() |> line before your mutate line and add the grouping back afterwards, if your data is indeed grouped. A sketch of this follows.
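In the sketch below, the diagnostics check whether the frame is grouped or rowwise, and grp stands in for whatever grouping column you actually use.

# is the frame grouped or rowwise?
dplyr::is_grouped_df(Twr2CowDist)
inherits(Twr2CowDist, "rowwise_df")

# if so, drop the grouping for this step and re-apply it afterwards
Twr2CowDist1 <- Twr2CowDist %>%
  ungroup() %>%
  mutate(Round = ceiling_date(ymd_hms(DateTime), "30 minutes")) %>%
  group_by(grp)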
You might also consider running the same calculations using dtplyr, which lets you run the same code (and most dplyr operations) with data.table as the backend. This can be up to 12,000x faster, as in the "20k groups" examples below. (In this case the grouping doesn't actually do anything, so it is a hindrance with no benefit; I'm just demonstrating that data.table is better at avoiding a performance hit from it.)
microbenchmark::microbenchmark(
  orig = Twr2CowDist |>
    mutate(DateTime2 = ymd_hms(DateTime) |> ceiling_date("30 minutes")),
  groups_100 = Twr2CowDist |>
    group_by(grp = (row_number() - 1) %/% 2000) |>   # 100 groups
    mutate(DateTime2 = ymd_hms(DateTime) |> ceiling_date("30 minutes")),
  groups_20k = Twr2CowDist |>
    group_by(grp = (row_number() - 1) %/% 10) |>     # 20k groups
    mutate(DateTime2 = ymd_hms(DateTime) |> ceiling_date("30 minutes")),
  orig_dt = Twr2CowDist |>
    dtplyr::lazy_dt() |>
    mutate(DateTime2 = ymd_hms(DateTime) |> ceiling_date("30 minutes")),
  groups_100_dt = Twr2CowDist |>
    dtplyr::lazy_dt() |>
    group_by(grp = (row_number() - 1) %/% 2000) |>   # 100 groups
    mutate(DateTime2 = ymd_hms(DateTime) |> ceiling_date("30 minutes")),
  groups_20k_dt = Twr2CowDist |>
    dtplyr::lazy_dt() |>
    group_by(grp = (row_number() - 1) %/% 10) |>     # 20k groups
    mutate(DateTime2 = ymd_hms(DateTime) |> ceiling_date("30 minutes")),
  times = 5)
Unit: milliseconds
expr min lq mean median uq max neval cld
orig 397.533321 430.853033 440.502552 439.294297 439.370909 495.461201 5 a
groups_100 2326.377332 2405.361442 2532.349297 2416.150142 2680.580061 2833.277509 5 a
groups_20k 70034.137024 72161.511090 74279.618267 73176.393597 76362.165271 79663.884354 5 b
orig_dt 2.489188 2.600704 3.982829 2.814318 3.414945 8.594992 5 a
groups_100_dt 3.741155 3.874049 7.827267 4.151648 6.864427 20.505058 5 a
groups_20k_dt 4.132508 4.470350 5.935595 5.016446 7.623643 8.435029 5 a
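For actual use rather than benchmarking, keep in mind that lazy_dt() defers computation until you collect the result, so a complete dtplyr version would end with a collection step; as_tibble() is one of several ways to do that (as.data.table() or collect() also work). A sketch:

library(dtplyr)

Twr2CowDist1 <- Twr2CowDist |>
  lazy_dt() |>                          # translate the pipeline to data.table
  mutate(DateTime2 = ymd_hms(DateTime) |> ceiling_date("30 minutes")) |>
  as_tibble()                           # collect: this is where the work happens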
Comments:
– D.J (Mar 12 at 14:11): … Rcpp (even for non-Cpp coders) for some serious speed gains. This is a comment for a reason, however, so take this idea with a bag of salt.
– Tim G (Mar 12 at 14:17): seq(c(ISOdate(2000,3,20,1)), by = "15 min", length.out = 100)
– Tim G (Mar 12 at 14:29): ceiling_date(seq(c(ISOdate(2000,3,20,1)), by = "15 min", length.out = 200000), "30 minutes") takes around 30 ms, which I don't think is a lot. G. Grothendieck's runs in 1 ms, but both are still considered fast if you'd ask me.
– r2evans (Mar 12 at 16:22): … lubridate::ceiling_date on over 500K observations took 0.050 seconds. Calling it a bottleneck may be statistically correct IFF it is the slowest part of the process, but I have a hard time supporting that anybody should divert a lot of attention to trying to speed that up. If that execution time is a big problem, then I suggest R is not the language of choice for whatever realtime processing one appears to need.
– Jon Spring (Mar 12 at 17:15): Twr2CowDist might be a grouped dataframe (or even have rowwise() applied), which could be a reason this code could take a perceptibly long time to run.