I'm working with a very large dataset containing CWD (Cumulative Water Deficit) and EVI (Enhanced Vegetation Index) measurements across different landcover types. The current code uses LOESS regression to model the relationship between these variables, but it's extremely slow: it has been running for more than 5 days and has still not completed.
Here's a snippet of my current approach:
Loess_model <- tryCatch({
  loess(EVI ~ cwd, data = filtered_data, span = 0.5)
}, error = function(e) {
  message("LOESS fitting failed for landcover: ", landcover_val,
          " rp_group: ", rp_group_val)
  message("Error: ", conditionMessage(e))
  NULL
})
I'm processing multiple landcover-return period groups in parallel (using the future package), but even with parallelization, the computational time is prohibitive. Some of my datasets contain over 1 000 000 observations for a single group.
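For reference, my per-group loop looks roughly like this (a simplified sketch: all_data and the exact grouping columns here are stand-ins for my real data):

library(data.table)
library(future)
library(future.apply)
plan(multisession)

# Split the data.table into landcover x return-period groups,
# then fit each group in parallel.
groups <- split(all_data, by = c("landcover", "rp_group"))
fits <- future_lapply(groups, function(filtered_data) {
  tryCatch(
    loess(EVI ~ cwd, data = filtered_data, span = 0.5),
    error = function(e) NULL
  )
})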
I've already:
- Converted to data.table for faster data handling
- Implemented parallel processing with the future package
- Explored different span values
What alternatives would you recommend for:
- A faster LOESS implementation in R
- Alternative smoothing/regression approaches that scale better with large datasets
My main goal, as shown in this example, is to identify thresholds across different landcover types and drought return periods, so I need a smoothing approach that can capture the non-linear relationship effectively.
Has anyone tackled a similar problem or can recommend alternatives to standard LOESS that would be more computationally efficient?
asked Mar 16 at 14:43 by Shunrei

Answer:
Use lowess, built in to R, or Hmisc::movStats. movStats uses data.table to efficiently compute smooth estimates using moving overlapping windows of x. Here is an example, with timings.
Side comment: Smoothing is better at showing that thresholds don't exist than it is for finding useful thresholds, since to be useful, relationships need to be flat on both sides of the threshold, which doesn't occur in nature very often.
require(Hmisc)
require(ggplot2)
set.seed(1)
n <- 1000000
x <- runif(n)
y <- x ^ 2 + runif(n)
system.time(f <- lowess(x, y)) # 1.3s
system.time(m <- movStats(y ~ x, melt=TRUE)) # 0.4s
ggplot(m, aes(x=x, y=y, color=Statistic)) + geom_line()
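If you do want to scan the smooth for candidate thresholds despite the caveat above, one rough, illustrative check (a sketch, not part of the timing comparison) is the numerical slope of the lowess fit f computed above:

# Numerical derivative of the smooth; a usable threshold would show
# near-zero slope on both sides of a change point.
slope <- diff(f$y) / diff(f$x)
plot(f$x[-1], slope, type = "l", xlab = "x", ylab = "dy/dx")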
Comments:
- The ggplot2::geom_smooth documentation says: "stats::loess() is used for less than 1,000 observations; otherwise mgcv::gam() is used with formula = y ~ s(x, bs = "cs") with method = "REML". Somewhat anecdotally, loess gives a better appearance, but is O(N^2) in memory, so does not work for larger datasets." So on that basis I would give gam a try. – Andrew Gustar, Mar 16 at 14:50
- Loess timings extrapolated from the cars data suggest 1e6 observations would take 375.5 seconds, based on x=1e6; 2.1 + 2.54e-05*x^1 + 3.48e-10*x^2, using OPENBLAS-OPENMP. Check which BLAS you're using. Provide sessionInfo() as a first step. – jay.sf, Mar 16 at 17:12
- Even though libopenblasp has parallel capabilities, it isn't necessarily using the parallel OPENMP variant but probably something like single-threaded SERIAL. Switch to OPENMP. Assuming you're running Linux, you can do that using FlexiBLAS very easily. – jay.sf, Mar 16 at 18:17
- See the RhpcBLASctl package to query/set OpenMP/BLAS threading ... – Ben Bolker, Mar 16 at 18:25
- Use mgcv::bam. You can fix the spline df if you don't want to use the penalized regression implemented in mgcv. – Roland, Mar 17 at 7:56
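Following up on the gam/bam suggestions above, a minimal sketch reusing x and y from the answer's example (bs = "cs" follows the geom_smooth() default quoted by Andrew Gustar; method = "fREML" and discrete = TRUE are bam()'s fast-fitting options, my choice rather than anything specified in the comments):

library(mgcv)
# Penalized regression spline, roughly what geom_smooth() fits for large n:
system.time(g <- bam(y ~ s(x, bs = "cs"), method = "fREML", discrete = TRUE))
# Fixed spline df (unpenalized), per Roland's comment; k and bs are illustrative:
g_fixed <- bam(y ~ s(x, bs = "cr", k = 20, fx = TRUE), discrete = TRUE)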
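And a sketch of the BLAS/threading checks from jay.sf's and Ben Bolker's comments (function names as documented in RhpcBLASctl; the thread count is illustrative):

library(RhpcBLASctl)
sessionInfo()             # shows which BLAS/LAPACK R is linked against
blas_get_num_procs()      # threads currently available to the BLAS
omp_get_num_procs()       # processors visible to OpenMP
blas_set_num_threads(8)   # e.g. allow the BLAS to use 8 threads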