I'm working with a very large dataset containing CWD (Cumulative Water Deficit) and EVI (Enhanced Vegetation Index) measurements across different landcover types. The current code uses LOESS regression to model the relationship between these variables, but it's extremely slow: it has been running for more than 5 days and has still not completed.
Here's a snippet of my current approach:
Loess_model <- tryCatch({
  loess(EVI ~ cwd, data = filtered_data, span = 0.5)
}, error = function(e) {
  message("LOESS fitting failed for landcover: ", landcover_val,
          " rp_group: ", rp_group_val)
  message("Error: ", conditionMessage(e))
  NULL
})
I'm processing multiple landcover-return period groups in parallel (using the future package), but even with parallelization, the computational time is prohibitive. Some of my datasets contain over 1 000 000 observations for a single group.
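For reference, my per-group loop looks roughly like this (a simplified sketch: all_data and the exact grouping columns here are stand-ins for my real data):

library(data.table)
library(future)
library(future.apply)
plan(multisession)

# Split the data.table into landcover x return-period groups,
# then fit each group in parallel.
groups <- split(all_data, by = c("landcover", "rp_group"))
fits <- future_lapply(groups, function(filtered_data) {
  tryCatch(
    loess(EVI ~ cwd, data = filtered_data, span = 0.5),
    error = function(e) NULL
  )
})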
I've already:
- Converted to data.table for faster data handling
- Implemented parallel processing with the future package
- Explored different span values
What alternatives would you recommend for:
- A faster LOESS implementation in R
- Alternative smoothing/regression approaches that scale better with large datasets
My main goal, as shown in this example, is to identify thresholds across different landcover types and drought return periods, so I need a smoothing approach that can capture the non-linear relationship effectively.
Has anyone tackled a similar problem or can recommend alternatives to standard LOESS that would be more computationally efficient?
asked Mar 16 at 14:43 by Shunrei

Answer:
Use lowess, built in to R, or Hmisc::movStats. movStats uses data.table to efficiently compute smooth estimates using moving overlapping windows of x. Here is an example, with timings.
Side comment: Smoothing is better at showing that thresholds don't exist than it is for finding useful thresholds, since to be useful, relationships need to be flat on both sides of the threshold, which doesn't occur in nature very often.
require(Hmisc)
require(ggplot2)
set.seed(1)
n <- 1000000
x <- runif(n)
y <- x ^ 2 + runif(n)
system.time(f <- lowess(x, y)) # 1.3s
system.time(m <- movStats(y ~ x, melt=TRUE)) # 0.4s
ggplot(m, aes(x=x, y=y, color=Statistic)) + geom_line()
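If you do want to scan the smooth for candidate thresholds despite the caveat above, one rough, illustrative check (a sketch, not part of the timing comparison) is the numerical slope of the lowess fit f computed above:

# Numerical derivative of the smooth; a usable threshold would show
# near-zero slope on both sides of a change point.
slope <- diff(f$y) / diff(f$x)
plot(f$x[-1], slope, type = "l", xlab = "x", ylab = "dy/dx")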
Comments:
- The ggplot2::geom_smooth documentation says: "stats::loess() is used for less than 1,000 observations; otherwise mgcv::gam() is used with formula = y ~ s(x, bs = "cs") with method = "REML". Somewhat anecdotally, loess gives a better appearance, but is O(N^2) in memory, so does not work for larger datasets." So on that basis I would give gam a try. – Andrew Gustar, Mar 16 at 14:50
- Loess timings extrapolated from the cars data suggest 1e6 observations would take 375.5 seconds, based on x=1e6; 2.1 + 2.54e-05*x^1 + 3.48e-10*x^2, using OPENBLAS-OPENMP. Check which BLAS you're using. Provide sessionInfo() as a first step. – jay.sf, Mar 16 at 17:12
- Even though libopenblasp has parallel capabilities, it isn't necessarily using the parallel OPENMP variant but probably something like single-threaded SERIAL. Switch to OPENMP. Assuming you're running Linux, you can do that using FlexiBLAS very easily. – jay.sf, Mar 16 at 18:17
- See the RhpcBLASctl package to query/set OpenMP/BLAS threading ... – Ben Bolker, Mar 16 at 18:25
- Use mgcv::bam. You can fix the spline df if you don't want to use the penalized regression implemented in mgcv. – Roland, Mar 17 at 7:56
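Following up on the gam/bam suggestions above, a minimal sketch reusing x and y from the answer's example (bs = "cs" follows the geom_smooth() default quoted by Andrew Gustar; method = "fREML" and discrete = TRUE are bam()'s fast-fitting options, my choice rather than anything specified in the comments):

library(mgcv)
# Penalized regression spline, roughly what geom_smooth() fits for large n:
system.time(g <- bam(y ~ s(x, bs = "cs"), method = "fREML", discrete = TRUE))
# Fixed spline df (unpenalized), per Roland's comment; k and bs are illustrative:
g_fixed <- bam(y ~ s(x, bs = "cr", k = 20, fx = TRUE), discrete = TRUE)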
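And a sketch of the BLAS/threading checks from jay.sf's and Ben Bolker's comments (function names as documented in RhpcBLASctl; the thread count is illustrative):

library(RhpcBLASctl)
sessionInfo()             # shows which BLAS/LAPACK R is linked against
blas_get_num_procs()      # threads currently available to the BLAS
omp_get_num_procs()       # processors visible to OpenMP
blas_set_num_threads(8)   # e.g. allow the BLAS to use 8 threads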