r - LOESS on very large dataset - Stack Overflow

I'm working with a very large dataset containing CWD (Cumulative Water Deficit) and EVI (Enhanced Vegetation Index) measurements across different landcover types. The current code uses LOESS regression to model the relationship between these variables, but it is extremely slow: it has been running for more than 5 days and still has not completed.

Here's a snippet of my current approach:

Loess_model <- tryCatch({
  loess(EVI ~ cwd, data = filtered_data, span = 0.5)
}, error = function(e) {
  # Report which group failed so the run can continue with the rest
  print(paste("LOESS fitting failed for landcover:",
              landcover_val, "rp_group:", rp_group_val))
  print(paste("Error:", conditionMessage(e)))
  return(NULL)
})

I'm processing multiple landcover-return period groups in parallel (using the future package), but even with parallelization, the computational time is prohibitive. Some of my datasets contain over 1 000 000 observations for a single group.
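
For reference, here is a simplified sketch of that per-group loop (using future.apply on top of future; dt_all and the exact column handling are stand-ins for my real code):

library(data.table)
library(future.apply)

plan(multisession)   # one background R session per core

# One row per landcover / return-period combination
groups <- unique(dt_all[, .(landcover, rp_group)])

models <- future_lapply(seq_len(nrow(groups)), function(i) {
  g             <- groups[i]
  filtered_data <- dt_all[landcover == g$landcover & rp_group == g$rp_group]
  tryCatch(
    loess(EVI ~ cwd, data = filtered_data, span = 0.5),
    error = function(e) NULL   # skip groups where the fit fails
  )
})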

I've already:

  • Converted to data.table for faster data handling
  • Implemented parallel processing with the future package
  • Explored different span values

What alternatives would you recommend for:

  • A faster LOESS implementation in R
  • Alternative smoothing/regression approaches that scale better with large datasets

My main goal, as shown in this example, is to identify thresholds across different landcover types and drought return periods, so I need a smoothing approach that can capture the non-linear relationship effectively.

Has anyone tackled a similar problem or can recommend alternatives to standard LOESS that would be more computationally efficient?

asked Mar 16 at 14:43 by Shunrei
  • 5 The default behaviour of ggplot2::geom_smooth is "stats::loess() is used for less than 1,000 observations; otherwise mgcv::gam() is used with formula = y ~ s(x, bs = "cs") with method = "REML". Somewhat anecdotally, loess gives a better appearance, but is O(N^2) in memory, so does not work for larger datasets." So on that basis I would give gam a try. – Andrew Gustar Commented Mar 16 at 14:50
  • 1 On my machine, 1e6 rows of the cars data take 375.5 seconds, extrapolated from the fitted timing model 2.1 + 2.54e-05*x + 3.48e-10*x^2 at x = 1e6. Using OPENBLAS-OPENMP. Check which BLAS you're using. Provide sessionInfo() as a first step. – jay.sf Commented Mar 16 at 17:12
  • 2 While libopenblasp has parallel capabilities, it isn't necessarily using parallel OPENMP but probably something like single-threaded SERIAL. Switch to OPENMP. Assuming you're running Linux, you can do that very easily using FlexiBLAS. – jay.sf Commented Mar 16 at 18:17
  • 3 also try out the RhpcBLASctl package to query/set OpenMP/BLAS threading ... – Ben Bolker Commented Mar 16 at 18:25
  • 2 I'd echo what others have said: Fit a GAM. I would use mgcv::bam. You can fix the spline df if you don't want to use the penalized regression implemented in mgcv. – Roland Commented Mar 17 at 7:56
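
Pulling the GAM suggestions from the comments together, here is a minimal sketch of the mgcv::bam route (column names mirror the question; the prediction grid at the end is just one illustrative way to inspect the smooth for thresholds):

library(mgcv)

# bam() is mgcv's large-data GAM fitter; method = "fREML" with
# discrete = TRUE makes it much faster than gam() on 1e6+ rows
fit <- bam(EVI ~ s(cwd, bs = "cs"), data = filtered_data,
           method = "fREML", discrete = TRUE)

# Evaluate the fitted smooth on a grid of cwd values
grid <- data.frame(cwd = seq(min(filtered_data$cwd),
                             max(filtered_data$cwd), length.out = 500))
grid$EVI_hat <- predict(fit, newdata = grid)

And for the BLAS/OpenMP threading checks that jay.sf and Ben Bolker mention, a minimal sketch assuming the RhpcBLASctl package:

library(RhpcBLASctl)
blas_get_num_procs()      # how many threads the linked BLAS is using
blas_set_num_threads(1)   # avoid oversubscription when future workers run in parallel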

1 Answer

Use the lowess function built into R, or Hmisc::movStats. movStats uses data.table to efficiently compute smoothed estimates over moving, overlapping windows of x. Here is an example, with timings.

Side comment: Smoothing is better at showing that thresholds don't exist than it is for finding useful thresholds, since to be useful, relationships need to be flat on both sides of the threshold, which doesn't occur in nature very often.

require(Hmisc)
require(ggplot2)

set.seed(1)
n <- 1000000
x <- runif(n)
y <- x ^ 2 + runif(n)            # quadratic signal plus uniform noise

system.time(f <- lowess(x, y))   # base-R lowess: about 1.3s for 1e6 points

# melt=TRUE returns results in long format, ready for ggplot2
system.time(m <- movStats(y ~ x, melt=TRUE))   # about 0.4s
ggplot(m, aes(x=x, y=y, color=Statistic)) + geom_line()
