I need to check which points in piracy_sf are contained within eez_t_final polygons, but I can't finish my analysis because R runs out of memory. My computer has 18 GB of RAM.
My code:
library(sf)
library(rmapshaper)

sf_use_s2(TRUE)
batch_size <- 5
num_batches <- ceiling(nrow(eez_t_final) / batch_size)
results <- list()
for (i in 1:num_batches) {
  start <- (i - 1) * batch_size + 1
  end <- min(i * batch_size, nrow(eez_t_final))
  eez_chunk <- eez_t_final[start:end, ]
  # Remove duplicate vertices using rmapshaper::ms_simplify
  eez_chunk <- ms_simplify(eez_chunk, keep = 0.6)
  # Fix invalid geometries using st_make_valid()
  if (any(!st_is_valid(eez_chunk))) {
    eez_chunk <- st_make_valid(eez_chunk)
  }
  # Simplify geometries further (optional)
  eez_chunk <- ms_simplify(eez_chunk, keep = 0.05) # Adjust 'keep' value as necessary
  # Ensure proper orientation of the polygons (if needed)
  eez_chunk <- st_reverse(eez_chunk)
  # Perform the spatial operation
  results[[i]] <- st_contains(eez_chunk, piracy_sf, sparse = TRUE)
  # Clean up memory after each batch
  gc()
}
# Combine results (a sparse list of index vectors, one per polygon)
containment_matrix <- do.call(c, results)
# Count piracy events per polygon
piracy_count_vector <- sapply(containment_matrix, length)
# Extract unique eez_country values for each polygon
eez_country_list <- lapply(containment_matrix, function(idx) {
  unique(piracy_sf$eez_country[idx])
})
# Optionally, store eez_country as a concatenated string
eez_country_string <- sapply(eez_country_list, function(x) paste(x, collapse = ", "))
# Add piracy counts and eez_country to eez_t_final
eez_t_final$piracy_count <- piracy_count_vector
eez_t_final$eez_country <- eez_country_string
# Print first few rows
print(head(eez_t_final))
To work around this, I have tried batching the analysis as well as simplifying the geometries, but I always run out of memory at the same point.
Comment from Jindra Lacko: Some idea about the size and structure of your piracy_sf object would be helpful. Is it a long list of points (easier), or lines or polygons (more tricky)? In theory you should be able to transform your problem from R (which is memory constrained) to a database operation (which is disk space constrained) using a tool such as spatialite or PostGIS.
2 Answers
This is always a difficult problem to solve (barring the classic "buy more memory"). But I can see a few strategies to improve your situation.

1. Store your results on disk: You have successfully split your calculation into batches, but the results are still accumulated in a single list, results. This list will become enormous and will probably consume your memory. Consider adding saveRDS() at the end of every batch computation and removing the object afterwards. That will release a lot of memory, at the expense of more time, disk space, and some clever thinking about how to conduct the next analysis in steps.
2. Cut the analysis geographically: Similar to the previous point, you can run the analysis over smaller geographical areas, which reduces the size of your results but also of your input (you don't need to load all EEZs, only the zones of that area). st_read() allows querying only certain features from some file formats.
3. Simplify your polygons: Similar to what L Tyrone suggested, use simpler polygons and conduct the analysis at greater detail only when it is absolutely necessary. st_bbox() is the most radical way to do that, replacing each EEZ with a simple rectangle.
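The first strategy can be sketched as below. This is only a sketch, not the asker's exact pipeline: the batches/ directory and file naming are my own invention, and the st_contains() call stands in for whatever per-batch computation is actually run.

```r
# Sketch: persist each batch result to disk instead of accumulating it in RAM.
# Assumes num_batches, batch_size, eez_t_final and piracy_sf exist as in the question.
dir.create("batches", showWarnings = FALSE)
for (i in 1:num_batches) {
  start <- (i - 1) * batch_size + 1
  end <- min(i * batch_size, nrow(eez_t_final))
  res <- st_contains(eez_t_final[start:end, ], piracy_sf, sparse = TRUE)
  saveRDS(res, file.path("batches", sprintf("batch_%03d.rds", i)))
  rm(res)  # drop the in-memory copy before the next batch
  gc()
}
# Later, read the files back one at a time and combine:
files <- sort(list.files("batches", pattern = "\\.rds$", full.names = TRUE))
results <- do.call(c, lapply(files, readRDS))
```

The zero-padded file names keep the batches in the right order when list.files() sorts them, so the combined list lines up with the original row order of eez_t_final.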
Given you have not provided metadata about your data, I assume:

- you are using the latest and most detailed EEZ data downloaded from Marineregions
- both piracy_sf and eez_t_final share the same CRS (WGS 84/EPSG:4326) and st_bbox() extent, e.g. the min/max longitude values for both files are either 0/360 or -180/180
- your eez_t_final file has not been subsetted and contains every EEZ
Loops are incredibly inefficient, see Circle 3 of the R Inferno for more details. And you are making it worse by running ms_simplify(), st_make_valid() etc. inside your loop. Running these once on the full dataset before your loop would be better. But obviously, avoiding loops altogether (where possible) is the best way.
I first applied the following solution using the latest, most detailed EEZ data and while it completed relatively quickly, viewing the result was very slow. That EEZ file has 10,187,922 nodes, but R should be able to handle a file of this size so I suspect something else is going on.
Also, based on my second and third assumptions, using sf_use_s2(TRUE) is problematic because of the way some sf functions handle S2 objects.
Either way, here is a workflow that will assign point counts to each EEZ. I have used the "Low res" EEZ file, but it does work for the most detailed EEZ option. Viewing the result for the most detailed EEZ may be slow though, but filtering the result to areas of interest, or creating a df version of eez_t_final, will speed up viewing the data e.g.:
# Subset
sub1 <- eez_t_final[eez_t_final$TERRITORY1 == "Tonga", ]
sub2 <- eez_t_final[eez_t_final$TERRITORY1 %in% c("Tonga", "Samoa"), ]
# Create dataframe version
df <- st_drop_geometry(eez_t_final)
In this example, counts are summarised by the GEONAME column, modify to suit if this is incorrect.
One further point, some EEZ are administered by > 1 authority so the total count may be greater than the number of points in piracy_sf. Based on the example piracy_sf object created below, eight points were counted twice.
library(sf)
library(dplyr)
# Turn off S2 processing
sf_use_s2(FALSE)
# Unzip and load "Maritime Boundaries (latest version)" [Low res] EEZ
# downloaded from https://www.marineregions.org/downloads.php
# Modify paths to match your working directory etc.
unzip("data/EEZ/World_EEZ_v12_20231025_LR.zip",
exdir = "data/EEZ")
# Load EEZ, repair any invalid geometries that may exist
# st_wrap_dateline() only needed for most detailed EEZ to
# convert 0/360 longitude coordinates to -180/180
eez_t_final <- st_read(
"data/EEZ/World_EEZ_v12_20231025_LR/eez_v12_lowres.shp"
) |>
# st_wrap_dateline() |>
st_make_valid()
# Create example points
set.seed(42)
piracy_sf <- data.frame(id = 1:1e4,
lon = runif(1e4, -180, 180),
lat = runif(1e4, -60, 86)) |>
st_as_sf(coords = c("lon", "lat"), crs = st_crs(eez_t_final))
# Find which points fall within each EEZ, summarise counts per EEZ GEONAME
# You can ignore the warning
p_count <- st_join(piracy_sf, eez_t_final) |>
st_drop_geometry() |>
summarise(count = n(), .by = GEONAME)
# Warning message (safe to ignore here):
# although coordinates are longitude/latitude, st_intersects assumes that they are planar
# Join counts to eez_t_final
eez_t_final <- left_join(eez_t_final, p_count, by = "GEONAME")
# Examine result
eez_t_final[, c("GEONAME", "count")]
# Simple feature collection with 285 features and 2 fields (with 1 geometry empty)
# Geometry type: GEOMETRY
# Dimension: XY
# Bounding box: xmin: -180 ymin: -62.78834 xmax: 180 ymax: 86.99373
# Geodetic CRS: WGS 84
# First 10 features:
# GEONAME count geometry
# 1 United States Exclusive Economic Zone (American Samoa) 7 POLYGON ((-166.6606 -17.548...
# 2 British Exclusive Economic Zone (Ascension) 8 POLYGON ((-10.93248 -7.9552...
# 3 New Zealand Exclusive Economic Zone (Cook Islands) 36 POLYGON ((-158.3847 -6.3777...
# 4 Overlapping claim Falkland / Malvinas Islands: United Kingdom / Argentina 16 POLYGON ((-61.62049 -53.781...
# 5 French Exclusive Economic Zone (French Polynesia) 73 MULTIPOLYGON (((-132.3897 -...
# 6 British Exclusive Economic Zone (Pitcairn) 13 POLYGON ((-133.4247 -26.568...
# 7 British Exclusive Economic Zone (Saint Helena) 9 POLYGON ((-2.167928 -15.977...
# 8 Samoan Exclusive Economic Zone 1 POLYGON ((-173.7557 -11.052...
# 9 Tongan Exclusive Economic Zone 9 POLYGON ((-171.8011 -15.960...
# 10 British Exclusive Economic Zone (Tristan da Cunha) 15 POLYGON ((-12.25361 -33.723...
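The double counting mentioned above can be verified by counting how often each point id appears in the join result. This sketch assumes the id column from the example piracy_sf created earlier; the duplicate test itself needs only base R.

```r
# Each row of the join result is one (point, EEZ) match, so a point id
# appearing more than once was matched by more than one EEZ.
joined <- st_join(piracy_sf, eez_t_final) |> st_drop_geometry()
dup_ids <- names(which(table(joined$id) > 1))
length(dup_ids)  # number of points counted in more than one EEZ
```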