I am working with uranium-lead (U-Pb) geochronological datasets. This type of data is typically interpreted in a "Wetherill" concordia diagram, an x-y coordinate space where the x axis is the 207Pb/235U ratio and the y axis is the 206Pb/238U ratio, so each axis provides an independent age estimate. Because the two isotopic systems decay at different rates, the line of equal age (the concordia) is curved in this space.
Geochronologists interpret data arrays in this space: the relevant age information lies where those arrays intersect the concordia, the curved line of equal age.
Typical interpretations rely on a data array following a single line, which is readily regressed to find its intersections with the concordia, and uncertainty is robustly propagated through those regressions.
However, more complex datasets can be difficult to interpret.
The fundamental information in more complex scenarios can be represented by a polygon, where the corners of the polygon must fall on the concordia curve - these are the points that contain the age information. In essence, we can interpret the boundary lines in a dense data array as containing the geochronological information.
I am dealing with very large-n datasets (>1 million data points), and I am working on a method to identify the polygon that contains the largest fraction of the data while also tightly defining the outer limit of the data. In other words, the optimal polygon is not simply the largest polygon.
Below I generate some synthetic data that corresponds to a complex data array. The scenario modeled contains a large population of mineral grains that grew between 3 billion years ago (upper.int.age.max) and 2.5 billion years ago (upper.int.age.min). These grains then experienced an alteration event 1 billion years ago (lower.int.age.1) and another alteration event 100 million years ago (lower.int.age.2).
## set seed
set.seed( 12 )
## now create ages (in Ma) for the model dataset's events
lower.int.age.1 <- 1000
lower.int.age.2 <- 100
upper.int.age.min <- 2500
upper.int.age.max <- 3000
## a couple of required equations
ratio735 <- function(age) { ## age in Ma; 235U decay constant in 1/yr
  exp( 9.8485e-10 * age * 1e6 ) - 1
}
ratio638 <- function(age) { ## age in Ma; 238U decay constant in 1/yr
  exp( 1.55125e-10 * age * 1e6 ) - 1
}
## calculate x and y values from the age information
x.values <- c( ratio735( lower.int.age.1 ),
               ratio735( lower.int.age.2 ) ) # 207Pb/235U ratios of the lower intercepts
y.values <- c( ratio638( lower.int.age.1 ),
               ratio638( lower.int.age.2 ) ) # 206Pb/238U ratios of the lower intercepts
## now make a data frame of example data
data.test <- data.frame( upper.age = runif( 5000, min = upper.int.age.min,
                                            max = upper.int.age.max ),
                         prop.1 = runif( 5000, min = 0.05, max = 1 ),
                         prop.2 = runif( 5000, min = 0.05, max = 1 ) )
## first alteration: mix each grain's primary composition toward the 1 Ga intercept
data.test$x1 <- ( ratio735( data.test$upper.age ) - x.values[1] ) * data.test$prop.1 + x.values[1]
data.test$y1 <- ( ratio638( data.test$upper.age ) - y.values[1] ) * data.test$prop.1 + y.values[1]
## second alteration: mix the result toward the 100 Ma intercept
data.test$x2 <- ( data.test$x1 - x.values[2] ) * data.test$prop.2 + x.values[2]
data.test$y2 <- ( data.test$y1 - y.values[2] ) * data.test$prop.2 + y.values[2]
Here are the equations for the line of equal age, the concordia line; I also generate some points along the line for plotting purposes.
## concordia line equations (same exponential form as ratio735/ratio638 above)
concordia.x <- function(t) { ## t in Ma
  exp( 9.8485e-10 * t * 1e6 ) - 1
}
concordia.y <- function(t) { ## t in Ma
  exp( 1.55125e-10 * t * 1e6 ) - 1
}
concordia.line <- data.frame( age = seq(0,3000,1) )
concordia.line$x <- ratio735( concordia.line$age )
concordia.line$y <- ratio638( concordia.line$age )
concordia.points <- subset( concordia.line, age == lower.int.age.1 |
                            age == lower.int.age.2 | age == upper.int.age.min |
                            age == upper.int.age.max )
Now plot the synthetic data and the concordia line. The black points are the geological events.
## plot the data and the concordia line (requires ggplot2)
library(ggplot2)
ggplot( data.test, aes( x = x2, y = y2 ) ) +
  geom_line( data = concordia.line, aes( x = x, y = y ) ) +
  geom_point( color = "red" ) +
  geom_point( data = concordia.points, aes( x = x, y = y ), size = 4 )
Here is the resulting plot (the black points mark lower.int.age.1, lower.int.age.2, upper.int.age.min, and upper.int.age.max on the concordia). Ideally a model will return a polygon with those four bold points making up its corners.
I'm currently doing this by brute force, generating a suite of hypothetical polygons with corners defined along the concordia line. I generate the polygons by calculating four sequences of possible corner ages, then using expand.grid to find all possible combinations. Then I filter that grid for the correct ordering of the corners.
## create polygons
lower.1 = seq( from = 0, to = 100, by = 50 ) ## corner 1
lower.2 = seq( from = 900, to = 1100, by = 100 ) ## corner 2
upper.1 = seq( from = 2400, to = 2800, by = 100 ) ## corner 3
upper.2 = seq( from = 2800, to = 3200, by = 100 ) ## corner 4
# find all combinations
age.values = expand.grid( lower.1, lower.2, upper.1, upper.2 )
colnames( age.values ) <- c( "lower.1", "lower.2", "upper.1",
"upper.2" )
## filter for polygons whose corners are in the correct order
age.values.used <- subset( age.values, lower.1 < lower.2 & lower.2 < upper.1 & upper.1 < upper.2 )
## calculate a list of x,y values for each polygon
polygons <- vector( "list", nrow( age.values.used ) )
## fill the list with a for loop - not a great way, I know
for( i in 1:nrow( age.values.used ) ) {
  data.temp <- age.values.used[ i, ]
  polygons[[ i ]] <- data.frame( age = data.temp,
                                 x = concordia.x( as.vector( t( data.temp ) ) ),
                                 y = concordia.y( as.vector( t( data.temp ) ) ) )
}
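For reference, each list element should end up as a four-row data frame (one row per corner) holding the corner ages in columns age.lower.1 through age.upper.2 alongside their concordia x and y coordinates; a quick way to confirm that structure:
## inspect one candidate polygon and count how many candidates were built
str( polygons[[ 1 ]] )
nrow( age.values.used )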
Once I have the polygons in a list, where each element contains the four corners of a polygon, I make a function to find the fraction of data in each polygon. This uses the sp::point.in.polygon function and simply calculates the fraction of data points falling inside each polygon in the list.
library(sp)
library(pracma)
## calculate the fraction of data in one polygon
## poly.frame: data frame with the polygon's x and y corner coordinates
## points:     data frame of data points (x in column 1, y in column 2)
in.poly <- function( poly.frame, points ) {
  ## sp::point.in.polygon returns 0 = outside, 1 = strictly inside,
  ## 2 or 3 = on an edge or vertex; only strictly interior points are counted here
  in.polygon <- sp::point.in.polygon( point.x = points[ , 1 ],
                                      point.y = points[ , 2 ],
                                      pol.x = poly.frame$x,
                                      pol.y = poly.frame$y )
  sum( in.polygon == 1 ) / length( in.polygon )
}
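As a quick sanity check, the helper can be called on a single candidate polygon before looping over all of them; this just reuses the objects defined above.
## example: fraction of the synthetic data inside the first candidate polygon
in.poly( polygons[[ 1 ]], data.test[ , c( "x2", "y2" ) ] )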
Now I loop through, using that function to find the fraction of data in each polygon.
## create a blank data frame to fill in a for loop
results.data <- data.frame( upper.1 = rep( NA, times = length( polygons ) ),
                            upper.2 = rep( NA, times = length( polygons ) ),
                            lower.2 = rep( NA, times = length( polygons ) ),
                            lower.1 = rep( NA, times = length( polygons ) ),
                            fraction.data.in = rep( NA, times = length( polygons ) ),
                            area = rep( NA, times = length( polygons ) ) )
Now use the for loop to add the corner ages and the fraction of data in each polygon, and to calculate the area of each polygon (using the sp::Polygon function).
for( j in 1:length( polygons ) ) {
  results.data$upper.1[ j ] <- polygons[[ j ]]$age.upper.1[ 1 ]
  results.data$upper.2[ j ] <- polygons[[ j ]]$age.upper.2[ 1 ]
  results.data$lower.2[ j ] <- polygons[[ j ]]$age.lower.2[ 1 ]
  results.data$lower.1[ j ] <- polygons[[ j ]]$age.lower.1[ 1 ]
  results.data$fraction.data.in[ j ] <- in.poly( polygons[[ j ]],
                                                 data.test[ , c( "x2", "y2" ) ] )
  poly.def <- sp::Polygon( polygons[[ j ]][ , c( "x", "y" ) ] )
  results.data$area[ j ] <- slot( poly.def, "area" )
}
I can then use this data to figure out what polygons have the most data inside of them, while also having the least area to them.
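For example, one simple (and admittedly ad hoc) way to pull out a single "best" candidate from this table is to score each polygon as the fraction of data captured minus a normalized area penalty; the weight lambda below is arbitrary and only for illustration.
## score candidates: reward data captured, penalize (normalized) area
lambda <- 0.5 ## arbitrary trade-off weight - would need tuning for real data
results.data$score <- results.data$fraction.data.in -
  lambda * results.data$area / max( results.data$area )
best.polygon <- results.data[ which.max( results.data$score ), ]
best.polygon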
This is very much a brute-force method, and I'm hoping to find something that is more elegant (not hard, I'm sure) and runs significantly faster.
Thanks for any help you can provide, and please comment if I should explain something further.