How to run large scale processing of multiple shortest path finding calls?

I'm trying to build a solution that takes start and end coordinates, along with timestamps, and finds the shortest path between them. It uses the UK road network pulled from OSM: given the start and end lat/lon, it finds the closest nodes in the network and then computes the shortest path.
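For concreteness, here's a minimal sketch of that per-route step, assuming OSMnx (1.x) for the network pull and NetworkX for the routing (illustrative only, not the exact pipeline code):

```python
import osmnx as ox
import networkx as nx

# Assumed setup: pull a drivable network and derive per-edge travel times.
# (In practice the full UK network would be loaded from a pre-built extract.)
G = ox.graph_from_place("Manchester, UK", network_type="drive")
G = ox.add_edge_speeds(G)          # impute speed_kph from OSM highway tags
G = ox.add_edge_travel_times(G)    # per-edge travel_time in seconds

def route_between(start_lat, start_lon, end_lat, end_lon):
    # Snap the raw coordinates to the nearest graph nodes, then route
    # on the travel-time weight.
    orig = ox.distance.nearest_nodes(G, X=start_lon, Y=start_lat)
    dest = ox.distance.nearest_nodes(G, X=end_lon, Y=end_lat)
    return nx.shortest_path(G, orig, dest, weight="travel_time")
```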

The current engineering pipeline uses Apache Spark on serverless Dataproc.

The main issue is that I have to calculate the shortest path for 12 million start/end lat-lon examples. They're not all unique, so I could precompute certain routes, but that would still leave a large number, around 500,000 unique pairs, and I'd still need to calculate new timestamps for each example regardless.
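The precompute idea looks roughly like this in PySpark (`df`, the column names, and `compute_path_udf` are placeholders, not the real pipeline):

```python
# Path geometry depends only on the endpoint pair, so compute each
# unique pair once and join the result back per example.
pairs = df.select("start_lat", "start_lon", "end_lat", "end_lon").distinct()

# ~500,000 unique pairs: run the expensive routing once per pair...
routes = pairs.withColumn(
    "path",
    compute_path_udf("start_lat", "start_lon", "end_lat", "end_lon"),
)

# ...then rejoin to the full 12 million examples, each of which keeps its
# own timestamps for the per-example timestamping step.
df_with_paths = df.join(
    routes,
    on=["start_lat", "start_lon", "end_lat", "end_lon"],
    how="left",
)
```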

I'm looking for a solution where, given a graph network, I can compute millions of shortest paths and return the points within those paths, timestamped according to each individual example.
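To be concrete about the timestamping: each example's departure time is rolled forward along its path using the per-edge travel times. A hypothetical helper, reusing the `travel_time` attribute from the sketch above:

```python
from datetime import datetime, timedelta

def timestamp_path(G, path, depart_at: datetime):
    # Hypothetical helper: assign each node on the path a timestamp by
    # accumulating the travel_time edge weight from the departure time.
    out = [(path[0], depart_at)]
    t = depart_at
    for u, v in zip(path, path[1:]):
        # OSMnx graphs are MultiDiGraphs: take the fastest parallel edge.
        t += timedelta(seconds=min(d["travel_time"] for d in G[u][v].values()))
        out.append((v, t))
    return out
```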

Currently it's built as a pandas UDF. The linestrings needed to build the network are broadcast as an Arrow dataframe. For each example the UDF builds a subgraph covering an area around the start and end points, computes the shortest path, and adds timestamps for the points in the route. This takes a long time and doesn't scale well.
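Structurally, the UDF looks roughly like this (`edges_broadcast`, `build_subgraph`, `shortest_path_on`, and `encode_path` are placeholders for the real broadcast dataframe and routing logic):

```python
import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

@F.pandas_udf(StringType())
def route_udf(start_lat: pd.Series, start_lon: pd.Series,
              end_lat: pd.Series, end_lon: pd.Series) -> pd.Series:
    # Broadcast Arrow dataframe of linestrings, as described above.
    edges = edges_broadcast.value
    results = []
    for slat, slon, elat, elon in zip(start_lat, start_lon, end_lat, end_lon):
        # Clip the network to an area around the endpoints, then route
        # on the subgraph, once per example.
        sub = build_subgraph(edges, (slat, slon), (elat, elon))
        path = shortest_path_on(sub, (slat, slon), (elat, elon))
        results.append(encode_path(path))  # e.g. serialize to WKT/JSON
    return pd.Series(results)
```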

I've explored GraphFrames and GraphX, but the shortestPaths algorithm doesn't return the actual path, just a distance map, and bfs doesn't incorporate weights (which in this case are travel times based on road speeds). With either one you can't easily iterate over multiple paths (at least not that I've found); both calls are sketched below. Any potential ideas would be really appreciated.
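For reference, the two GraphFrames calls in question, with placeholder vertex/edge DataFrames:

```python
from graphframes import GraphFrame

g = GraphFrame(vertices, edges)  # placeholder vertex/edge DataFrames

# shortestPaths is landmark-based: it returns a distances map per vertex,
# not the node sequence, so there is nothing to timestamp.
dists = g.shortestPaths(landmarks=["node_a", "node_b"])

# bfs does return paths, but it expands by hop count only; the travel_time
# edge weight can't influence the search.
paths = g.bfs(fromExpr="id = 'node_a'", toExpr="id = 'node_b'")
```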
