How to run large scale processing of multiple shortest path finding calls?

I'm trying to build a solution that takes start and end coordinates, along with timestamps, and finds the shortest path between them. It uses the UK road network pulled from OSM: given the start and end lat/lon, it finds the closest nodes in the network and then computes the shortest path.
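For concreteness, here's a minimal sketch of that per-route step, assuming OSMnx (1.x) for the network pull and NetworkX for the routing (illustrative only, not the exact pipeline code):

```python
import osmnx as ox
import networkx as nx

# Assumed setup: pull a drivable network and derive per-edge travel times.
# (In practice the full UK network would be loaded from a pre-built extract.)
G = ox.graph_from_place("Manchester, UK", network_type="drive")
G = ox.add_edge_speeds(G)          # impute speed_kph from OSM highway tags
G = ox.add_edge_travel_times(G)    # per-edge travel_time in seconds

def route_between(start_lat, start_lon, end_lat, end_lon):
    # Snap the raw coordinates to the nearest graph nodes, then route
    # on the travel-time weight.
    orig = ox.distance.nearest_nodes(G, X=start_lon, Y=start_lat)
    dest = ox.distance.nearest_nodes(G, X=end_lon, Y=end_lat)
    return nx.shortest_path(G, orig, dest, weight="travel_time")
```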

The current engineering pipeline uses Apache Spark on serverless Dataproc.

The main issue is that I have to calculate the shortest path for 12 million start/end lat-lon examples. They're not all unique, so I could precompute certain routes, but that would still leave a large number, around 500,000 unique pairs, and I'd still need to calculate new timestamps for each example regardless.
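The precompute idea looks roughly like this in PySpark (`df`, the column names, and `compute_path_udf` are placeholders, not the real pipeline):

```python
# Path geometry depends only on the endpoint pair, so compute each
# unique pair once and join the result back per example.
pairs = df.select("start_lat", "start_lon", "end_lat", "end_lon").distinct()

# ~500,000 unique pairs: run the expensive routing once per pair...
routes = pairs.withColumn(
    "path",
    compute_path_udf("start_lat", "start_lon", "end_lat", "end_lon"),
)

# ...then rejoin to the full 12 million examples, each of which keeps its
# own timestamps for the per-example timestamping step.
df_with_paths = df.join(
    routes,
    on=["start_lat", "start_lon", "end_lat", "end_lon"],
    how="left",
)
```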

I'm looking for a solution where, given a graph network, I can compute millions of shortest paths and return the points within those paths, timestamped according to each individual example.
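To be concrete about the timestamping: each example's departure time is rolled forward along its path using the per-edge travel times. A hypothetical helper, reusing the `travel_time` attribute from the sketch above:

```python
from datetime import datetime, timedelta

def timestamp_path(G, path, depart_at: datetime):
    # Hypothetical helper: assign each node on the path a timestamp by
    # accumulating the travel_time edge weight from the departure time.
    out = [(path[0], depart_at)]
    t = depart_at
    for u, v in zip(path, path[1:]):
        # OSMnx graphs are MultiDiGraphs: take the fastest parallel edge.
        t += timedelta(seconds=min(d["travel_time"] for d in G[u][v].values()))
        out.append((v, t))
    return out
```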

Currently it's built as a pandas UDF. The linestrings needed to build the network are broadcast as an Arrow dataframe. For each example the UDF builds a subgraph covering an area around the start and end points, computes the shortest path, and adds timestamps for the points in the route. This takes a long time and doesn't scale well.
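Structurally, the UDF looks roughly like this (`edges_broadcast`, `build_subgraph`, `shortest_path_on`, and `encode_path` are placeholders for the real broadcast dataframe and routing logic):

```python
import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

@F.pandas_udf(StringType())
def route_udf(start_lat: pd.Series, start_lon: pd.Series,
              end_lat: pd.Series, end_lon: pd.Series) -> pd.Series:
    # Broadcast Arrow dataframe of linestrings, as described above.
    edges = edges_broadcast.value
    results = []
    for slat, slon, elat, elon in zip(start_lat, start_lon, end_lat, end_lon):
        # Clip the network to an area around the endpoints, then route
        # on the subgraph, once per example.
        sub = build_subgraph(edges, (slat, slon), (elat, elon))
        path = shortest_path_on(sub, (slat, slon), (elat, elon))
        results.append(encode_path(path))  # e.g. serialize to WKT/JSON
    return pd.Series(results)
```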

I've explored GraphFrames and GraphX, but the shortestPaths algorithm doesn't return the actual path, just a distance map, and bfs doesn't incorporate weights (which in this case are travel times based on road speeds). With either one you can't easily iterate over multiple paths (at least not that I've found); both calls are sketched below. Any potential ideas would be really appreciated.
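For reference, the two GraphFrames calls in question, with placeholder vertex/edge DataFrames:

```python
from graphframes import GraphFrame

g = GraphFrame(vertices, edges)  # placeholder vertex/edge DataFrames

# shortestPaths is landmark-based: it returns a distances map per vertex,
# not the node sequence, so there is nothing to timestamp.
dists = g.shortestPaths(landmarks=["node_a", "node_b"])

# bfs does return paths, but it expands by hop count only; the travel_time
# edge weight can't influence the search.
paths = g.bfs(fromExpr="id = 'node_a'", toExpr="id = 'node_b'")
```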
