
pyspark - Are Spark checkpoints invalidated when source data is changed? - Stack Overflow


I'm trying to understand the behaviour of (py)Spark checkpointing. Let's say there is some source data A (on S3 or HDFS) with intermediate RDD checkpoints B and C:

If the source data changes (e.g. new data is added), will B and C be recalculated?

If not, is the standard approach to set the checkpoint directory to be named according to e.g. timestamps or something else?

I did look at the question "Spark checkpointing behaviour", but its answers only covered code changes, not changes to the source data.


asked Jan 30 at 10:36 by kd88

1 Answer

No. Checkpoints B and C are snapshots whose purpose is to avoid recomputation from source A in case of a failure during the Spark application. They are never invalidated or recomputed just because A changed.

Conversely, changes to the source are irrelevant and not recognized while the application runs without failure, whether or not checkpointing is applied. If a failure occurs and there is no checkpoint, then depending on the type of source, newer (changed) data may be read during recomputation, but Spark handles that cleanly per stage.
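Since checkpoints are not invalidated by source changes, the timestamped-directory approach from the question is a reasonable pattern: give each run its own checkpoint directory so a re-run over changed data writes fresh snapshots instead of a later job reusing stale ones. Below is a minimal sketch; the helper name `checkpoint_dir` and the bucket path are hypothetical, and the pyspark calls shown in comments (`setCheckpointDir`, `checkpoint`) are the standard RDD checkpointing API.

```python
from datetime import datetime, timezone

def checkpoint_dir(base: str) -> str:
    # Name each run's checkpoint directory after a UTC timestamp, e.g.
    # s3://my-bucket/app/checkpoints/20250130T103600Z, so runs never
    # collide and stale snapshots are easy to identify and clean up.
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return f"{base}/checkpoints/{stamp}"

# Hypothetical pyspark usage (sketch, not from the original question):
#   sc.setCheckpointDir(checkpoint_dir("s3://my-bucket/app"))
#   b = a.map(transform)      # intermediate RDD B derived from source A
#   b.checkpoint()            # mark B for checkpointing
#   b.count()                 # an action materializes the checkpoint
```

An action (such as `count`) is needed after `checkpoint()` because checkpointing is lazy: the snapshot is only written when the RDD is actually computed.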
