I am using df.cache() to cache a DataFrame, with Databricks autoscaling set to a minimum of 1 instance and a maximum of 8. But caching doesn't work properly here: some executors die mid-execution and the cached data is lost with them. When I set the min and max instance counts equal, the cache works fine. How can I configure the cluster so that cached data is not lost during downscaling?
1 Answer
The only practical things you can do in a plain-vanilla environment are:
- Use checkpointing, which writes the data to reliable storage.
- Or persist with the DISK_ONLY storage level, which is slower.

Both options are sketched below.
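A minimal sketch of both options, assuming a PySpark session is available as `spark`, `df` is the DataFrame in question, and the checkpoint path is a hypothetical placeholder:

```python
from pyspark import StorageLevel

# Option 1: checkpointing. Materializes the DataFrame to the checkpoint
# directory (reliable storage) and truncates the lineage, so the data
# survives executors being removed during downscaling.
spark.sparkContext.setCheckpointDir("/tmp/checkpoints")  # hypothetical path
df_checkpointed = df.checkpoint()  # eager by default; triggers the write

# Option 2: persist with DISK_ONLY. Blocks are spilled to executor local
# disk instead of memory; slower to read back than an in-memory cache.
df_on_disk = df.persist(StorageLevel.DISK_ONLY)
df_on_disk.count()  # an action to force materialization
```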
Of course, an executor may still fail before anything has been written. Even then, Spark's fault tolerance allows recomputation from the source (or from the last checkpoint), so in the end this is not a huge issue.