
Apache Spark: Cannot grow BufferHolder by size 524432 because the size after growing exceeds size limitation 2147483632


I am working with a large and nested JSON dataset in Apache Spark and encountering a "max buffer size exceeded" exception during the writing process.

My Processing Steps:

  1. Read the JSON file.
  2. Explode nested structures.
  3. Filter out unnecessary data.
  4. Select relevant columns.
  5. Count the records.
  6. Write the final DataFrame.

Issue: During the count() or write() operations, Spark is recomputing all transformations from the beginning, leading to excessive memory usage and eventually the max buffer size exceeded error.

What I Tried:

Initially, I got a GC (garbage collection) error, so I increased the driver and executor memory (spark.driver.memory, spark.executor.memory).

Now, the GC error is gone, but I still get the "max buffer size exceeded" error during count() or write().

Spark seems to recompute all transformations during these actions, leading to excessive memory usage.

Questions: How can I prevent Spark from recomputing all transformations at the final stage?

Is caching or checkpointing an effective solution here?

Are there any specific configurations to handle this buffer size limitation?

asked Mar 27 at 9:29 by Ram Shan
  • Filter first and select before exploding. If you can. – Ged Commented Mar 27 at 9:58
  • Cache? At an appropriate level. – Ged Commented Mar 27 at 10:11

1 Answer


Spark is not recomputing in the final stage; it is doing lazy evaluation, i.e. it performs the actual computation only when it encounters an action. So when it reaches the write action, it starts reading the data, transforms it, and writes it to the destination. This is what Spark normally does. Since you have not attached your code, I am assuming that is what is happening. Caching would not help here. This error occurs when a single column grows to a very large size, so you need to check what exactly your explode is doing. Below is a thread on the same error, with a couple of solutions, which you can refer to:
https://community.databricks.com/t5/data-engineering/bufferholder-exceeded-on-json-flattening/td-p/12873

By the way, you can attach your code and insert the image directly in the question itself instead of linking to it. That will make it easier to check.
