
Apache Spark: Cannot grow BufferHolder by size 524432 because the size after growing exceeds size limitation 2147483632


I am working with a large and nested JSON dataset in Apache Spark and encountering a "max buffer size exceeded" exception during the writing process.

My Processing Steps:

  1. Read the JSON file.
  2. Explode nested structures.
  3. Filter out unnecessary data.
  4. Select relevant columns.
  5. Count the records.
  6. Write the final DataFrame.

Issue: During the count() or write() operations, Spark is recomputing all transformations from the beginning, leading to excessive memory usage and eventually the max buffer size exceeded error.

What I Tried:

Initially, I got a GC (garbage collection) error, so I increased the driver and executor memory (spark.driver.memory, spark.executor.memory).

Now, the GC error is gone, but I still get the "max buffer size exceeded" error during count() or write().

Spark seems to recompute all transformations during these actions, leading to excessive memory usage.

Questions: How can I prevent Spark from recomputing all transformations at the final stage?

Is caching or checkpointing an effective solution here?

Are there any specific configurations to handle this buffer size limitation?

asked Mar 27 at 9:29 by Ram Shan
  • Filter first and select before exploding. If you can. – Ged Commented Mar 27 at 9:58
  • Cache? At an appropriate level. – Ged Commented Mar 27 at 10:11

1 Answer


Spark is not recomputing in the final stage; it is doing lazy evaluation, i.e. it performs the actual computation only when it encounters an action. So when it reaches the write action, it starts reading the data, transforms it, and writes it to the destination. This is what Spark normally does. Since you have not attached your code, I am assuming that is what is happening. Caching would not help here. This error occurs when a single column grows to a very large size, so you need to check what exactly your explode is doing. Below is a thread on the same error, with a couple of solutions, which you can refer to:
https://community.databricks.com/t5/data-engineering/bufferholder-exceeded-on-json-flattening/td-p/12873

By the way, you can attach your code and insert the image directly in the question itself instead of linking to it. That will make it easier to check.
