
Optimizing PySpark Job with Large Parquet Data and High Disk Usage


I’m currently working on optimizing a PySpark job that involves a couple of aggregations across large datasets. I’m fairly new to processing large-scale data and am encountering issues with disk usage and job efficiency. Here are the details:

Chosen cluster:

•   Worker Nodes: 6
•   Cores per Worker: 48
•   Memory per Worker: 384 GB

Data:

•   Table A: 158 GB
•   Table B: 300 GB
•   Table C: 32 MB

Process:

1.  Read the DataFrames from Delta tables.
2.  Perform a broadcast join between Table B and the small Table C.
3.  Join the resulting DataFrame with Table A on three columns: id, family, part_id.
4.  The final step performs upsert operations into the destination.
5.  The destination table is partitioned by id, family, date. (A rough sketch of these steps follows the list.)
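
To make the pipeline concrete, here is a minimal PySpark/Delta sketch of the steps above. Table names, the Table B/Table C join key (`c_key`), and the MERGE condition are placeholders I've assumed for illustration, not the actual job:

```python
from pyspark.sql import SparkSession, functions as F
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

# Step 1: read from Delta tables (placeholder table names).
df_a = spark.table("table_a")   # ~158 GB
df_b = spark.table("table_b")   # ~300 GB
df_c = spark.table("table_c")   # ~32 MB, small enough to broadcast

# Step 2: broadcast join Table B with the small Table C
# ("c_key" is a placeholder for the actual join key).
b_with_c = df_b.join(F.broadcast(df_c), on="c_key", how="inner")

# Step 3: join the result with Table A on the three key columns.
joined = b_with_c.join(df_a, on=["id", "family", "part_id"], how="inner")

# Steps 4-5: upsert (MERGE) into the destination Delta table,
# which is partitioned by id, family, date.
# The merge condition below is an assumed example.
target = DeltaTable.forName(spark, "destination_table")
(
    target.alias("t")
    .merge(joined.alias("s"),
           "t.id = s.id AND t.family = s.family AND t.date = s.date")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```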

The only thing that comes to mind is switching the cluster to more disk-optimized instances. My question is: how can I interpret the Spark UI's Storage tab and use it to figure out how to optimize this job?
