
Optimizing PySpark Job with Large Parquet Data and High Disk Usage


I’m currently working on optimizing a PySpark job that involves a couple of aggregations across large datasets. I’m fairly new to processing large-scale data and am encountering issues with disk usage and job efficiency. Here are the details:

Chosen cluster:

•   Worker Nodes: 6
•   Cores per Worker: 48
•   Memory per Worker: 384 GB

Data:

•   Table A: 158 GB
•   Table B: 300 GB
•   Table C: 32 MB

Process:

1.  Read the DataFrames from Delta tables.
2.  Perform a broadcast join between Table B and the small Table C.
3.  Join the resulting DataFrame with Table A on three columns: id, family, part_id.
4.  The final step performs upsert operations into the destination.
5.  The destination table is partitioned by id, family, date. (A rough sketch of these steps follows the list.)
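
To make the pipeline concrete, here is a minimal PySpark/Delta sketch of the steps above. Table names, the Table B/Table C join key (`c_key`), and the MERGE condition are placeholders I've assumed for illustration, not the actual job:

```python
from pyspark.sql import SparkSession, functions as F
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

# Step 1: read from Delta tables (placeholder table names).
df_a = spark.table("table_a")   # ~158 GB
df_b = spark.table("table_b")   # ~300 GB
df_c = spark.table("table_c")   # ~32 MB, small enough to broadcast

# Step 2: broadcast join Table B with the small Table C
# ("c_key" is a placeholder for the actual join key).
b_with_c = df_b.join(F.broadcast(df_c), on="c_key", how="inner")

# Step 3: join the result with Table A on the three key columns.
joined = b_with_c.join(df_a, on=["id", "family", "part_id"], how="inner")

# Steps 4-5: upsert (MERGE) into the destination Delta table,
# which is partitioned by id, family, date.
# The merge condition below is an assumed example.
target = DeltaTable.forName(spark, "destination_table")
(
    target.alias("t")
    .merge(joined.alias("s"),
           "t.id = s.id AND t.family = s.family AND t.date = s.date")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```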

The only thing that comes to mind is switching the cluster to more disk-optimized instances. My question is: how can I interpret the Spark UI's Storage tab and use it to figure out how to optimize this job?
