
amazon web services - Glue S3 CSV Load Improvement


We have an AWS Glue job that reads CSV files from an S3 bucket using PySpark's load():

    read_s3_files = spark.read.format("csv") \
        .option("header", True) \
        .option("ignoreLeadingWhiteSpace", True) \
        .option("ignoreTrailingWhiteSpace", True) \
        .load(s3_file_path)

Currently we are facing two issues:

  1. It takes ~15-20 seconds per record.
  2. The CSV load Spark job runs on the driver when we process more than 20 records.

How can we improve Glue CSV processing in PySpark?

asked Nov 20, 2024 at 11:47 by saiyantan

1 Answer


You can use parallelism by splitting the input data.

If the CSV file is large, ensure it’s split into multiple files. AWS Glue processes files in parallel if multiple partitions or smaller files are present.

read_s3_files = spark.read.format("csv") \
    .option("header", True) \
    .option("ignoreLeadingWhiteSpace", True) \
    .option("ignoreTrailingWhiteSpace", True) \
    .option("multiLine", True) \
    .load("s3://bucket/prefix/")  # Folder path, not a single file

You can also dynamically adjust the number of partitions depending on your files and cluster size:

read_s3_files = read_s3_files.repartition(200)  # Adjust based on your cluster size
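
As a rough sketch, instead of hard-coding 200 you can derive the partition count from the parallelism Spark reports for the job; the 2x multiplier below is just a common heuristic, not a Glue requirement:

# defaultParallelism reflects the total cores available to this job,
# so scaling from it keeps the partition count in step with the worker count.
target_partitions = spark.sparkContext.defaultParallelism * 2  # assumed heuristic: ~2 partitions per core

read_s3_files = read_s3_files.repartition(target_partitions)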