We have an AWS Glue job that reads CSV files from an S3 bucket using PySpark's load():
read_s3_files = spark.read.format("csv") \
.option("header", True) \
.option("ignoreLeadingWhiteSpace", True) \
.option("ignoreTrailingWhiteSpace", True) \
.load(s3_file_path)
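For context, here is a minimal sketch of the Glue job bootstrap this snippet assumes; it is standard Glue boilerplate, and the s3_file_path value is a placeholder, not our real path:
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

# Standard Glue job initialization
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

s3_file_path = "s3://my-bucket/input/data.csv"  # placeholder path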
Currently we are facing two issues:
- it takes ~15-20 seconds per record.
- the CSV load Spark job runs only on the driver when we process more than 20 records.
How can we improve Glue CSV PySpark processing?
1 Answer
You can use parallelism by splitting the input data.
If the CSV file is large, ensure it’s split into multiple files. AWS Glue processes files in parallel if multiple partitions or smaller files are present.
read_s3_files = spark.read.format("csv") \
.option("header", True) \
.option("ignoreLeadingWhiteSpace", True) \
.option("ignoreTrailingWhiteSpace", True) \
.option("multiLine", True) \
.load("s3://bucket/prefix/") # Folder path, not a single file
You can also dynamically adjust the number of partitions depending on your file sizes and cluster:
read_s3_files = read_s3_files.repartition(200) # Adjust based on your cluster size
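Instead of a hard-coded 200, you can base the partition count on what Spark actually reports. This is a hedged sketch; the 2x default-parallelism factor is a common rule of thumb, not a Glue-specific requirement:
num_partitions = read_s3_files.rdd.getNumPartitions()        # how many partitions Spark created from the input
default_parallelism = spark.sparkContext.defaultParallelism  # roughly the total executor cores available
# Assumption: target ~2x the default parallelism, but never fewer partitions than we already have
read_s3_files = read_s3_files.repartition(max(num_partitions, default_parallelism * 2))
Checking getNumPartitions() also confirms whether the single-file input is what forces everything onto one task (and hence onto the driver-side bottleneck you are seeing).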