We have an AWS Glue job that reads CSV files from an S3 bucket using PySpark's load():
read_s3_files = spark.read.format("csv") \
.option("header", True) \
.option("ignoreLeadingWhiteSpace", True) \
.option("ignoreTrailingWhiteSpace", True) \
.load(s3_file_path)
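For context, here is a minimal sketch of the Glue job bootstrap this snippet assumes; it is standard Glue boilerplate, and the s3_file_path value is a placeholder, not our real path:
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

# Standard Glue job initialization
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

s3_file_path = "s3://my-bucket/input/data.csv"  # placeholder path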
Currently we are facing two issues:
- it takes ~15-20 seconds per record.
- the CSV load Spark job runs only on the driver when we process more than 20 records.
How can we improve Glue CSV PySpark processing?
1 Answer
You can use parallelism by splitting the input data.
If the CSV file is large, ensure it’s split into multiple files. AWS Glue processes files in parallel if multiple partitions or smaller files are present.
read_s3_files = spark.read.format("csv") \
.option("header", True) \
.option("ignoreLeadingWhiteSpace", True) \
.option("ignoreTrailingWhiteSpace", True) \
.option("multiLine", True) \
.load("s3://bucket/prefix/") # Folder path, not a single file
You can also dynamically adjust the number of partitions depending on your file sizes and cluster:
read_s3_files = read_s3_files.repartition(200) # Adjust based on your cluster size
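Instead of a hard-coded 200, you can base the partition count on what Spark actually reports. This is a hedged sketch; the 2x default-parallelism factor is a common rule of thumb, not a Glue-specific requirement:
num_partitions = read_s3_files.rdd.getNumPartitions()        # how many partitions Spark created from the input
default_parallelism = spark.sparkContext.defaultParallelism  # roughly the total executor cores available
# Assumption: target ~2x the default parallelism, but never fewer partitions than we already have
read_s3_files = read_s3_files.repartition(max(num_partitions, default_parallelism * 2))
Checking getNumPartitions() also confirms whether the single-file input is what forces everything onto one task (and hence onto the driver-side bottleneck you are seeing).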