
azure blob storage - How to write parquet file of size 1GB in python databricks - Stack Overflow


I am trying to write Parquet files to ADLS Gen2 from a Python notebook in Azure Databricks.


df.write.partitionBy("date").mode("overwrite").parquet(target_file_path)


When the files are written with the date partition, I see multiple smaller files being written for each partition. I want to write files of about 1 GB so that reading them later is faster. For example, if the total size is 4 GB, I want four 1 GB files instead of twenty 200 MB files, and if the total is less than 1 GB, a single file.

One way is to estimate the size of the data and do a repartition or coalesce, but that consumes too much time. Is there any other way, or any settings that can be applied, so that the files are written directly at around 1 GB rather than as many smaller files?

I tried the settings below, but they don't work either. What else can I try?


from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("ParquetFileSizeControl")
    .config("spark.sql.files.maxPartitionBytes", str(1024 * 1024 * 1024))  # 1GB
    .config("parquet.block.size", str(1024 * 1024 * 1024))  # 1GB (optional)
    .getOrCreate()
)

df.write.partitionBy("date").mode("overwrite").parquet(target_file_path)


Asked Mar 2 at 6:03 by newbie
  • This question is similar to: How do you control the size of the output file?. If you believe it’s different, please edit the question, make it clear how it’s different and/or how the answers on that question are not helpful for your problem. – Kraigolas Commented Mar 2 at 6:27
  • Side note: Writing a "bigger file" won't necessarily make file reading faster if you are using a distributed system. Writing files "smartly" is the way to go. – Itération 122442 Commented Mar 2 at 7:25
  • Spark splits data into 200 partitions by default, creating many small files. – Dileep Raj Narayan Thumula Commented Mar 3 at 3:48
  • I didn't see that the other post, "How do you control the size of the output file?", has any solution. – newbie Commented Mar 3 at 12:20

1 Answer


To achieve 1GB file sizes, you need to control the number of output partitions. Here is how you can do it:

You can calculate the DataFrame size and determine the number of partitions:

data_size_bytes = df.rdd.map(lambda row: len(str(row))).sum()
data_size_gb = data_size_bytes / (1024 * 1024 * 1024)
num_partitions = max(1, int(data_size_gb))
print(f"Estimated Data Size: {data_size_gb:.2f} GB")
print(f"Optimal Partitions: {num_partitions}")

Results:

Estimated Data Size: 5.05 GB
Optimal Partitions: 5

The code above estimates the data size in bytes, converts it to GB, and derives the number of partitions (one partition per GB of data).
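
As a variation, here is a minimal sketch that wraps the same heuristic in a reusable helper. The function name estimate_partitions is hypothetical; note that summing string lengths approximates the uncompressed in-memory size, not the compressed Parquet size on disk, so the written files will usually come out somewhat smaller than the target.

import math

def estimate_partitions(df, target_bytes=1024 * 1024 * 1024):
    # Rough estimate: sum of the string length of every row (uncompressed size).
    est_bytes = df.rdd.map(lambda row: len(str(row))).sum()
    # One partition per ~1 GB; ceil so that e.g. 1.5 GB of data still gets 2 files.
    return max(1, math.ceil(est_bytes / target_bytes))

num_partitions = estimate_partitions(df)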

Then write the data to ADLS Gen2 while ensuring ~1 GB file sizes.

target_file_path = "abfss://<your-container>@<your-storage-account>.dfs.core.windows.net/parquet_data"

Then apply repartitioning before writing:

df.repartition(num_partitions).write \
    .partitionBy("date") \
    .mode("overwrite") \
    .parquet(target_file_path)
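
If scanning the RDD to estimate the size is too slow, a sketch of an alternative (not part of the original answer) is Spark's maxRecordsPerFile write option, which caps each output file by row count instead of bytes. The average row size below is an assumption you would measure from a sample of your own data.

# Assumption: roughly 200 bytes per row on disk; measure this from a sample of your data.
avg_row_bytes = 200
rows_per_gb = (1024 * 1024 * 1024) // avg_row_bytes

df.write \
    .option("maxRecordsPerFile", rows_per_gb) \
    .partitionBy("date") \
    .mode("overwrite") \
    .parquet(target_file_path)

Because this caps files by record count rather than bytes, sizes will track 1 GB only as closely as the row-size assumption holds, but it avoids the extra pass over the data.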