
apache spark - I have a Hive table with multiple partitions, and in one of the partitions I have nearly 200,000 small files.


Running: hdfs dfs -du -h /user/hive/warehouse/customers/partition_dt=2025-03-27

returns: 1.0G 3.0G /user/hive/warehouse/customers/partition_dt=2025-03-27

To compact this, I tried:

1. Copying the data to another partition (partition_dt=2025-03-27+1000 days)
2. Using INSERT OVERWRITE to reload it back into partition_dt=2025-03-27

However, this approach fails when the partition has too many files. Some files are so large that compaction does not complete successfully.

How can I efficiently merge small files in this partition without causing failures?
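
Before choosing an approach, it helps to confirm how many files the partition actually holds and what their average size is, since that determines a sensible target file count. A minimal PySpark sketch using the Hadoop FileSystem API through the JVM gateway (the path follows the question; spark._jvm and spark._jsc are internal but commonly used handles):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SmallFileAudit").enableHiveSupport().getOrCreate()

# Partition directory from the question.
path = "/user/hive/warehouse/customers/partition_dt=2025-03-27"

# Reach the Hadoop FileSystem through the JVM gateway and list the partition.
jvm = spark._jvm
fs = jvm.org.apache.hadoop.fs.FileSystem.get(spark._jsc.hadoopConfiguration())
statuses = fs.listStatus(jvm.org.apache.hadoop.fs.Path(path))

sizes = [s.getLen() for s in statuses if s.isFile()]
print(f"files: {len(sizes)}, total: {sum(sizes) / 1024**3:.2f} GiB, "
      f"avg: {sum(sizes) / max(len(sizes), 1) / 1024**2:.2f} MiB")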

Solution 1: Use Hive Major Compaction (For ACID Tables)

ALTER TABLE customers PARTITION (partition_dt='2025-03-27') COMPACT 'MAJOR';
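
Hive major compaction only applies if customers is a transactional (ACID) table. A minimal PySpark check of the table properties before relying on it (assuming a Hive-enabled session):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CheckAcid").enableHiveSupport().getOrCreate()

# ACID compaction requires the table property transactional=true.
props = {r.key: r.value for r in spark.sql("SHOW TBLPROPERTIES customers").collect()}
print("transactional =", props.get("transactional", "false"))

If it is transactional, the compaction request can then be monitored from Hive with SHOW COMPACTIONS.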

Solution 2: Use INSERT OVERWRITE with Merge Settings (For Non-ACID Tables)

SET hive.merge.mapfiles=true;
SET hive.merge.size.per.task=256000000;
SET hive.merge.smallfiles.avgsize=128000000;

SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT OVERWRITE TABLE customers PARTITION (partition_dt)
SELECT * FROM customers WHERE partition_dt='2025-03-27'
DISTRIBUTE BY partition_dt;

Dynamic partitioning is used here because SELECT * includes the partition column, which would not match a fully static PARTITION (partition_dt='2025-03-27') spec.
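
Note that the hive.merge.* settings take effect only when the statement runs on Hive's own execution engine (MapReduce or Tez); Spark SQL's writer does not honor them. A minimal sketch of driving the rewrite through beeline from Python, assuming beeline is on the PATH and a hypothetical HiveServer2 URL:

import subprocess
import tempfile

# Hypothetical HiveServer2 endpoint; replace with your cluster's JDBC URL.
JDBC_URL = "jdbc:hive2://hiveserver2.example.com:10000/default"

hql = """
SET hive.merge.mapfiles=true;
SET hive.merge.size.per.task=256000000;
SET hive.merge.smallfiles.avgsize=128000000;
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE customers PARTITION (partition_dt)
SELECT * FROM customers WHERE partition_dt='2025-03-27';
"""

# Write the script to a temporary file and let HiveServer2 execute it.
with tempfile.NamedTemporaryFile("w", suffix=".hql", delete=False) as f:
    f.write(hql)
    script = f.name

subprocess.run(["beeline", "-u", JDBC_URL, "-f", script], check=True)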

Solution 3: Use CREATE TABLE AS SELECT (CTAS) and Rename

CREATE TABLE customers_tmp STORED AS ORC AS
SELECT * FROM customers WHERE partition_dt='2025-03-27';

ALTER TABLE customers DROP PARTITION (partition_dt='2025-03-27');

SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT OVERWRITE TABLE customers PARTITION (partition_dt)
SELECT * FROM customers_tmp;

DROP TABLE customers_tmp;
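
Before the ALTER TABLE ... DROP PARTITION step, it is worth verifying that customers_tmp really holds the same data as the original partition. A minimal PySpark row-count check (assuming a Hive-enabled session):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("VerifyCtas").enableHiveSupport().getOrCreate()

# Compare row counts before dropping the original partition.
src = spark.sql("SELECT COUNT(*) AS c FROM customers WHERE partition_dt='2025-03-27'").first().c
tmp = spark.sql("SELECT COUNT(*) AS c FROM customers_tmp").first().c
assert src == tmp, f"row count mismatch: source={src}, tmp={tmp}"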

Solution 4: Use HDFS distcp to Merge Small Files

hadoop fs -mkdir /tmp/customers_merged
hadoop distcp -Ddfs.replication=1 -blocksperchunk 32 \
  /user/hive/warehouse/customers/partition_dt=2025-03-27 /tmp/customers_merged

Then, load back into Hive:

LOAD DATA INPATH '/tmp/customers_merged' OVERWRITE INTO TABLE customers PARTITION (partition_dt='2025-03-27');
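
Before the LOAD DATA step, it is worth confirming that the staging directory really contains fewer, larger files. A small Python sketch using hdfs dfs -count via subprocess (the /tmp/customers_merged path comes from the distcp step above):

import subprocess

merged_dir = "/tmp/customers_merged"

# hdfs dfs -count prints: DIR_COUNT FILE_COUNT CONTENT_SIZE PATHNAME
out = subprocess.check_output(["hdfs", "dfs", "-count", merged_dir], text=True)
dir_count, file_count, content_size, path = out.split()[:4]
print(f"{path}: {file_count} files, {int(content_size) / 1024**3:.2f} GiB")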

Solution 5: Use Spark to Read and Write Larger Files

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("HiveCompaction").enableHiveSupport().getOrCreate()

# Read the single partition, then rewrite it as ~10 larger files into a staging table.
df = spark.sql("SELECT * FROM customers WHERE partition_dt='2025-03-27'")

df.coalesce(10).write.mode("overwrite").format("parquet").saveAsTable("customers_compacted")

Then, overwrite the partition in Hive:

SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT OVERWRITE TABLE customers PARTITION (partition_dt)
SELECT * FROM customers_compacted;
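
The detour through customers_compacted exists because Spark will usually refuse to overwrite a table it is reading from in the same job. The write-back can also stay in PySpark instead of switching to Hive; a minimal sketch, assuming dynamic partition overwrite is available (Spark 2.3+) and that partition_dt is the last column of the staging table:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("HiveCompactionWriteBack")
         .config("spark.sql.sources.partitionOverwriteMode", "dynamic")  # replace only the touched partition
         .enableHiveSupport()
         .getOrCreate())

# Read the compacted staging table produced in the previous step.
df = spark.table("customers_compacted")

# insertInto matches columns by position, so partition_dt must be the last column.
df.write.mode("overwrite").insertInto("customers")

# Remove the staging table once the partition has been verified.
spark.sql("DROP TABLE IF EXISTS customers_compacted")

repartition(n) could be used instead of coalesce(n) in the previous step if the output files come out skewed; it forces a shuffle but balances file sizes.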

Which method should I use for large-scale Hive table compaction?
