Running:
hdfs dfs -du -h /user/hive/warehouse/customers/partition_dt=2025-03-27
returns the following (logical size, then space consumed including replication):
1.0G 3.0G /user/hive/warehouse/customers/partition_dt=2025-03-27
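The number of files making up the partition can be checked with:
hdfs dfs -count /user/hive/warehouse/customers/partition_dt=2025-03-27
(the second column of the output is the file count)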
To compact this, I tried:
1. Copying the data to another partition (partition_dt=2025-03-27+1000 days).
2. Using INSERT OVERWRITE to reload it back into partition_dt=2025-03-27.
However, this approach fails when the partition contains too many files, and some files are so large that the compaction never completes successfully.
How can I efficiently merge small files in this partition without causing failures?
Solution 1: Use Hive Major Compaction (For ACID Tables)
ALTER TABLE customers PARTITION (partition_dt='2025-03-27') COMPACT 'MAJOR';
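Compaction is queued and runs asynchronously in the background; its progress can be checked with:
SHOW COMPACTIONS;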
Solution 2: Use INSERT OVERWRITE with Merge Settings (For Non-ACID Tables)
SET hive.merge.mapfiles=true;        -- merge small files from map-only jobs
SET hive.merge.mapredfiles=true;     -- merge small files from map-reduce jobs
SET hive.merge.tezfiles=true;        -- same, when running on the Tez engine
SET hive.merge.size.per.task=256000000;       -- target size (bytes) of merged files
SET hive.merge.smallfiles.avgsize=128000000;  -- merge when avg file size falls below this
-- SELECT * also returns the partition column, so use a dynamic-partition insert
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT OVERWRITE TABLE customers PARTITION (partition_dt)
SELECT * FROM customers
WHERE partition_dt='2025-03-27'
DISTRIBUTE BY partition_dt;
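After the overwrite completes, the merged files can be inspected with:
hdfs dfs -ls -h /user/hive/warehouse/customers/partition_dt=2025-03-27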
Solution 3: Use CREATE TABLE AS SELECT (CTAS) and Reload
CREATE TABLE customers_tmp STORED AS ORC AS SELECT * FROM customers WHERE partition_dt='2025-03-27';
ALTER TABLE customers DROP PARTITION (partition_dt='2025-03-27');
-- customers_tmp retains the partition_dt column, so use a dynamic-partition insert
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE customers PARTITION (partition_dt)
SELECT * FROM customers_tmp;
DROP TABLE customers_tmp;
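As a sanity check before the final DROP TABLE, the two row counts should match:
SELECT COUNT(*) FROM customers WHERE partition_dt='2025-03-27';
SELECT COUNT(*) FROM customers_tmp;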
Solution 4: Use HDFS distcp to Re-Copy and Reload the Partition
Note that distcp does not concatenate small files; it parallelizes the copy. The -blocksperchunk option splits very large files into chunks that transfer in parallel, which helps when individual files are too big for a single copy task. Copy to a destination that does not yet exist so the files land directly under it:
hadoop distcp -Ddfs.replication=1 -blocksperchunk 32 \
  /user/hive/warehouse/customers/partition_dt=2025-03-27 \
  /tmp/customers_merged
Then, load back into Hive:
LOAD DATA INPATH '/tmp/customers_merged' OVERWRITE INTO TABLE customers PARTITION (partition_dt='2025-03-27');
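LOAD DATA moves the files into the table's warehouse directory; the partition registration can then be confirmed with:
SHOW PARTITIONS customers;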
Solution 5: Use Spark to Read and Write Larger Files
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("HiveCompaction").enableHiveSupport().getOrCreate()

# Read the partition and rewrite it as 10 larger files in a staging table
df = spark.sql("SELECT * FROM customers WHERE partition_dt='2025-03-27'")
df.coalesce(10).write.mode("overwrite").format("parquet").saveAsTable("customers_compacted")
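Rather than hard-coding coalesce(10), the output file count can be derived from the partition size. A minimal sketch, assuming a ~256 MB target file size and the ~3 GB partition from the question:

import math

partition_bytes = 3 * 1024**3       # ~3 GB, from the du output above
target_file_bytes = 256 * 1024**2   # aim for ~256 MB per output file
num_files = max(1, math.ceil(partition_bytes / target_file_bytes))  # = 12

# repartition() shuffles but yields evenly sized files;
# coalesce() avoids the shuffle but can leave skewed file sizes
df_even = df.repartition(num_files)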
Then, overwrite the partition in Hive (customers_compacted still contains the partition_dt column, so use a dynamic-partition insert):
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE customers PARTITION (partition_dt)
SELECT * FROM customers_compacted;
Which method should you use for large-scale compaction? For transactional (ACID) tables, Solution 1 is the built-in mechanism and the safest choice. For non-ACID tables, Solution 2 is the simplest because it stays entirely inside Hive, while Solution 5 scales best for very large partitions since you control the output file count explicitly. Solutions 3 and 4 are mainly useful when you need a staging copy of the data anyway.