I am working in Azure Databricks with a large PySpark DataFrame that has 170 columns. I need to identify the best possible combination of 2-3 columns to use as the primary key, ensuring:
- Uniqueness: the selected combination must uniquely identify each row.
- Data Integrity: the combination must contain no NULLs and no duplicates.
- Performance: the approach must be efficient and scalable on large datasets.
- Business Relevance: the chosen columns should align with business logic constraints.

What I have tried: I attempted an approach using itertools.combinations() to check uniqueness for different column combinations. However, this method is slow on large datasets because every combination requires its own .distinct().count() pass over the data.
Here’s my code:
from itertools import combinations

selected_columns = [
    "Message_MessageBody_Header_",
    "Message_FileSequenceNo",
    "Message_MessageBody_ArticleInfo_BNo",
    "EnvDate",
    "EnvTime",
]  # Replace with actual column names

total_count = df.count()
print(f"Total records in DataFrame: {total_count}")

# Make sure every selected column actually exists in the DataFrame
missing_columns = [col for col in selected_columns if col not in df.columns]
if missing_columns:
    print(f"Error: The following columns are missing in the DataFrame: {missing_columns}")
else:
    print(f"Selected columns exist in the DataFrame: {selected_columns}")

found_primary_key = False
for r in range(2, len(selected_columns) + 1):
    print(f"\nChecking {r}-column combinations...")
    for combo in combinations(selected_columns, r):
        print(f"\nChecking combination: {combo}")
        # One full .distinct().count() job per combination; this is the
        # part that becomes slow on large datasets.
        distinct_count = df.select(*combo).distinct().count()
        if distinct_count == total_count:
            print(f"This combination uniquely identifies every row: {combo}")
            found_primary_key = True
            break
    if found_primary_key:
        break
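For context on the performance point, here is a minimal sketch of the direction I am considering instead: computing the distinct count of every 2- and 3-column combination in a single aggregation job rather than one .distinct().count() job per combination. It assumes the same df and selected_columns as above; the candidate_combos and agg_exprs names are just illustrative. Since countDistinct over several columns skips rows where any of those columns is NULL, a combination whose distinct count equals the total row count is both unique and NULL-free.

```python
from itertools import combinations
from pyspark.sql import functions as F

# All 2- and 3-column combinations of the candidate columns
candidate_combos = [
    combo
    for r in (2, 3)
    for combo in combinations(selected_columns, r)
]

# One aggregation that returns the total row count plus the distinct count
# of every candidate combination, instead of one Spark job per combination.
agg_exprs = [F.count(F.lit(1)).alias("total_rows")] + [
    F.countDistinct(*combo).alias("__".join(combo)) for combo in candidate_combos
]
row = df.agg(*agg_exprs).collect()[0]

# countDistinct ignores rows where any of its columns is NULL, so equality
# with the total row count implies the combination is unique and NULL-free.
for combo in candidate_combos:
    if row["__".join(combo)] == row["total_rows"]:
        print(f"Candidate primary key (unique, no NULLs): {combo}")
```

I have not benchmarked this against the loop above, so I am not sure it is the right direction, and it still does not address how to factor in business relevance when picking among several valid combinations.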