I am working in Azure Databricks with a large PySpark DataFrame that has 170 columns. I need to identify the best possible combination of 2-3 columns to use as the primary key, ensuring:
- Uniqueness: the selected combination must uniquely identify each row.
- Data Integrity: the combination must contain no NULLs and no duplicates.
- Performance: the approach must be efficient and scalable on large datasets.
- Business Relevance: the chosen columns should align with business logic constraints.

What I have tried: I attempted an approach using itertools.combinations() to check uniqueness for different column combinations. However, this method is slow on large datasets because every combination requires its own .distinct().count() pass over the data.
Here’s my code:
from itertools import combinations

selected_columns = [
    "Message_MessageBody_Header_",
    "Message_FileSequenceNo",
    "Message_MessageBody_ArticleInfo_BNo",
    "EnvDate",
    "EnvTime",
]  # Replace with actual column names

total_count = df.count()
print(f"Total records in DataFrame: {total_count}")

# Make sure every selected column actually exists in the DataFrame
missing_columns = [col for col in selected_columns if col not in df.columns]
if missing_columns:
    print(f"Error: The following columns are missing in the DataFrame: {missing_columns}")
else:
    print(f"Selected columns exist in the DataFrame: {selected_columns}")

found_primary_key = False
for r in range(2, len(selected_columns) + 1):
    print(f"\nChecking {r}-column combinations...")
    for combo in combinations(selected_columns, r):
        print(f"\nChecking combination: {combo}")
        # One full .distinct().count() job per combination; this is the
        # part that becomes slow on large datasets.
        distinct_count = df.select(*combo).distinct().count()
        if distinct_count == total_count:
            print(f"This combination uniquely identifies every row: {combo}")
            found_primary_key = True
            break
    if found_primary_key:
        break
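For context on the performance point, here is a minimal sketch of the direction I am considering instead: computing the distinct count of every 2- and 3-column combination in a single aggregation job rather than one .distinct().count() job per combination. It assumes the same df and selected_columns as above; the candidate_combos and agg_exprs names are just illustrative. Since countDistinct over several columns skips rows where any of those columns is NULL, a combination whose distinct count equals the total row count is both unique and NULL-free.

```python
from itertools import combinations
from pyspark.sql import functions as F

# All 2- and 3-column combinations of the candidate columns
candidate_combos = [
    combo
    for r in (2, 3)
    for combo in combinations(selected_columns, r)
]

# One aggregation that returns the total row count plus the distinct count
# of every candidate combination, instead of one Spark job per combination.
agg_exprs = [F.count(F.lit(1)).alias("total_rows")] + [
    F.countDistinct(*combo).alias("__".join(combo)) for combo in candidate_combos
]
row = df.agg(*agg_exprs).collect()[0]

# countDistinct ignores rows where any of its columns is NULL, so equality
# with the total row count implies the combination is unique and NULL-free.
for combo in candidate_combos:
    if row["__".join(combo)] == row["total_rows"]:
        print(f"Candidate primary key (unique, no NULLs): {combo}")
```

I have not benchmarked this against the loop above, so I am not sure it is the right direction, and it still does not address how to factor in business relevance when picking among several valid combinations.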