
python - Best Practices for Selecting Primary Key Combinations from Multiple Columns - Stack Overflow


I am working in Azure Databricks with a large PySpark DataFrame that has 170 columns. I need to identify the best possible combination of 2-3 columns to use as the primary key, ensuring:

- Uniqueness: the selected combination must uniquely identify each row.
- Data Integrity: the combination should contain no NULLs and no duplicates.
- Performance: the approach must be efficient and scalable on large datasets.
- Business Relevance: the chosen columns should align with business logic constraints.

What I have tried: I attempted an approach using itertools.combinations() to check uniqueness for each column combination. However, this method is slow on large datasets because it runs a separate .distinct().count() job per combination.
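Stripped of Spark, the check these requirements describe reduces to a small amount of plain Python. This is only an in-memory sketch for reasoning about the logic (`find_candidate_keys` and the sample rows are hypothetical names, not part of the actual pipeline); it does not scale to a large DataFrame:

```python
from itertools import combinations

def find_candidate_keys(rows, columns, max_width=3):
    """Return every 2..max_width column combination whose values contain
    no NULLs (None) and uniquely identify each row."""
    n = len(rows)
    keys = []
    for r in range(2, max_width + 1):
        for combo in combinations(columns, r):
            values = [tuple(row[c] for c in combo) for row in rows]
            if any(None in v for v in values):
                continue  # data-integrity requirement: no NULLs in the key
            if len(set(values)) == n:  # uniqueness requirement
                keys.append(combo)
    return keys
```

A combination passes only if its value tuples are all non-NULL and pairwise distinct, which is exactly what the PySpark version has to establish at scale.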

Here’s my code:

from itertools import combinations

selected_columns = [
    "Message_MessageBody_Header_",
    "Message_FileSequenceNo",
    "Message_MessageBody_ArticleInfo_BNo",
    "EnvDate",
    "EnvTime"
]  # Replace with actual column names

total_count = df.count()
print(f"Total records in DataFrame: {total_count}")

missing_columns = [col for col in selected_columns if col not in df.columns]
if missing_columns:
    print(f"Error: The following columns are missing in the DataFrame: {missing_columns}")
else:
    print(f"Selected columns exist in the DataFrame: {selected_columns}")

found_primary_key = False
for r in range(2, len(selected_columns) + 1):
    print(f"\nChecking {r}-column combinations...")

    for combo in combinations(selected_columns, r):
        print(f"\nChecking combination: {combo}")
        # Slow on large data: each combination triggers a full distinct-count job
        distinct_count = df.select(*combo).distinct().count()
        if distinct_count == total_count:
            print(f"Found a unique combination: {combo}")
            found_primary_key = True
            break
    if found_primary_key:
        break