
Calculate the count of distinct values appearing in multiple tables


I have three pyspark dataframes in Databricks: raw_old, raw_new, and master_df. These are placeholders to work out the logic on a smaller scale (actual tables contain billions of rows of data). There is a column in all three called label. I want to calculate the number of labels that appear in:

  • raw_old and raw_new (the answer is 3: A789, B456, D123)
  • raw_new and master_df (the answer is 2: C456, D123)
  • raw_old and master_df (the answer is 4: A654, B987, C987, D123)
  • raw_old, raw_new, and master_df (the answer is 1: D123)

The three tables are shown below, followed by a snippet to reproduce them. How do I calculate the counts in the bullet points above?

raw_old

+---+-----+
| id|label|
+---+-----+
|  1| A987|
|  2| A654|
|  3| A789|
|  4| B321|
|  5| B456|
|  6| B987|
|  7| C321|
|  8| C654|
|  9| C987|
| 10| D123|
+---+-----+

raw_new

+---+-----+
| id|label|
+---+-----+
|  1| A123|
|  2| A456|
|  3| A789|
|  4| B123|
|  5| B456|
|  6| B789|
|  7| C123|
|  8| C456|
|  9| C789|
| 10| D123|
+---+-----+

master_df

+---+-----+
| id|label|
+---+-----+
|  1| A999|
|  2| A654|
|  3| A000|
|  4| B111|
|  5| B000|
|  6| B987|
|  7| C999|
|  8| C456|
|  9| C987|
| 10| D123|
+---+-----+
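
For reference, a minimal snippet to reproduce the sample dataframes (assuming the spark session that Databricks notebooks provide):

raw_old = spark.createDataFrame(
    [(1, "A987"), (2, "A654"), (3, "A789"), (4, "B321"), (5, "B456"),
     (6, "B987"), (7, "C321"), (8, "C654"), (9, "C987"), (10, "D123")],
    ["id", "label"])

raw_new = spark.createDataFrame(
    [(1, "A123"), (2, "A456"), (3, "A789"), (4, "B123"), (5, "B456"),
     (6, "B789"), (7, "C123"), (8, "C456"), (9, "C789"), (10, "D123")],
    ["id", "label"])

master_df = spark.createDataFrame(
    [(1, "A999"), (2, "A654"), (3, "A000"), (4, "B111"), (5, "B000"),
     (6, "B987"), (7, "C999"), (8, "C456"), (9, "C987"), (10, "D123")],
    ["id", "label"])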

asked Jan 31 at 17:48 by carousallie
  • ask deepseek to compare intersect, inner join and left_semi join for Apache Spark dataframes. – lihao, Feb 2 at 4:54

2 Answers


You can use an inner join to get the labels that two dataframes have in common:

# Inner join keeps only the rows whose label appears in both dataframes.
joined_data = raw_old.join(
    raw_new,
    on=raw_old["label"] == raw_new["label"],
    how="inner"
)

and then bring just the count back to Python, keeping all the heavy work in Spark (if a label can repeat within a table, drop duplicates on label before joining so each shared label is counted once):

print(joined_data.count())  # 3 for raw_old vs raw_new in the sample data

When joining 3 dataframes, you can join the first two and then join the resulting dataframe with the third one, for example:
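A sketch of that three-way count, using the string form on="label" (a variant of the code above) so the join key is not duplicated, and deduplicating labels first:

# Labels common to all three dataframes: join raw_old with raw_new,
# then join that result with master_df, keeping only the label column.
three_way = (
    raw_old.select("label").distinct()
    .join(raw_new.select("label").distinct(), on="label", how="inner")
    .join(master_df.select("label").distinct(), on="label", how="inner")
)
print(three_way.count())  # 1 (D123) for the sample data

A left_semi join or DataFrame.intersect, as suggested in the comments, would do the same job without carrying the right-hand columns through the join.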

I'd collect the distinct labels of each dataframe into a Python set, intersect the sets, and take the size of the result (this assumes the distinct labels are few enough to fit in driver memory):

labels_old, labels_new, labels_master = (
    {r.label for r in df.select("label").distinct().collect()}
    for df in (raw_old, raw_new, master_df))

num_shared = len(labels_old & labels_new & labels_master)
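
Applied pairwise, the same sets give all four counts from the question:

print(len(labels_old & labels_new))                  # 3: A789, B456, D123
print(len(labels_new & labels_master))               # 2: C456, D123
print(len(labels_old & labels_master))               # 4: A654, B987, C987, D123
print(len(labels_old & labels_new & labels_master))  # 1: D123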