I have three PySpark dataframes in Databricks: raw_old, raw_new, and master_df. These are placeholders to work out the logic on a smaller scale (the actual tables contain billions of rows of data). There is a column in all three called label. I want to calculate the number of labels that appear in:

- raw_old and raw_new (the answer is 3: A789, B456, D123)
- raw_new and master_df (the answer is 2: C456, D123)
- raw_old and master_df (the answer is 4: A654, B987, C987, D123)
- raw_old, raw_new, and master_df (the answer is 1: D123)

The three tables are below. How do I calculate the above bullet points?
raw_old
+---+-----+
| id|label|
+---+-----+
| 1| A987|
| 2| A654|
| 3| A789|
| 4| B321|
| 5| B456|
| 6| B987|
| 7| C321|
| 8| C654|
| 9| C987|
| 10| D123|
+---+-----+
raw_new
+---+-----+
| id|label|
+---+-----+
| 1| A123|
| 2| A456|
| 3| A789|
| 4| B123|
| 5| B456|
| 6| B789|
| 7| C123|
| 8| C456|
| 9| C789|
| 10| D123|
+---+-----+
master_df
+---+-----+
| id|label|
+---+-----+
| 1| A999|
| 2| A654|
| 3| A000|
| 4| B111|
| 5| B000|
| 6| B987|
| 7| C999|
| 8| C456|
| 9| C987|
| 10| D123|
+---+-----+
- ask deepseek to compare intersect, inner join and left_semi join for Apache Spark dataframes. – lihao
2 Answers
You should use an inner join to get the elements in common between the datasets:
joined_data = raw_old.join(
    raw_new,
    on="label",   # joining on the column name keeps a single label column in the result
    how="inner",
)
and then you can collect the result back to Python, keeping all the heavy work in Spark
print(joined_data.count())
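One caveat: an inner join multiplies matching rows, so if label can repeat within a table (plausible at billions of rows), the count above overstates the number of distinct shared labels. A sketch of the deduplicated pairwise count, assuming the question's dataframe names:

shared_old_new = (
    raw_old.select("label").distinct()           # keep only the join key, deduplicated
           .join(raw_new.select("label").distinct(), on="label", how="inner")
)
print(shared_old_new.count())  # 3 (A789, B456, D123)

Projecting down to just the label column before the join also keeps the shuffle small.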
When joining 3 dataframes, you can do the first two and join the resulting dataframe with the third one, for example:
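A minimal sketch of that three-way count, again using the question's dataframe names:

three_way = (
    raw_old.select("label").distinct()
           .join(raw_new.select("label").distinct(), on="label", how="inner")
           .join(master_df.select("label").distinct(), on="label", how="inner")
)
print(three_way.count())  # 1 (D123)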
I'd create a set from the labels of one dataframe, intersect it with the others, and get the size of the resulting set. A PySpark Column isn't iterable, so the labels have to be collected to the driver first (fine for these placeholder tables, less so for billions of rows):

def labels(df):  # pull the label column back to the driver as a Python set
    return {row.label for row in df.select("label").collect()}

num_shared = len(labels(raw_old).intersection(labels(raw_new), labels(master_df)))
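If the set logic should stay inside Spark instead of on the driver, DataFrame.intersect computes the intersection without collecting anything; a sketch with the question's dataframes:

shared = (raw_old.select("label")
          .intersect(raw_new.select("label"))
          .intersect(master_df.select("label")))
print(shared.count())  # 1 (D123)

Note that intersect deduplicates by default (use intersectAll to keep duplicate rows), so the count is the number of distinct shared labels.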