I have three PySpark dataframes in Databricks: raw_old, raw_new, and master_df. These are placeholders to work out the logic on a smaller scale (the actual tables contain billions of rows of data). There is a column in all three called label. I want to calculate the number of labels that appear in:

- raw_old and raw_new (the answer is 3: A789, B456, D123)
- raw_new and master_df (the answer is 2: C456, D123)
- raw_old and master_df (the answer is 4: A654, B987, C987, D123)
- raw_old, raw_new, and master_df (the answer is 1: D123)

The three tables are below. How do I calculate the above bullet points?
raw_old
+---+-----+
| id|label|
+---+-----+
| 1| A987|
| 2| A654|
| 3| A789|
| 4| B321|
| 5| B456|
| 6| B987|
| 7| C321|
| 8| C654|
| 9| C987|
| 10| D123|
+---+-----+
raw_new
+---+-----+
| id|label|
+---+-----+
| 1| A123|
| 2| A456|
| 3| A789|
| 4| B123|
| 5| B456|
| 6| B789|
| 7| C123|
| 8| C456|
| 9| C789|
| 10| D123|
+---+-----+
master_df
+---+-----+
| id|label|
+---+-----+
| 1| A999|
| 2| A654|
| 3| A000|
| 4| B111|
| 5| B000|
| 6| B987|
| 7| C999|
| 8| C456|
| 9| C987|
| 10| D123|
+---+-----+
- ask deepseek to compare intersect, inner join and left_semi join for Apache Spark dataframes. – lihao
2 Answers
You should use an inner join to get the elements in common between the datasets:
joined_data = raw_old.join(
    raw_new,
    on="label",   # joining on the column name keeps a single label column in the result
    how="inner",
)
and then you can collect the result back to Python, keeping all the heavy work in Spark
print(joined_data.count())
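One caveat: an inner join multiplies matching rows, so if label can repeat within a table (plausible at billions of rows), the count above overstates the number of distinct shared labels. A sketch of the deduplicated pairwise count, assuming the question's dataframe names:

shared_old_new = (
    raw_old.select("label").distinct()           # keep only the join key, deduplicated
           .join(raw_new.select("label").distinct(), on="label", how="inner")
)
print(shared_old_new.count())  # 3 (A789, B456, D123)

Projecting down to just the label column before the join also keeps the shuffle small.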
When joining 3 dataframes, you can do the first two and join the resulting dataframe with the third one, for example:
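A minimal sketch of that three-way count, again using the question's dataframe names:

three_way = (
    raw_old.select("label").distinct()
           .join(raw_new.select("label").distinct(), on="label", how="inner")
           .join(master_df.select("label").distinct(), on="label", how="inner")
)
print(three_way.count())  # 1 (D123)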
I'd create a set from the labels of one dataframe, intersect it with the others, and get the size of the resulting set. A PySpark Column isn't iterable, so the labels have to be collected to the driver first (fine for these placeholder tables, less so for billions of rows):

def labels(df):  # pull the label column back to the driver as a Python set
    return {row.label for row in df.select("label").collect()}

num_shared = len(labels(raw_old).intersection(labels(raw_new), labels(master_df)))
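If the set logic should stay inside Spark instead of on the driver, DataFrame.intersect computes the intersection without collecting anything; a sketch with the question's dataframes:

shared = (raw_old.select("label")
          .intersect(raw_new.select("label"))
          .intersect(master_df.select("label")))
print(shared.count())  # 1 (D123)

Note that intersect deduplicates by default (use intersectAll to keep duplicate rows), so the count is the number of distinct shared labels.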