I am stuck on this SAS code that I have to rewrite for SQL (PySpark specifically).
    data data2 data3;
        merge input_2(in=in2)
              input_1(in=in1);
        by col_1
           col_2;
        if in1 and in2 then do;
            new_col = 'yes';
            output data3;
        end;
        else if in1 then output data2;
    run;
For "if in1 and in2", I believe that's like a SQL inner join. But for "else if in1", this would be a left join, yes?
If so, does the order of "merge input_2 input_1" matter? Is input_2 equivalent to the "left" of a SQL left join?
Asked Jan 30 at 15:08 by ChuckChuck; edited Jan 30 at 17:14 by samkart.

1 Answer
- Yes, if in1 and in2 is an inner join.
- Yes, else if in1 is a left join.
- The order of the datasets in MERGE does not determine the "left" dataset; the IN= flags tested in the IF logic do. Because the else branch keeps rows where in1 (from input_1) is set, input_1 plays the role of the left table in SQL.
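The routing described in the points above can be emulated in plain Python (no Spark) as a minimal sketch. It assumes the BY keys are unique within each input; sas_merge_split and the sample rows are hypothetical, not taken from the question.

```python
def sas_merge_split(input_1, input_2, keys=("col_1", "col_2")):
    """Emulate the DATA step's routing for inputs with unique BY keys (assumption)."""
    by_key_1 = {tuple(row[k] for k in keys): row for row in input_1}
    by_key_2 = {tuple(row[k] for k in keys): row for row in input_2}
    data2, data3 = [], []
    for key in sorted(by_key_1.keys() | by_key_2.keys()):
        in1, in2 = key in by_key_1, key in by_key_2
        # input_1 is listed last in MERGE, so its values win on shared names
        row = {**by_key_2.get(key, {}), **by_key_1.get(key, {})}
        if in1 and in2:
            data3.append({**row, "new_col": "yes"})
        elif in1:
            # SAS would also carry input_2's variables here, set to missing
            data2.append(row)
        # keys found only in input_2 are not written anywhere
    return data2, data3


# Hypothetical sample rows
input_1 = [{"col_1": 1, "col_2": "a", "x": 10}, {"col_1": 2, "col_2": "b", "x": 20}]
input_2 = [{"col_1": 1, "col_2": "a", "y": 100}, {"col_1": 3, "col_2": "c", "y": 300}]
data2, data3 = sas_merge_split(input_1, input_2)
```

Here data3 holds the key present in both inputs, data2 holds the key found only in input_1, and the key found only in input_2 is dropped, regardless of the order in which the inputs appear in MERGE.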
You can try:
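    from pyspark.sql.functions import col, lit, when

    # Flag input_2 rows: the join keys themselves are never null after a left join,
    # so they cannot tell matched rows from unmatched ones
    in2_df = input_2_df.withColumn("in2", lit(True))
    merged_df = input_1_df.join(in2_df, on=["col_1", "col_2"], how="left")
    # Create new_col based on the SAS logic: "yes" only when input_2 matched
    # (assumes input_2 shares no non-key column names with input_1)
    result_df = merged_df.withColumn(
        "new_col",
        when(col("in2").isNotNull(), lit("yes"))
    ).select(
        *input_1_df.columns, "new_col"
    )
    # Filter into separate outputs
    data3_df = result_df.filter(col("new_col") == "yes")  # in1 and in2
    data2_df = result_df.filter(col("new_col").isNull())  # in1 only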
else in1 precludes data2 from being the left join. data2 contains the data of input_1 that is not paired to that in input_2. data2 is input_1 EXCEPT input_2 – Richard Commented Jan 30 at 17:20
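Following the comment above, one way to express both outputs directly is an inner join for data3 and a left anti join for data2. This is a minimal sketch, assuming PySpark DataFrames whose only shared column names are the keys, and keys that are unique in at least one input (a SAS MERGE does not produce the Cartesian product that a SQL join does for many-to-many keys). split_merge is a hypothetical helper name.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:  # needed for the type hints only
    from pyspark.sql import DataFrame

KEYS = ["col_1", "col_2"]


def split_merge(input_1_df: "DataFrame", input_2_df: "DataFrame"):
    """Return (data2_df, data3_df) for the DATA step in the question."""
    # in1 and in2: keys present in both inputs, plus the constant new_col
    data3_df = input_1_df.join(input_2_df, on=KEYS, how="inner").selectExpr(
        "*", "'yes' AS new_col"
    )
    # in1 only: input_1 rows whose keys have no match in input_2
    data2_df = input_1_df.join(input_2_df, on=KEYS, how="left_anti")
    return data2_df, data3_df
```

Note that data2_df carries only input_1's columns here, whereas the SAS data2 would also contain input_2's variables and new_col, all missing.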