join - What is the equivalent SQL for this SAS merge? - Stack Overflow

I am stuck on this SAS code that I have to rewrite for SQL (PySpark specifically).

data data2 data3;
merge input_2(in=in2) 
      input_1(in=in1);
by col_1
   col_2;

    if in1 and in2 then do;
        new_col = 'yes';
        output data3; 
    end;
    else if in1 then output data2; 
run;

For "if in1 and in2", I believe that's like a SQL inner join. But for "else if in1", this would be a left join, yes?

If so, does the order of "merge input_2 input_1" matter? Is input_2 equivalent to the "left" of a SQL left join?

Asked Jan 30 at 15:08 by Chuck; edited Jan 30 at 17:14 by samkart
  • The else in "else if in1" precludes data2 from being the left join. data2 contains the rows of input_1 that are not paired with a row in input_2; in other words, data2 is input_1 EXCEPT input_2. – Richard, commented Jan 30 at 17:20

1 Answer

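  • Yes, "if in1 and in2" corresponds to an inner join.
  • Not quite for "else if in1": because of the else, that branch only catches rows of input_1 that have no match in input_2. So data2 is a left anti join (how="left_anti" in PySpark), not a full left join.
  • The order in the MERGE statement does not determine the "left" dataset; the in= flags do. Both branches require in1, so input_1 plays the role of the left table in SQL. The order only matters for same-named non-key columns, where the dataset listed last (input_1 here) overwrites the earlier one.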

You can try:

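from pyspark.sql.functions import col, lit

# Tag input_2 rows so a match can be detected after the left join.
# The join keys come from input_1, so they are never null on unmatched rows
# and cannot be used for this check. "in2" is an assumed helper column name.
input_2_flagged = input_2_df.withColumn("in2", lit(True))

merged_df = input_1_df.join(input_2_flagged, on=["col_1", "col_2"], how="left")

# data3: rows present in both inputs (in1 and in2), tagged new_col = 'yes'
data3_df = (
    merged_df.filter(col("in2").isNotNull())
    .drop("in2")
    .withColumn("new_col", lit("yes"))
)

# data2: rows of input_1 with no match in input_2 (else if in1);
# new_col stays null, as it is never assigned on that branch in SAS
data2_df = (
    merged_df.filter(col("in2").isNull())
    .drop("in2")
    .withColumn("new_col", lit(None).cast("string"))
)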
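
If you want to sanity-check the mapping on toy data without Spark, the in1/in2 routing can be sketched in plain Python. This is only an illustration: it assumes each (col_1, col_2) key appears at most once per input (SAS MERGE behaves differently from a SQL join on many-to-many BY groups), and the helper name sas_merge_split and the sample columns x and y are made up. For brevity, data2 here carries only input_1's columns, whereas SAS would also write input_2's variables and new_col as missing.

```python
def sas_merge_split(input_1, input_2, keys=("col_1", "col_2")):
    """Mimic the DATA step's in1/in2 routing for one-to-one BY groups."""

    def key_of(row):
        return tuple(row[k] for k in keys)

    rows_1 = {key_of(r): r for r in input_1}
    rows_2 = {key_of(r): r for r in input_2}

    data2, data3 = [], []
    for key in sorted(rows_1.keys() | rows_2.keys()):
        in1, in2 = key in rows_1, key in rows_2
        if in1 and in2:
            # input_1 is listed last in MERGE, so its values win on shared columns
            data3.append({**rows_2[key], **rows_1[key], "new_col": "yes"})
        elif in1:
            # only in input_1: this is the "else if in1" branch
            data2.append(dict(rows_1[key]))
        # keys found only in input_2 are never written to any output

    return data2, data3


# Hypothetical sample rows
input_1 = [
    {"col_1": 1, "col_2": "a", "x": 10},
    {"col_1": 2, "col_2": "b", "x": 20},
]
input_2 = [
    {"col_1": 1, "col_2": "a", "y": 100},
    {"col_1": 3, "col_2": "c", "y": 300},
]

data2, data3 = sas_merge_split(input_1, input_2)
print("data3:", data3)  # the matched key (1, 'a')
print("data2:", data2)  # the input_1-only key (2, 'b')
```

The key (3, 'c') exists only in input_2 and ends up in neither output, which is why data2 behaves like an anti join rather than a left join.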
