I'm trying to remove one column when a Spark DataFrame contains multiple columns with the same name after a join operation.
df.printSchema()
--- Id String
--- Name String
--- Country String
--- Id String
I would like to remove the second occurrence of the Id column using PySpark.
asked Mar 11 at 13:23 by N9909, edited Mar 17 at 6:34 by ZygD
- Obtain the list of column names, find the index of the column you want to drop, then drop that column using the index. – JonSG Commented Mar 11 at 14:25
- I did this, but it's dropping both columns even though the index is different for the same column name. – N9909 Commented Mar 11 at 18:54
2 Answers
In Spark 3.3.*, joining on a column name (on="id") keeps only one copy of the join column by default:
df1 = spark.createDataFrame([(1, "A"), (2, "B"), (3, "C")], ["id", "value1"])
df2 = spark.createDataFrame([(1, "X"), (2, "Y"), (4, "Z")], ["id", "value2"])
joined_df = df1.join(df2, on="id", how="inner")
+---+------+------+
| id|value1|value2|
+---+------+------+
|  1|     A|     X|
|  2|     B|     Y|
+---+------+------+
If you are using an older Spark version, do the following instead:
- Get the columns with df.columns
- Use a Python set to retrieve the unique columns
- Use the unique columns in a select expression
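# A set alone loses column order, so sort by each name's first position
unique_cols = sorted(set(joined_df.columns), key=joined_df.columns.index)
joined_df.select(*unique_cols).show()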
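If the DataFrame really does contain two columns with the same name, as in the question's schema, selecting or dropping by name may fail with an ambiguous-reference error or affect both columns. One possible workaround, sketched below, is to rename the columns by position first and then drop the renamed duplicates. The helper names `dedupe_names` and `drop_repeated_columns` and the `__dup_` placeholder scheme are assumptions made for illustration, not part of Spark.

```python
def dedupe_names(columns):
    """Return (new_names, to_drop): repeated names get unique placeholder names."""
    seen = set()
    new_names, to_drop = [], []
    for position, name in enumerate(columns):
        if name in seen:
            # Hypothetical placeholder scheme; the position keeps it unique
            placeholder = f"__dup_{position}_{name}"
            new_names.append(placeholder)
            to_drop.append(placeholder)
        else:
            seen.add(name)
            new_names.append(name)
    return new_names, to_drop


def drop_repeated_columns(df):
    """Keep only the first occurrence of each column name in a Spark DataFrame."""
    new_names, to_drop = dedupe_names(df.columns)
    # toDF renames columns by position, so same-named columns become distinguishable
    return df.toDF(*new_names).drop(*to_drop)
```

With a DataFrame whose columns are Id, Name, Country, Id, calling `drop_repeated_columns(df)` should leave Id, Name, Country, keeping the first Id.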
This is an example where you get several columns with the same name after a join:
df1 = spark.createDataFrame([(1, 'value_df1')], ['id', 'other_col'])
df2 = spark.createDataFrame([(1, 'value_df2')], ['id', 'other_col'])
df = df1.join(df2, 'id')
df.show()
# +---+---------+---------+
# | id|other_col|other_col|
# +---+---------+---------+
# |  1|value_df1|value_df2|
# +---+---------+---------+
Removing specified columns can be done in several ways.
- drop and/or select before the join:
df1.drop('other_col').join(df2.select('id', 'other_col'), 'id').show()
# +---+---------+
# | id|other_col|
# +---+---------+
# |  1|value_df2|
# +---+---------+
df1.select('id', 'other_col').join(df2.drop('other_col'), 'id').show()
# +---+---------+
# | id|other_col|
# +---+---------+
# |  1|value_df1|
# +---+---------+
- drop after the join:
df = df1.join(df2, 'id')
df.drop(df2.other_col).show()
# +---+---------+
# | id|other_col|
# +---+---------+
# |  1|value_df1|
# +---+---------+
df.drop(df1.other_col).show()
# +---+---------+
# | id|other_col|
# +---+---------+
# |  1|value_df2|
# +---+---------+
- select after the join:
df = df1.join(df2, 'id')
df.select('id', df1.other_col).show()
# +---+---------+
# | id|other_col|
# +---+---------+
# |  1|value_df1|
# +---+---------+
df.select('id', df2.other_col).show()
# +---+---------+
# | id|other_col|
# +---+---------+
# |  1|value_df2|
# +---+---------+
- Renaming columns before the join. After renaming, you can specify unambiguously which columns to delete, but this is probably more practical when you need the "duplicate" columns for some later use:
df = df1.join(df2.toDF(*[c if c in {'id'} else f'df2_{c}' for c in df2.columns]), 'id')
df.show()
# +---+---------+-------------+
# | id|other_col|df2_other_col|
# +---+---------+-------------+
# |  1|value_df1|    value_df2|
# +---+---------+-------------+
df.selectExpr("id", "array(other_col, df2_other_col) as others").show(truncate=0)
# +---+----------------------+
# |id |others                |
# +---+----------------------+
# |1  |[value_df1, value_df2]|
# +---+----------------------+
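The renaming step can be pulled out into a small reusable helper, which may be easier to read than the inline list comprehension. This is only a sketch: `prefixed_names` and `with_prefix` are hypothetical names, and the default of keeping `id` unprefixed assumes `id` is the join key, as in the example.

```python
def prefixed_names(columns, prefix, keep=("id",)):
    """Prefix every column name except those in keep (e.g. the join keys)."""
    keep = set(keep)
    return [c if c in keep else f"{prefix}{c}" for c in columns]


def with_prefix(df, prefix, keep=("id",)):
    """Rename a Spark DataFrame's columns by position via toDF."""
    return df.toDF(*prefixed_names(df.columns, prefix, keep))
```

Under these assumptions, `df1.join(with_prefix(df2, 'df2_'), 'id')` should produce the same column names as the inline version: id, other_col, df2_other_col.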