apache spark - How to remove only one column, when there are multiple columns with same in name in dataframe

I'm trying to remove one column even though if there multiple columns with same name in Spark dataframe after join operation performed.

df.printSchema()
 --- Id String
 --- Name String
 --- Country String
 --- Id String

I would like to remove the second occurrence of Id column using PySpark.

I'm trying to remove one column even though if there multiple columns with same name in Spark dataframe after join operation performed.

df.printSchema()
 --- Id String
 --- Name String
 --- Country String
 --- Id String

I would like to remove the second occurrence of Id column using PySpark.

Share Improve this question edited Mar 17 at 6:34 ZygD 24.6k41 gold badges103 silver badges140 bronze badges asked Mar 11 at 13:23 N9909 2571 gold badge9 silver badges27 bronze badges

Obtain the list of column names and find the index of the column you want to drop then drop that column using the index. – JonSG Commented Mar 11 at 14:25
I did this, but its dropping both columns even-though index is different for same column name – N9909 Commented Mar 11 at 18:54

Add a comment |

2 Answers 2

Sorted by: Reset to default 0

In version 3.3.* Spark does this by default:

df1 = spark.createDataFrame([(1, "A"), (2, "B"), (3, "C")], ["id", "value1"])
df2 = spark.createDataFrame([(1, "X"), (2, "Y"), (4, "Z")], ["id", "value2"])

joined_df = df1.join(df2, on="id", how="inner")
+---+------+------+
| id|value2|value1|
+---+------+------+
|  1|     X|     A|
|  2|     Y|     B|
+---+------+------+

Although, if you are using a different older Spark version please do the following.

Get the columns with df.columns
Use Python set to retrieve unique columns
Use the unique columns in a select expression

unique_cols = set(joined_df.columns)

joined_df.select(*unique_cols).show()

This is an example where you get several columns with the same name after a join:

df1 = spark.createDataFrame([(1, 'value_df1')], ['id', 'other_col'])
df2 = spark.createDataFrame([(1, 'value_df2')], ['id', 'other_col'])
df = df1.join(df2, 'id')
df.show()
# +---+---------+---------+
# | id|other_col|other_col|
# +---+---------+---------+
# |  1|value_df1|value_df2|
# +---+---------+---------+

Removing specified columns can be done in several ways.

drop and/or select before the join:

df1.drop('other_col').join(df2.select('id', 'other_col'), 'id').show()
# +---+---------+
# | id|other_col|
# +---+---------+
# |  1|value_df2|
# +---+---------+

df1.select('id', 'other_col').join(df2.drop('other_col'), 'id').show()
# +---+---------+
# | id|other_col|
# +---+---------+
# |  1|value_df1|
# +---+---------+

drop after the join:

df = df1.join(df2, 'id')
df.drop(df2.other_col).show()
# +---+---------+
# | id|other_col|
# +---+---------+
# |  1|value_df1|
# +---+---------+

df.drop(df1.other_col).show()
# +---+---------+
# | id|other_col|
# +---+---------+
# |  1|value_df2|
# +---+---------+

select after the join:

df = df1.join(df2, 'id')
df.select('id', df1.other_col).show()
# +---+---------+
# | id|other_col|
# +---+---------+
# |  1|value_df1|
# +---+---------+

df.select('id', df2.other_col).show()
# +---+---------+
# | id|other_col|
# +---+---------+
# |  1|value_df2|
# +---+---------+

Renaming columns before the join. After renaming you can unambiguously specify which columns to delete, but this is probably more practical in cases when you need "duplicate" columns for some other future use:

df = df1.join(df2.toDF(*[c if c in {'id'} else f'df2_{c}' for c in df2.columns]), 'id')
df.show()
# +---+---------+-------------+
# | id|other_col|df2_other_col|
# +---+---------+-------------+
# |  1|value_df1|    value_df2|
# +---+---------+-------------+
df.selectExpr("id", "array(other_col, df2_other_col) as others").show(truncate=0)
# +---+----------------------+
# |id |others                |
# +---+----------------------+
# |1  |[value_df1, value_df2]|
# +---+----------------------+

科技改变生活-雨落星辰 - 所有的伟大,都源于一个勇敢的开始

apache spark - How to remove only one column, when there are multiple columns with same in name in dataframe - Stack Overflow

2 Answers 2

与本文相关的文章

评论列表(0)