
apache spark - How to remove only one column, when there are multiple columns with same in name in dataframe - Stack Overflow


I'm trying to remove one column when there are multiple columns with the same name in a Spark DataFrame after a join operation.

df.printSchema()
root
 |-- Id: string
 |-- Name: string
 |-- Country: string
 |-- Id: string

I would like to remove the second occurrence of Id column using PySpark.


asked Mar 11 at 13:23 by N9909 · edited Mar 17 at 6:34 by ZygD · 2 comments
  • Obtain the list of column names, find the index of the column you want to drop, then drop that column using the index. – JonSG Commented Mar 11 at 14:25
  • I did this, but it's dropping both columns even though the index is different for the same column name. – N9909 Commented Mar 11 at 18:54

2 Answers


In Spark 3.3.x, joining on a column name deduplicates the join key by default:

df1 = spark.createDataFrame([(1, "A"), (2, "B"), (3, "C")], ["id", "value1"])
df2 = spark.createDataFrame([(1, "X"), (2, "Y"), (4, "Z")], ["id", "value2"])

joined_df = df1.join(df2, on="id", how="inner")
joined_df.show()
+---+------+------+
| id|value1|value2|
+---+------+------+
|  1|     A|     X|
|  2|     B|     Y|
+---+------+------+

If you are using an older Spark version, do the following instead.

  1. Get the columns with df.columns
  2. Use a Python set to keep the unique column names
  3. Pass the unique columns to a select expression
unique_cols = set(joined_df.columns)

joined_df.select(*unique_cols).show()
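One caveat on step 2: a Python `set` does not preserve order, so the selected columns may come out shuffled. `dict.fromkeys` gives an order-preserving deduplication in plain Python, no Spark needed:

```python
cols = ["id", "value1", "value2", "value2"]

# dict keys keep the first occurrence and preserve insertion order (Python 3.7+)
unique_cols = list(dict.fromkeys(cols))
print(unique_cols)  # ['id', 'value1', 'value2']
```

Then `joined_df.select(*unique_cols)` keeps the original column order. Note that if two *different* columns share a name (as in the question's schema), selecting that name once is still ambiguous; the renaming approaches in the next answer handle that case.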

This is an example where you get several columns with the same name after a join:

df1 = spark.createDataFrame([(1, 'value_df1')], ['id', 'other_col'])
df2 = spark.createDataFrame([(1, 'value_df2')], ['id', 'other_col'])
df = df1.join(df2, 'id')
df.show()
# +---+---------+---------+
# | id|other_col|other_col|
# +---+---------+---------+
# |  1|value_df1|value_df2|
# +---+---------+---------+

Removing specified columns can be done in several ways.

  • drop and/or select before the join:

    df1.drop('other_col').join(df2.select('id', 'other_col'), 'id').show()
    # +---+---------+
    # | id|other_col|
    # +---+---------+
    # |  1|value_df2|
    # +---+---------+
    
    df1.select('id', 'other_col').join(df2.drop('other_col'), 'id').show()
    # +---+---------+
    # | id|other_col|
    # +---+---------+
    # |  1|value_df1|
    # +---+---------+
    
  • drop after the join:

    df = df1.join(df2, 'id')
    df.drop(df2.other_col).show()
    # +---+---------+
    # | id|other_col|
    # +---+---------+
    # |  1|value_df1|
    # +---+---------+
    
    df.drop(df1.other_col).show()
    # +---+---------+
    # | id|other_col|
    # +---+---------+
    # |  1|value_df2|
    # +---+---------+
    
  • select after the join:

    df = df1.join(df2, 'id')
    df.select('id', df1.other_col).show()
    # +---+---------+
    # | id|other_col|
    # +---+---------+
    # |  1|value_df1|
    # +---+---------+
    
    df.select('id', df2.other_col).show()
    # +---+---------+
    # | id|other_col|
    # +---+---------+
    # |  1|value_df2|
    # +---+---------+
    
  • Renaming columns before the join. After renaming, you can unambiguously specify which columns to drop; this is most practical when you need the "duplicate" columns for some later use:

    df = df1.join(df2.toDF(*[c if c in {'id'} else f'df2_{c}' for c in df2.columns]), 'id')
    df.show()
    # +---+---------+-------------+
    # | id|other_col|df2_other_col|
    # +---+---------+-------------+
    # |  1|value_df1|    value_df2|
    # +---+---------+-------------+
    df.selectExpr("id", "array(other_col, df2_other_col) as others").show(truncate=0)
    # +---+----------------------+
    # |id |others                |
    # +---+----------------------+
    # |1  |[value_df1, value_df2]|
    # +---+----------------------+
    
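The rename-before-join trick above can be factored into a small name-computing helper (`prefixed_names` is an illustrative name, not a Spark API):

```python
def prefixed_names(columns, prefix, keep=()):
    """Return new column names: those in `keep` stay as-is, the rest get a prefix."""
    return [c if c in keep else f"{prefix}_{c}" for c in columns]

print(prefixed_names(["id", "other_col"], "df2", keep={"id"}))
# ['id', 'df2_other_col']

# With Spark (sketch):
# df = df1.join(df2.toDF(*prefixed_names(df2.columns, "df2", keep={"id"})), "id")
```

Keeping the name computation in a plain function makes it easy to reuse for any number of joined DataFrames.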
