
python - Update the column value if the Id in the dataframe is in the list - Stack Overflow


I am working on a very large dataset that has over 800,000 records, and I'm trying to update a column value based on the ID column: if the ID is in a given list, the column should be updated. A simplified example:

# Imports
from pyspark.sql import SparkSession
spark = SparkSession.builder \
                    .appName('SparkByExamples') \
                    .getOrCreate()

data = [('James','Smith','M','N'), ('Anna','Rose','F','N'),
  ('Robert','Williams','M','N')
]
columns = ["firstname","lastname","gender","deleted"]
df = spark.createDataFrame(data=data, schema = columns)
df.show()

# Output
+---------+--------+------+-------+
|firstname|lastname|gender|deleted|
+---------+--------+------+-------+
|    James|   Smith|     M|      N|
|     Anna|    Rose|     F|      N|
|   Robert|Williams|     M|      N|
+---------+--------+------+-------+

I also have a list of names for which I want to update the value of the deleted column.

deleted_list = ['James', 'Robert']

Currently, I'm using code like this:

from pyspark.sql.functions import col, when

df = df.withColumn("deleted_new", when(col("firstname").isin(deleted_list), "Y").otherwise("N"))
df = df.drop('deleted').withColumnRenamed('deleted_new', 'deleted')
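
(As an aside, the same isin/when logic can be written in one step by overwriting the deleted column directly, which avoids the extra drop/rename; this is only a tidier form of the snippet above, not a different approach:)

# Same logic, one step: overwrite "deleted" in place
from pyspark.sql.functions import col, when

df = df.withColumn("deleted", when(col("firstname").isin(deleted_list), "Y").otherwise("N"))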

However, it is taking a really long time, as my data frame and list are huge. Is there a PySpark way to make this faster?

# Expected output
+---------+--------+------+-------+
|firstname|lastname|gender|deleted|
+---------+--------+------+-------+
|    James|   Smith|     M|      Y|
|     Anna|    Rose|     F|      N|
|   Robert|Williams|     M|      Y|
+---------+--------+------+-------+

++ Well, the list is probably close to a million entries or more. Also, for the 800K rows of sample data I'm testing with it takes about 10 minutes, but eventually it could be millions of records that I will have to process...


asked Mar 29 at 22:04 by Sarah, edited Mar 30 at 18:12
  • When you said your list is huge, what is the length of the list? – Jonathan Commented Mar 30 at 3:53
  • If it's a really big list, you might be able to turn your list into another PySpark DataFrame and join it with df – Derek O Commented Mar 30 at 15:45
  • 800K is like pennies for Spark, and the operation you're performing is quite basic. How long is your "long time"? And what are you expecting it to be? – samkart Commented Mar 30 at 15:54

1 Answer


You could try parallelizing deleted_list into its own DataFrame and performing a left join:

from pyspark.sql import functions as F

to_delete_df = spark.createDataFrame([(n, 'Y') for n in deleted_list], ['firstname', 'deleted_new'])

df.join(to_delete_df, ['firstname'], 'left')\
  .withColumn("deleted", F.coalesce(F.col("deleted_new"), F.col("deleted")))\
  .drop("deleted_new").show()
+---------+--------+------+-------+
|firstname|lastname|gender|deleted|
+---------+--------+------+-------+
|    James|   Smith|     M|      Y|
|     Anna|    Rose|     F|      N|
|   Robert|Williams|     M|      Y|
+---------+--------+------+-------+

By using isin on a very large list, you generate a very heavy execution plan, and Spark generally has trouble with that.
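
If the lookup list fits comfortably in memory on the executors (even a million short strings usually does), a broadcast hint on the small DataFrame may help further, since it lets Spark avoid shuffling the large table for the join. This is only a sketch building on the answer above; whether the broadcast actually pays off depends on your cluster's memory and the real size of deleted_list:

from pyspark.sql import functions as F

# Small lookup table built from the Python list
to_delete_df = spark.createDataFrame([(n, 'Y') for n in deleted_list], ['firstname', 'deleted_new'])

# Broadcast hint: ship the lookup table to every executor so the large df
# is joined locally instead of being shuffled
result = (
    df.join(F.broadcast(to_delete_df), ['firstname'], 'left')
      .withColumn('deleted', F.coalesce(F.col('deleted_new'), F.col('deleted')))
      .drop('deleted_new')
)
result.show()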
