
PySpark - Convert yearWeek and Day into date


I have two columns in PySpark dataframe in Databricks:

  1. YearWeek (2417)
  2. Day (0,1,2,3,4,5) - day of a week

I need to create a column with a date. I tried doing this in SQL (date_add…) and with strptime in Python, but the SQL returns wrong data and strptime does not work with withColumn.


  • see to_date function which accepts a string & format and converts to date format – samkart Commented Mar 12 at 17:29
  • There is no week function; it only exists in legacy mode, which I can't use – inspiredd Commented Mar 12 at 17:41
  • could you calculate the day of the year using 7*week + day of the week? – Derek O Commented Mar 13 at 1:21

2 Answers


If you refer to this documentation, it states that week-based pattern recognition is no longer supported after Spark 3.0, so you will need to use the legacy option. You mentioned that you can't use legacy mode but haven't said why not.
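For reference, a minimal sketch of enabling the legacy parser (assuming a standard Spark 3.x / Databricks session named spark):

# Sketch: switch back to the pre-3.0 datetime parser so week-based
# patterns (e.g. 'w') are accepted again; applies to the whole session.
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")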

If you really cannot use the legacy option, you will need to compute the day of the year as 7 × (week - 1) + (day of week), then use the following SQL to get your dates:

%sql
SELECT yearweek, day, to_date(CONCAT('20', LEFT(yearweek, 2), ((RIGHT(yearweek, 2)-1)*7) + day), 'yyyyD') as date from test1;

This assumes that all the years are 20XX; you will need more information if the data can go beyond 2099.
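If you prefer to stay in the DataFrame API, the same expression could be applied through F.expr; a minimal sketch, assuming a DataFrame df with the same string columns yearweek and day used in the SQL above:

from pyspark.sql import functions as F

# Sketch: same day-of-year construction as the SQL above, applied via expr().
# Assumes `df` has string columns `yearweek` (e.g. '2417') and `day` (0-6).
df = df.withColumn(
    "date",
    F.expr("to_date(CONCAT('20', LEFT(yearweek, 2), ((RIGHT(yearweek, 2) - 1) * 7) + day), 'yyyyD')"),
)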

I am not yet aware of a Spark-native way to convert ISO week dates into proper dates. I would use pandas_udf because it is more efficient than a regular udf, but it is possible both ways.

Example data:

from pyspark.sql import functions as F

df = spark.createDataFrame([('2510', '0'), ('2140', '0')], ['YearWeek', 'Day'])
df.show()
# +--------+---+
# |YearWeek|Day|
# +--------+---+
# |    2510|  0|
# |    2140|  0|
# +--------+---+

Your data contains 0, which is strange: the ISO week date format goes from 1 (Monday) to 7 (Sunday), so you will have to make a conversion. Not knowing what zero represents in your data, I simply did this:

day = (F.col('Day') + 1).cast('int').cast('string')

pandas_udf

import pandas as pd

@F.pandas_udf('date')
def to_date_pudf(yw: pd.Series, day: pd.Series) -> pd.Series:
    return pd.to_datetime('20' + yw + day, format='%G%V%u')

df.withColumn('pudf', to_date_pudf('YearWeek', day)).show()
# +--------+---+----------+
# |YearWeek|Day|      pudf|
# +--------+---+----------+
# |    2510|  0|2025-03-03|
# |    2140|  0|2021-10-04|
# +--------+---+----------+

udf

from datetime import datetime

@F.udf('date')
def to_date_udf(yw, day):
    return datetime.strptime('20' + yw + day, '%G%V%u')

df.withColumn('udf', to_date_udf('YearWeek', day)).show()
# +--------+---+----------+
# |YearWeek|Day|       udf|
# +--------+---+----------+
# |    2510|  0|2025-03-03|
# |    2140|  0|2021-10-04|
# +--------+---+----------+