I have two columns in a PySpark DataFrame in Databricks:
- YearWeek (2417)
- Day (0,1,2,3,4,5) - day of a week
I need to create a column with a date.
I tried to do that in SQL (date_add…) and also with strptime in Python. SQL returns wrong data, and strptime does not work with withColumn.
2 Answers
If you refer to this documentation, it states that week-based pattern recognition is no longer supported from Spark 3.0 onward, so you will need to use the legacy option. You mentioned that you can't use legacy mode but haven't said why not.
If you really cannot use the legacy option, you will need to compute the day of the year as 7 × (week - 1) + (day of week). Then use the following SQL to get your dates:
%sql
SELECT yearweek, day, to_date(CONCAT('20', LEFT(yearweek, 2), ((RIGHT(yearweek, 2)-1)*7) + day), 'yyyyD') as date from test1;
This assumes that all years are 20XX. A two-digit year does not say which century it belongs to, so you would need more information to handle years after 2099.
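One caveat worth checking: the day-of-year formula counts weeks from 1 January, which matches ISO week numbering only in years where 1 January is a Monday. The sketch below compares the two in plain Python, assuming YearWeek holds ISO week numbers and treating the day as 1-based; the helper names are illustrative and not part of the original answer.

```python
from datetime import date, timedelta


def day_of_year_date(year: int, week: int, day: int) -> date:
    """Date from the 7 * (week - 1) + day formula, counted from 1 January."""
    day_of_year = 7 * (week - 1) + day
    return date(year, 1, 1) + timedelta(days=day_of_year - 1)


def iso_week_date(year: int, week: int, iso_day: int) -> date:
    """Date from an ISO year, ISO week and ISO weekday (1 = Monday)."""
    return date.fromisocalendar(year, week, iso_day)


# 2024 starts on a Monday, so both approaches agree.
print(day_of_year_date(2024, 17, 1), iso_week_date(2024, 17, 1))
# 2025 starts on a Wednesday, so the results are two days apart.
print(day_of_year_date(2025, 10, 1), iso_week_date(2025, 10, 1))
```

If the week numbers in the data are not ISO weeks, the comparison would need to be adapted to whatever convention produced them.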
I am not yet aware of a Spark-native way to convert ISO week dates into proper dates. I would use pandas_udf, because it is more efficient than a regular udf, but it's possible both ways.
Example data:
from pyspark.sql import functions as F
df = spark.createDataFrame([('2510', '0'), ('2140', '0')], ['YearWeek', 'Day'])
df.show()
# +--------+---+
# |YearWeek|Day|
# +--------+---+
# |    2510|  0|
# |    2140|  0|
# +--------+---+
Your data contains 0, which is strange: the ISO week date format numbers days from 1 (Monday) to 7 (Sunday). You will have to make a conversion. Not knowing what zero represents in your data, I simply did this:
day = (F.col('Day') + 1).cast('int').cast('string')
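The + 1 shift above is right only if 0 stands for Monday. As a rough plain-Python sketch of the two most likely conventions (the function name and the zero_is_monday flag are hypothetical, and which convention applies depends on how the data was produced):

```python
def to_iso_weekday(day: int, zero_is_monday: bool = True) -> int:
    """Map a 0-based day-of-week value to an ISO weekday (1 = Monday .. 7 = Sunday)."""
    if not 0 <= day <= 6:
        raise ValueError(f"day must be in 0..6, got {day}")
    if zero_is_monday:
        # 0 = Monday, 1 = Tuesday, ...: shift everything by one.
        return day + 1
    # 0 = Sunday, 1 = Monday, ...: only Sunday needs to move to 7.
    return 7 if day == 0 else day


print(to_iso_weekday(0))                        # 0 treated as Monday
print(to_iso_weekday(0, zero_is_monday=False))  # 0 treated as Sunday
```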
pandas_udf
import pandas as pd
@F.pandas_udf('date')
def to_date_pudf(yw: pd.Series, day: pd.Series) -> pd.Series:
    return pd.to_datetime('20' + yw + day, format='%G%V%u')
df.withColumn('pudf', to_date_pudf('YearWeek', day)).show()
# +--------+---+----------+
# |YearWeek|Day|      pudf|
# +--------+---+----------+
# |    2510|  0|2025-03-03|
# |    2140|  0|2021-10-04|
# +--------+---+----------+
udf
from datetime import datetime
@F.udf('date')
def to_date_udf(yw, day):
    return datetime.strptime('20' + yw + day, '%G%V%u')
df.withColumn('udf', to_date_udf('YearWeek', day)).show()
# +--------+---+----------+
# |YearWeek|Day|       udf|
# +--------+---+----------+
# |    2510|  0|2025-03-03|
# |    2140|  0|2021-10-04|
# +--------+---+----------+
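Both UDFs lean on the %G%V%u parser. The same result can also be reached with plain date arithmetic, using the ISO rule that week 1 is the week containing 4 January. The following is a minimal sketch of that rule in plain Python, assuming the years are 20XX and that Day is a 0-based offset from Monday; the function name and the century parameter are hypothetical.

```python
from datetime import date, timedelta


def yearweek_to_date(yearweek: str, day: int, century: int = 2000) -> date:
    """Convert a 'YYWW' string plus a 0-based day offset (0 = Monday) to a date."""
    year = century + int(yearweek[:2])
    week = int(yearweek[2:])
    # ISO week 1 is the week that contains 4 January.
    jan4 = date(year, 1, 4)
    # Step back to the Monday of that week (date.weekday() is 0 for Monday).
    week1_monday = jan4 - timedelta(days=jan4.weekday())
    return week1_monday + timedelta(weeks=week - 1, days=day)


print(yearweek_to_date('2510', 0))  # 2025-03-03
print(yearweek_to_date('2140', 0))  # 2021-10-04
```

The same steps could presumably be written with Spark built-ins such as make_date, trunc(..., 'week') and date_add, which would avoid a UDF altogether, but that variant is not shown or tested here.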
to_date function which accepts a string & format and converts to date format – samkart Commented Mar 12 at 17:29
7*week + day of the week? – Derek O Commented Mar 13 at 1:21