The doc page at .html# says that Spark uses Arrow to convert data for use in pandas UDFs. However, the data within the UDF is NOT in Arrow format. What have I done wrong?
Here I have used exactly the example from the doc, but added a line of logging inside the UDF so that I can see the dtypes:
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = (
    SparkSession.builder
    .config("spark.sql.execution.arrow.enabled", "true")  # pre-3.0.0 name of the setting below
    .config("spark.sql.execution.pythonUDF.arrow.enabled", "true")
    .config("spark.sql.execution.arrow.pyspark.enabled", "true")
    .appName("test")
    .getOrCreate()
)

df = spark.createDataFrame(
    [[1, "a string", ("a nested string",)]],
    "long_col long, string_col string, struct_col struct<col1:string>",
)

@pandas_udf("col1 string, col2 long")  # type: ignore[call-overload]
def func(s1: pd.Series, s2: pd.Series, s3: pd.DataFrame) -> pd.DataFrame:
    # Log the dtypes that pandas reports inside the UDF.
    print(s1.dtype, s2.dtype, s3.dtypes, file=open("out", "w"))
    s3["col2"] = 1
    return s3

df.select(func("long_col", "string_col", "struct_col")).collect()
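For reference, the contents of "out" are roughly the following (s3.dtypes is a pandas Series, hence the trailing repr line; exact formatting may vary by version):

int64 object col1    object
dtype: object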
The file "out" makes clear that both the simple string column and the struct column (which becomes a dataframe) were passed to pandas as object-type, not as arrow-backed data. Why?
Versions: pyspark 3.5.1, pyarrow 15.0.1, Python 3.10, pandas 2.2.3, OpenJDK 20