
pyarrow - Arrow data in pyspark? - Stack Overflow


The doc page at .html# says that Spark uses Arrow to convert data for use in pandas UDFs. However, the data inside the UDF is NOT in Arrow format. What have I done wrong?

Here I have used exactly the example from the docs, but added a line of logging inside the UDF so that I can see the types:

import pandas as pd  # needed for the pd.Series / pd.DataFrame annotations below

from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = (
    SparkSession.builder
    .config("spark.sql.execution.arrow.enabled", "true")  # legacy name, Spark < 3.0.0
    .config("spark.sql.execution.pythonUDF.arrow.enabled", "true")
    .config("spark.sql.execution.arrow.pyspark.enabled", "true")
    .appName("test").getOrCreate()
)
df = spark.createDataFrame(
    [[1, "a string", ("a nested string",)]],
    "long_col long, string_col string, struct_col struct<col1:string>")

@pandas_udf("col1 string, col2 long")  # type: ignore[call-overload]
def func(s1: pd.Series, s2: pd.Series, s3: pd.DataFrame) -> pd.DataFrame:
    print(s1.dtype, s2.dtype, s3.dtypes, file=open("out", "w"))
    s3['col2'] = 1
    return s3

df.select(func("long_col", "string_col", "struct_col")).collect()

The file "out" shows that both the plain string column and the struct column (which arrives as a DataFrame) were passed to pandas with object dtype, not as Arrow-backed data. Why?

Versions: pyspark 3.5.1, pyarrow 15.0.1, Python 3.10, pandas 2.2.3, OpenJDK 20
