
pandas - pyspark limit() behaviour different for display and toPandas - Stack Overflow


Given a Delta table on Databricks, where I run a query like:

from pyspark.sql.functions import col

display(spark.read.table("catalog.schema.my_table_a").filter(col("xyz") == 'abc').limit(10))

this takes about 4 seconds.

When I do this transformation to pandas:

spark.read.table("catalog.schema.my_table_a").filter(col("xyz") == 'abc').limit(10).toPandas()

this takes well over 20 minutes, and the number of tasks is much larger (by more than a factor of 100) compared to the display command.

To my understanding, the limit(10) is executed on the cluster nodes, so why does it take this long when I follow it with toPandas()?

When I remove the filter, both versions perform the same (about 2 seconds).

Thanks for your help!
