Given a Delta table on Databricks where I run a query like:
from pyspark.sql.functions import col
display(spark.read.table("catalog.schema.my_table_a").filter(col("xyz") == "abc").limit(10))
this takes about 4 seconds.
When I convert the same query to pandas:
spark.read.table("catalog.schema.my_table_a").filter(col("xyz") == "abc").limit(10).toPandas()
this takes well over 20 minutes, and the number of Spark tasks is more than 100 times higher than with the display command.
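In case it helps with diagnosing this, here is a minimal sketch of how I can inspect the plan for the slow path (same table and column names as above; the explain() call is only something I added for debugging, it is not part of the timings):

from pyspark.sql.functions import col

query = (
    spark.read.table("catalog.schema.my_table_a")
    .filter(col("xyz") == "abc")
    .limit(10)
)

# Show where the filter pushdown and the LocalLimit/GlobalLimit end up
# before the toPandas() conversion is triggered
query.explain(mode="formatted")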
To my understanding the limit(10) is executed on the cluster nodes, so why does the query take this long as soon as I append toPandas()?
When I remove the filter, both variants perform the same (about 2 seconds).
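For completeness, this is the no-filter variant I compared against (same table, just without the condition on xyz):

# Both of these finish in roughly 2 seconds
display(spark.read.table("catalog.schema.my_table_a").limit(10))
spark.read.table("catalog.schema.my_table_a").limit(10).toPandas()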
Thanks for your help!