
apache spark - How does one differentiate dataframes from connect-dataframes in Pyspark? - Stack Overflow


I realized that some PySpark objects may change type when used in Databricks.
For instance:

from pyspark.sql import Column, DataFrame, SparkSession

isinstance(spark, SparkSession)
isinstance(a_df, DataFrame)
isinstance(a_col, Column) 

These checks aren't reliable, because their result depends on how the Spark connection was established.

So even something as straightforward as a_col = Column('table-column') can make the isinstance(a_col, Column) check above return False when working in a Databricks context.

In fact, you'll find that the types of the objects above are:

type(spark) > pyspark.sql.connect.session.SparkSession 
type(a_df) > pyspark.sql.connect.dataframe.DataFrame
type(a_col) > pyspark.sql.connect.column.Column

So, in order to accept either one, I did something like:

from pyspark.sql import Column, DataFrame, SparkSession
from pyspark.sql.connect import column, dataframe, session

isinstance(spark, (SparkSession, session.SparkSession))
isinstance(a_df, (DataFrame, dataframe.DataFrame))
isinstance(a_col, (Column, column.Column))

which either makes the code uglier, or needs extra tooling to combine the plain PySpark classes with the Databricks Connect classes.
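
Concretely, the combined check ends up looking something like the sketch below (my own rough helper; the guarded import is only there to cope with environments where the connect modules aren't installed at all):

from pyspark.sql import Column, DataFrame, SparkSession

try:
    from pyspark.sql.connect.column import Column as ConnectColumn
    from pyspark.sql.connect.dataframe import DataFrame as ConnectDataFrame
    from pyspark.sql.connect.session import SparkSession as ConnectSparkSession
except ImportError:  # plain PySpark without Spark Connect available
    ConnectColumn = ConnectDataFrame = ConnectSparkSession = None

# Tuples of accepted classes, usable directly with isinstance()
COLUMN_TYPES = tuple(t for t in (Column, ConnectColumn) if t is not None)
DATAFRAME_TYPES = tuple(t for t in (DataFrame, ConnectDataFrame) if t is not None)
SESSION_TYPES = tuple(t for t in (SparkSession, ConnectSparkSession) if t is not None)

isinstance(spark, SESSION_TYPES)
isinstance(a_df, DATAFRAME_TYPES)
isinstance(a_col, COLUMN_TYPES)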

So I have a tactical question and a strategic question:

  1. Is there a simpler way to check these types that doesn't depend on which kind of Spark session is in use?
  2. How does Databricks justify such a change to the user-facing interface when working with Spark objects?

Thank you for your help.


asked Feb 4 at 18:27 by Diego-MX, edited Feb 6 at 22:21 by Ged

1 Answer


I find this frustrating as well when testing against different Spark environments. It's no panacea, but using a type alias will make things tidier, especially if you define it in a package. I keep it in a package called "bricklayer" that is shared across many projects.

# bricklayer.types
from typing import TypeAlias, Union

import pyspark.sql
from pyspark.sql.connect import dataframe

# One alias covering both the classic and the Spark Connect DataFrame classes
DataFrame: TypeAlias = Union[pyspark.sql.DataFrame, dataframe.DataFrame]

Now it's easy to use when running tests.

import bricklayer

def test_something(spark):
    df = spark.range(3)  # any DataFrame produced by the session fixture will do
    assert isinstance(df, bricklayer.types.DataFrame)
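
The same alias also works for type annotations in code that has to accept either flavour of DataFrame. A small illustrative example of my own (not part of the snippet above):

import bricklayer
from pyspark.sql import functions as F

def add_audit_column(df: bricklayer.types.DataFrame) -> bricklayer.types.DataFrame:
    # Behaves the same whether df is a classic or a Connect DataFrame
    return df.withColumn("loaded_at", F.current_timestamp())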

Now that I've written it all out, I guess it doesn't save that much time, but it does abstract some of the complexity. I have also added utilities for creating or retrieving a Spark session and a dbutils reference that work when running vanilla PySpark locally, over Databricks Connect, or on a Databricks cluster.
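
For what it's worth, the session helper is roughly along these lines (a minimal sketch that assumes Databricks Connect v2, i.e. the databricks-connect package for DBR 13+; the names are illustrative rather than the real API of my package):

# bricklayer.session (illustrative sketch)
def get_spark():
    """Return a usable SparkSession locally, over Databricks Connect, or on a cluster."""
    try:
        # Databricks Connect v2 ships its own session builder
        from databricks.connect import DatabricksSession
        return DatabricksSession.builder.getOrCreate()
    except ImportError:
        # Vanilla PySpark locally, or the session already running on a Databricks cluster
        from pyspark.sql import SparkSession
        return SparkSession.builder.getOrCreate()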

As for your second question - I have no answer. This implementation seems clunky.
