I realized that the types of some PySpark objects may change when used in Databricks.
For instance:
from pyspark.sql import Column, DataFrame, SparkSession
isinstance(spark, SparkSession)
isinstance(a_df, DataFrame)
isinstance(a_col, Column)
These checks aren't reliable, because the result depends on the kind of Spark connection.
So even something as straightforward as
a_col = Column('table-column')
may make the above isinstance(a_col, Column) evaluate to False when working in a Databricks context.
And in fact, you'll find that the previous objects' types are:
type(spark)   # -> pyspark.sql.connect.session.SparkSession
type(a_df)    # -> pyspark.sql.connect.dataframe.DataFrame
type(a_col)   # -> pyspark.sql.connect.column.Column
So to handle either case I did something like:
from pyspark.sql import Column, DataFrame, SparkSession
from pyspark.sql.connect import column, dataframe, session

isinstance(spark, (SparkSession, session.SparkSession))
isinstance(a_df, (DataFrame, dataframe.DataFrame))
isinstance(a_col, (Column, column.Column))
which either clutters the code or requires extra helpers to combine the plain PySpark classes with the Spark Connect classes (for example, something like the sketch below).
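Just to illustrate what I mean by "extra helpers", a minimal sketch (assuming pyspark is installed with the Spark Connect dependencies so that pyspark.sql.connect imports; the tuple constants and helper name are purely illustrative):

from pyspark.sql import Column, DataFrame, SparkSession
from pyspark.sql.connect import column, dataframe, session

# One tuple per concept, covering both the classic and the Connect class.
SESSION_TYPES = (SparkSession, session.SparkSession)
DATAFRAME_TYPES = (DataFrame, dataframe.DataFrame)
COLUMN_TYPES = (Column, column.Column)

def is_dataframe(obj) -> bool:
    # True for a DataFrame from either a classic or a Spark Connect session.
    return isinstance(obj, DATAFRAME_TYPES)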
So I have a tactical question and a strategic question:
- Is there a simpler way to check these types that doesn't depend on the kind of Spark connection?
- How does Databricks justify such a change to the user-facing interface for Spark objects?
Thank you for your help.
1 Answer
I find this frustrating as well when testing against different Spark environments. It's no panacea, but using a type alias will make things tidier, especially if you define it in a package. I have it in a package called "bricklayer" that is shared across many projects.
# bricklayer.types
import pyspark.sql
from pyspark.sql.connect import dataframe
from typing import TypeAlias, Union

# One alias that covers both the classic and the Spark Connect DataFrame classes.
DataFrame: TypeAlias = Union[pyspark.sql.DataFrame, dataframe.DataFrame]
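The same pattern extends to Column and SparkSession if you need those checks too; for instance (a sketch along the same lines, not part of the snippet above):

# bricklayer.types (continued) -- same idea for Column and SparkSession
import pyspark.sql
from pyspark.sql.connect import column, session
from typing import TypeAlias, Union

Column: TypeAlias = Union[pyspark.sql.Column, column.Column]
SparkSession: TypeAlias = Union[pyspark.sql.SparkSession, session.SparkSession]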
Now it's easy to use when running tests.
import bricklayer.types

def test_something(spark):
    df = spark.range(1)  # whatever the test constructs
    assert isinstance(df, bricklayer.types.DataFrame)
Now that I've written it all out, I guess it doesn't save that much time, but it does abstract some of the complexity. I have also added utilities for creating or retrieving a Spark session and a dbutils reference that work when running vanilla PySpark locally, via Databricks Connect, or on a Databricks cluster; a rough sketch of the session part is below.
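A minimal sketch of what such a session helper might look like, assuming Databricks Connect v2's DatabricksSession is available when connecting remotely (the helper name and fallback logic are illustrative, not the actual bricklayer code):

# bricklayer.session -- sketch only
def get_spark():
    """Return a Spark session that works locally, over Databricks Connect, or on a cluster."""
    try:
        # Databricks Connect (and recent Databricks Runtimes) provide DatabricksSession.
        from databricks.connect import DatabricksSession
        return DatabricksSession.builder.getOrCreate()
    except ImportError:
        # Fall back to vanilla PySpark when databricks-connect isn't installed.
        from pyspark.sql import SparkSession
        return SparkSession.builder.getOrCreate()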
As for your second question - I have no answer. This implementation seems clunky.