When working with previous Spark versions, I am always confused when it comes to specifying column names: should I use a string or a col object?
Example of regexp_replace from 3.1.2:
pyspark.sql.functions.regexp_replace(str, pattern, replacement)
I was running a cluster with version 3.1.2, and both work:
df1.withColumn("modality",F.regexp_replace(F.col("name"),"i","")).display()
df1.withColumn("modality",F.regexp_replace("name","i","")).display()
From the docs I would have assumed that only a string is allowed, but both work. How can I tell from the API docs whether a col object is also allowed? (In the latest API this is pretty clear, but not in the previous ones.)
1 Answer
When you click on the source button of the 3.1.2 docs, you find the source code of regexp_replace:
def regexp_replace(str, pattern, replacement):
    r"""Replace all substrings of the specified string value that match regexp with rep.

    .. versionadded:: 1.5.0

    Examples
    --------
    >>> df = spark.createDataFrame([('100-200',)], ['str'])
    >>> df.select(regexp_replace('str', r'(\d+)', '--').alias('d')).collect()
    [Row(d='-----')]
    """
    sc = SparkContext._active_spark_context
    jc = sc._jvm.functions.regexp_replace(_to_java_column(str), pattern, replacement)
    return Column(jc)
You can see that the str argument is not used directly but is wrapped in the _to_java_column function (pattern and replacement, by contrast, are passed to the JVM as-is, so in this version they must be plain Python strings). The source code of _to_java_column clearly shows that it works with both column names (strings) and Column objects:
def _to_java_column(col: "ColumnOrName") -> "JavaObject":
    if isinstance(col, Column):
        jcol = col._jc
    elif isinstance(col, str):
        jcol = _create_column_from_name(col)
    else:
        raise PySparkTypeError(
            errorClass="NOT_COLUMN_OR_STR",
            messageParameters={"arg_name": "col", "arg_type": type(col).__name__},
        )
    return jcol
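To illustrate what that dispatch means in practice, here is a minimal sketch (assuming a running SparkSession; the data frame and column name are made up for illustration):

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("india",)], ["name"])

# A string hits the isinstance(col, str) branch and a Column object
# hits the isinstance(col, Column) branch, so both calls are equivalent:
df.withColumn("modality", F.regexp_replace("name", "i", "")).show()
df.withColumn("modality", F.regexp_replace(F.col("name"), "i", "")).show()

# Anything else falls into the else branch and raises a type error,
# e.g. F.regexp_replace(42, "i", "") fails before any job is run.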
When browsing the source page of functions, you see that _to_java_column is omnipresent, which means that for most functions (or even all, but I didn't check), both column names and Column objects can be used.
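If a docs page for an older version is ambiguous, a quick alternative is to print a function's implementation straight from a Python session and look for _to_java_column yourself (a small sketch; inspect is from the standard library):

import inspect
import pyspark.sql.functions as F

# If the printed body routes an argument through _to_java_column,
# that argument accepts both a column name (str) and a Column object:
print(inspect.getsource(F.regexp_replace))

For 3.1.2 this prints exactly the definition quoted above.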