When working with previous Spark versions, I am always confused when it comes to specifying column names: should I use a string or a col object?
Example of regexp_replace from 3.1.2:
pyspark.sql.functions.regexp_replace(str, pattern, replacement)
I was running a cluster with version 3.1.2, and both work:
df1.withColumn("modality",F.regexp_replace(F.col("name"),"i","")).display()
df1.withColumn("modality",F.regexp_replace("name","i","")).display()
From the docs I would have assumed that only a string is allowed, but both work. How can I tell from the API docs whether a col object is also allowed? (In the latest API this is pretty clear, but not in the previous ones.)
1 Answer
When you click on the source button of the 3.1.2 docs, you find the source code of regexp_replace:
def regexp_replace(str, pattern, replacement):
    r"""Replace all substrings of the specified string value that match regexp with rep.

    .. versionadded:: 1.5.0

    Examples
    --------
    >>> df = spark.createDataFrame([('100-200',)], ['str'])
    >>> df.select(regexp_replace('str', r'(\d+)', '--').alias('d')).collect()
    [Row(d='-----')]
    """
    sc = SparkContext._active_spark_context
    jc = sc._jvm.functions.regexp_replace(_to_java_column(str), pattern, replacement)
    return Column(jc)
You can see that the str argument is not used directly but is wrapped in the _to_java_column function (pattern and replacement, by contrast, are passed to the JVM as-is, so in this version they must be plain Python strings). The source code of _to_java_column clearly shows that it works with both column names (strings) and Column objects:
def _to_java_column(col: "ColumnOrName") -> "JavaObject":
    if isinstance(col, Column):
        jcol = col._jc
    elif isinstance(col, str):
        jcol = _create_column_from_name(col)
    else:
        raise PySparkTypeError(
            errorClass="NOT_COLUMN_OR_STR",
            messageParameters={"arg_name": "col", "arg_type": type(col).__name__},
        )
    return jcol
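To illustrate what that dispatch means in practice, here is a minimal sketch (assuming a running SparkSession; the data frame and column name are made up for illustration):

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("india",)], ["name"])

# A string hits the isinstance(col, str) branch and a Column object
# hits the isinstance(col, Column) branch, so both calls are equivalent:
df.withColumn("modality", F.regexp_replace("name", "i", "")).show()
df.withColumn("modality", F.regexp_replace(F.col("name"), "i", "")).show()

# Anything else falls into the else branch and raises a type error,
# e.g. F.regexp_replace(42, "i", "") fails before any job is run.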
When browsing the source page of functions, you see that _to_java_column is omnipresent, which means that for most functions (or even all, but I didn't check), both column names and Column objects can be used.
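If a docs page for an older version is ambiguous, a quick alternative is to print a function's implementation straight from a Python session and look for _to_java_column yourself (a small sketch; inspect is from the standard library):

import inspect
import pyspark.sql.functions as F

# If the printed body routes an argument through _to_java_column,
# that argument accepts both a column name (str) and a Column object:
print(inspect.getsource(F.regexp_replace))

For 3.1.2 this prints exactly the definition quoted above.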