What is the preferred way to assign/add a new column to a polars dataframe in .select() or .with_columns()?
Are there any differences between the column assignments below, which use .alias() or the = sign?
import polars as pl
df = pl.DataFrame({"A": [1, 2, 3],
"B": [1, 1, 7]})
df = df.with_columns(pl.col("A").sum().alias("a_sum"),
another_sum=pl.col("A").sum()
)
I am not sure which one to use.
Asked Nov 18, 2024 at 17:34 by mouwsy (edited Nov 18, 2024 at 17:38)
2 Answers
The advantage of alias is that it lets you specify a column name that would not be a valid Python identifier, for example "a sum!". The same can be achieved by building a dictionary and unpacking it with **, which passes its items as keyword arguments.
Plain assignment with = cannot do this, because the keyword has to be a valid identifier (e.g., another_sum).
df = df.with_columns(pl.col("A").sum().alias("a sum!"),
another_sum=pl.col("A").sum(),
**{":) \u2014 also a sum": pl.col("A").sum()}
)
Output:
shape: (3, 5)
┌─────┬─────┬────────┬─────────────┬─────────────────┐
│ A ┆ B ┆ a sum! ┆ another_sum ┆ :) — also a sum │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪════════╪═════════════╪═════════════════╡
│ 1 ┆ 1 ┆ 6 ┆ 6 ┆ 6 │
│ 2 ┆ 1 ┆ 6 ┆ 6 ┆ 6 │
│ 3 ┆ 7 ┆ 6 ┆ 6 ┆ 6 │
└─────┴─────┴────────┴─────────────┴─────────────────┘
The latter (the = form) just calls alias for you under the hood:
https://github.com/pola-rs/polars/blob/a0ec630b25aa847699f9c2d7389fee84749a6491/py-polars/polars/_utils/parse/expr.py#L136-L140
So there is no advantage to either one. If you find = more readable, use that.