In Python Polars, I am trying to extract the length of the lists inside a struct to re-use it in an expression.
For example, I have the code below:
import polars as pl

df = pl.DataFrame(
    {
        "x": [0, 4],
        "y": [
            {"low": [-1, 0, 1], "up": [1, 2, 3]},
            {"low": [-2, -1, 0], "up": [0, 1, 2]},
        ],
    }
)

df.with_columns(
    check=pl.concat_list(
        [
            pl.all_horizontal(
                [
                    pl.col("x").ge(pl.col("y").struct["low"].list.get(i)),
                    pl.col("x").le(pl.col("y").struct["up"].list.get(i)),
                ]
            )
            for i in range(3)
        ]
    ).list.max()
)
shape: (2, 3)
┌─────┬─────────────────────────┬───────┐
│ x   ┆ y                       ┆ check │
│ --- ┆ ---                     ┆ ---   │
│ i64 ┆ struct[2]               ┆ bool  │
╞═════╪═════════════════════════╪═══════╡
│ 0   ┆ {[-1, 0, 1],[1, 2, 3]}  ┆ true  │
│ 4   ┆ {[-2, -1, 0],[0, 1, 2]} ┆ false │
└─────┴─────────────────────────┴───────┘
and I would like to infer the length of the lists in advance (i.e. not having to hardcode the 3), as it can change depending on the call.
The challenge I am facing is that I need to keep everything in the same expression context. I tried the version below, but it does not work because I cannot extract the value returned by one of the expressions:
df.with_columns(
    check=pl.concat_list(
        [
            pl.all_horizontal(
                [
                    pl.col("x").ge(pl.col("y").struct["low"].list.get(i)),
                    pl.col("x").le(pl.col("y").struct["up"].list.get(i)),
                ]
            )
            for i in range(pl.col("y").struct["low"].list.len())
        ]
    ).list.max()
)
asked Feb 7 at 15:05 by yz_jc
2 Answers
Unfortunately, I don't see a way to use an expression for the list length here. Also, direct comparisons of list columns are not yet natively supported.
Still, some on-the-fly exploding and imploding of the list columns could be used to achieve the desired result without relying on knowing the list lengths upfront.
(
    df
    .with_columns(
        ge_low=(pl.col("x") >= pl.col("y").struct["low"].explode()).implode().over(pl.int_range(pl.len())),
        le_up=(pl.col("x") <= pl.col("y").struct["up"].explode()).implode().over(pl.int_range(pl.len())),
    )
    .with_columns(
        check=(pl.col("ge_low").explode() & pl.col("le_up").explode()).implode().over(pl.int_range(pl.len()))
    )
)
shape: (2, 5)
┌─────┬─────────────────────────┬─────────────────────┬───────────────────────┬───────────────────────┐
│ x   ┆ y                       ┆ ge_low              ┆ le_up                 ┆ check                 │
│ --- ┆ ---                     ┆ ---                 ┆ ---                   ┆ ---                   │
│ i64 ┆ struct[2]               ┆ list[bool]          ┆ list[bool]            ┆ list[bool]            │
╞═════╪═════════════════════════╪═════════════════════╪═══════════════════════╪═══════════════════════╡
│ 0   ┆ {[-1, 0, 1],[1, 2, 3]}  ┆ [true, true, false] ┆ [true, true, true]    ┆ [true, true, false]   │
│ 4   ┆ {[-2, -1, 0],[0, 1, 2]} ┆ [true, true, true]  ┆ [false, false, false] ┆ [false, false, false] │
└─────┴─────────────────────────┴─────────────────────┴───────────────────────┴───────────────────────┘
import polars as pl

df = pl.DataFrame(
    {
        "x": [0, 4, 2],
        "y": [
            {"low": [-1, 0, 1], "up": [1, 2, 3]},
            {"low": [-2, -1], "up": [0, 1]},
            {"low": [-3, -2, -1, 0], "up": [1, 2, 3, 4]},
        ],
    }
)

# 1. Get the length of the "low" lists:
df = df.with_columns(low_len=pl.col("y").struct["low"].list.len())

# 2. Get the length of the "up" lists:
df = df.with_columns(up_len=pl.col("y").struct["up"].list.len())
print(df)  # Print the DataFrame with low_len and up_len

# 3. If you want the row-wise maximum of the two
#    (pl.max_horizontal compares across columns; pl.max aggregates down a column):
df = df.with_columns(
    max_len=pl.max_horizontal(
        pl.col("y").struct["low"].list.len(),
        pl.col("y").struct["up"].list.len(),
    )
)
print(df)  # Print the DataFrame with max_len

# 4. Replace null lengths (e.g. from null lists) with 0.
#    Note: a struct field that does not exist raises an error rather than yielding null.
df = df.with_columns(
    low_len=pl.col("y").struct.field("low").list.len().fill_null(0),
    up_len=pl.col("y").struct.field("up").list.len().fill_null(0),
)
print(df)  # Print the DataFrame with low_len and up_len, with nulls replaced by 0