In Python Polars, I have a dataframe like the below:
df = pl.DataFrame(
{"sets": [[1, 2, 3], [1, 2], [9, 10]], "optional_members": [[1, 2, 3], [1, 2], [9, 0]]}
)
shape: (2, 2)
┌───────────┬──────────────────┐
│ sets ┆ optional_members │
│ --- ┆ --- │
│ list[i64] ┆ list[i64] │
╞═══════════╪══════════════════╡
│ [1, 4, 3] ┆ [1, 2, 3] │
│ [1, 0] ┆ [1, 2] │
└───────────┴──────────────────┘
I would like to build an expression that flags which elements of the first column are in the second, keeping the shape of the former, i.e.:
shape: (2, 3)
┌───────────┬──────────────────┬─────────────────────┐
│ sets ┆ optional_members ┆ result │
│ --- ┆ --- ┆ --- │
│ list[i64] ┆ list[i64] ┆ list[bool] │
╞═══════════╪══════════════════╪═════════════════════╡
│ [1, 4, 3] ┆ [1, 2, 3] ┆ [true, false, true] │
│ [1, 0] ┆ [1, 2] ┆ [true, false] │
└───────────┴──────────────────┴─────────────────────┘
I have tried using eval over the first list, something like:
func = lambda x, y: y.list.contains(x)
df.with_columns(
    contains=pl.col("optional_members").list.eval(
        func(pl.element(), pl.col("optional_members"))
    )
)
But the pl.col() expression cannot be used inside an eval.
How could we address this while keeping the solution in a single expression?
Thanks to @roman's comment, a point needs to be made: the check should be done regardless of position.
asked Nov 20, 2024 at 15:09 by yz_jc, edited Nov 20, 2024 at 16:14
3 Answers
If you need to compare elements at the same position:
df.with_columns(
(pl.col.sets.list.explode() == pl.col.optional_members.list.explode())
.implode()
.over(pl.int_range(pl.len()))
.alias("result")
)
shape: (3, 3)
┌───────────┬──────────────────┬────────────────────┐
│ sets ┆ optional_members ┆ result │
│ --- ┆ --- ┆ --- │
│ list[i64] ┆ list[i64] ┆ list[bool] │
╞═══════════╪══════════════════╪════════════════════╡
│ [1, 2, 3] ┆ [1, 2, 3] ┆ [true, true, true] │
│ [1, 2] ┆ [1, 2] ┆ [true, true] │
│ [9, 10] ┆ [9, 0] ┆ [true, false] │
└───────────┴──────────────────┴────────────────────┘
If you need to compare elements regardless of position then it's a bit more complicated:
df.with_columns(
pl.col.sets.explode().is_in(pl.col.optional_members.explode())
.implode()
.over(pl.int_range(pl.len()))
.alias("result")
)
shape: (3, 3)
┌───────────┬──────────────────┬────────────────────┐
│ sets ┆ optional_members ┆ result │
│ --- ┆ --- ┆ --- │
│ list[i64] ┆ list[i64] ┆ list[bool] │
╞═══════════╪══════════════════╪════════════════════╡
│ [1, 2, 3] ┆ [1, 2, 3] ┆ [true, true, true] │
│ [1, 2] ┆ [1, 2] ┆ [true, true] │
│ [9, 10] ┆ [9, 0] ┆ [true, false] │
└───────────┴──────────────────┴────────────────────┘
If your lists are not very long or if all the lists are the same length, you can also try to use pl.Expr.list.get().
m = df.select(pl.col.sets.list.len().max()).item()
df.with_columns(
pl.concat_list(
pl.col.optional_members.list.contains(pl.col.sets.list.get(i, null_on_oob=True))
for i in range(m)
).list.head(pl.col.sets.list.len())
.alias("result")
)
shape: (3, 3)
┌───────────┬──────────────────┬────────────────────┐
│ sets ┆ optional_members ┆ result │
│ --- ┆ --- ┆ --- │
│ list[i64] ┆ list[i64] ┆ list[bool] │
╞═══════════╪══════════════════╪════════════════════╡
│ [1, 2, 3] ┆ [1, 2, 3] ┆ [true, true, true] │
│ [1, 2] ┆ [1, 2] ┆ [true, true] │
│ [9, 10] ┆ [9, 0] ┆ [true, false] │
└───────────┴──────────────────┴────────────────────┘
One possibility would be to explode, compare, then group_by.agg:
(df.with_row_index()  # to be able to group back to original rows
   .explode(['sets', 'optional_members'])
   .with_columns(pl.col('sets').eq(pl.col('optional_members')).alias('result'))
   .group_by('index', maintain_order=True)  # keep the original row order
   .agg(pl.col(['sets', 'optional_members', 'result']))
   .drop('index')
)
Another option could be to compute the difference of values and cast to booleans:
df.with_columns(pl.col('sets').sub(pl.col('optional_members'))
.cast(pl.List(pl.Boolean)).alias('result')
)
This would invert the result, but you could tweak it using:
df.with_columns(pl.lit(1).sub(
pl.col('sets').sub(pl.col('optional_members'))
.cast(pl.List(pl.Boolean))
).cast(pl.List(pl.Boolean)).alias('result')
)
Output:
┌───────────┬──────────────────┬─────────────────────┐
│ sets ┆ optional_members ┆ result │
│ --- ┆ --- ┆ --- │
│ list[i64] ┆ list[i64] ┆ list[bool] │
╞═══════════╪══════════════════╪═════════════════════╡
│ [1, 4, 3] ┆ [1, 2, 3] ┆ [true, false, true] │
│ [1, 0] ┆ [1, 2] ┆ [true, false] │
│ [9, 10] ┆ [9, 0] ┆ [true, false] │
└───────────┴──────────────────┴─────────────────────┘
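To make the arithmetic behind the subtract-and-cast trick explicit, here is a plain-Python mirror of the same steps for a single row. The helper name `positional_match` is hypothetical and only for illustration; it is not a Polars API.

```python
def positional_match(sets, optional_members):
    """Plain-Python mirror of the subtract-and-cast trick (illustrative helper)."""
    # Element-wise difference: zero exactly where the two values are equal.
    diffs = [a - b for a, b in zip(sets, optional_members)]
    # Casting to boolean maps 0 -> False and anything else -> True,
    # so these flags mark the mismatches.
    mismatches = [bool(d) for d in diffs]
    # 1 - flag is 1 for a match and 0 for a mismatch; casting again inverts.
    return [bool(1 - m) for m in mismatches]


rows = [([1, 4, 3], [1, 2, 3]), ([1, 0], [1, 2]), ([9, 10], [9, 0])]
flags = [positional_match(s, o) for s, o in rows]
print(flags)
```

Note that this only makes sense for numeric lists of equal length, and it compares position by position: `positional_match([2, 1], [1, 2])` gives `[False, False]`, so on its own it does not cover the position-independent check the question asks for.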
Based on some of the feedback, and taking into account the guideline on structs, I think I found an interim solution that works regardless of the dtype (and the order), although it involves map_elements:
df = pl.DataFrame(
{"sets": [[2, 1, 3], [2, 3], [9, 10]], "optional_members": [[1, 2, 3], [1, 2], [9, 0]]}
)
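def test(m, n):
    # For each element of m, check whether it occurs anywhere in n
    return [element in n for element in m]
df.with_columns(
    contains=pl.struct(["sets", "optional_members"]).map_elements(
        lambda x: test(x["sets"], x["optional_members"]),
        return_dtype=pl.List(pl.Boolean),
    )
)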
┌───────────┬──────────────────┬────────────────────┐
│ sets ┆ optional_members ┆ contains │
│ --- ┆ --- ┆ --- │
│ list[i64] ┆ list[i64] ┆ list[bool] │
╞═══════════╪══════════════════╪════════════════════╡
│ [2, 1, 3] ┆ [1, 2, 3] ┆ [true, true, true] │
│ [2, 3] ┆ [1, 2] ┆ [true, false] │
│ [9, 10] ┆ [9, 0] ┆ [true, false] │
└───────────┴──────────────────┴────────────────────┘
… [1,2] and second column contains [2,1], do you want to have [true, true] or [false, false]? – roman Commented Nov 20, 2024 at 15:47
… .list.set_intersection() to actually get the intersection. stackoverflow/a/79182194 may be relevant depending on what you need to do next with the bools. – jqurious Commented Nov 20, 2024 at 16:51