Polars-Python: Compare list columns - Stack Overflow

In Python Polars, I have a dataframe like the below:

import polars as pl

df = pl.DataFrame(
    {"sets": [[1, 2, 3], [1, 2], [9, 10]], "optional_members": [[1, 2, 3], [1, 2], [9, 0]]}
)

shape: (2, 2)
┌───────────┬──────────────────┐
│ sets      ┆ optional_members │
│ ---       ┆ ---              │
│ list[i64] ┆ list[i64]        │
╞═══════════╪══════════════════╡
│ [1, 4, 3] ┆ [1, 2, 3]        │
│ [1, 0]    ┆ [1, 2]           │
└───────────┴──────────────────┘

I would like to build an expression that flags, for each element of the first column, whether it is in the second, keeping the shape of the former, i.e.:


shape: (2, 3)
┌───────────┬──────────────────┬─────────────────────┐
│ sets      ┆ optional_members ┆ result              │
│ ---       ┆ ---              ┆ ---                 │
│ list[i64] ┆ list[i64]        ┆ list[bool]          │
╞═══════════╪══════════════════╪═════════════════════╡
│ [1, 4, 3] ┆ [1, 2, 3]        ┆ [true, false, true] │
│ [1, 0]    ┆ [1, 2]           ┆ [true, false]       │
└───────────┴──────────────────┴─────────────────────┘

I have tried using eval over the first list, something like:


func = lambda x, y: y.list.contains(x)

df.with_columns(contains=
                pl.col("optional_members")
                .list.
                eval(func(pl.element(), pl.col("optional_members"))))

But the pl.col() expression cannot be in an eval.

How could we address this while keeping the solution in a single expression?

Thanks to @roman's comment, one point needs to be made: the check should be done regardless of position.
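To pin down the expected semantics before reaching for Polars expressions, here is a plain-Python sketch of the position-independent check. The helper name membership_mask is made up for illustration and is not part of Polars; it only serves as a reference to compare candidate expressions against.

```python
def membership_mask(values, allowed):
    """For each item in `values`, report whether it occurs anywhere in `allowed`."""
    allowed_set = set(allowed)  # a set ignores the order of `allowed`
    return [v in allowed_set for v in values]


# Same data as the DataFrame in the question, as plain Python lists.
sets = [[1, 2, 3], [1, 2], [9, 10]]
optional_members = [[1, 2, 3], [1, 2], [9, 0]]

# One boolean list per row, with the same length as the row in `sets`.
result = [membership_mask(s, o) for s, o in zip(sets, optional_members)]
```

With this definition, a row holding [1, 2] checked against [2, 1] gives two true flags, which is the behaviour asked for in the comments.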

asked Nov 20, 2024 at 15:09 by yz_jc, edited Nov 20, 2024 at 16:14
  • In your example the same elements have the same position, but do you need to check regardless of position? For example, if the first column contains [1,2] and the second column contains [2,1], do you want [true, true] or [false, false]? – roman Commented Nov 20, 2024 at 15:47
  • This is an amazing point. Indeed, I do need to check regardless of the position. I will add it to the description of the issue. – yz_jc Commented Nov 20, 2024 at 16:13
  • There is .list.set_intersection() to actually get the intersection. stackoverflow/a/79182194 may be relevant depending on what you need to do next with the bools. – jqurious Commented Nov 20, 2024 at 16:51

3 Answers

If you need to compare elements on the same position:

df.with_columns(
    (pl.col.sets.list.explode() == pl.col.optional_members.list.explode())
    .implode()
    .over(pl.int_range(pl.len()))
    .alias("result")
)
shape: (3, 3)
┌───────────┬──────────────────┬────────────────────┐
│ sets      ┆ optional_members ┆ result             │
│ ---       ┆ ---              ┆ ---                │
│ list[i64] ┆ list[i64]        ┆ list[bool]         │
╞═══════════╪══════════════════╪════════════════════╡
│ [1, 2, 3] ┆ [1, 2, 3]        ┆ [true, true, true] │
│ [1, 2]    ┆ [1, 2]           ┆ [true, true]       │
│ [9, 10]   ┆ [9, 0]           ┆ [true, false]      │
└───────────┴──────────────────┴────────────────────┘

If you need to compare elements regardless of position then it's a bit more complicated:

df.with_columns(
    pl.col.sets.explode().is_in(pl.col.optional_members.explode())
    .implode()
    .over(pl.int_range(pl.len()))
    .alias("result")
)
shape: (3, 3)
┌───────────┬──────────────────┬────────────────────┐
│ sets      ┆ optional_members ┆ result             │
│ ---       ┆ ---              ┆ ---                │
│ list[i64] ┆ list[i64]        ┆ list[bool]         │
╞═══════════╪══════════════════╪════════════════════╡
│ [1, 2, 3] ┆ [1, 2, 3]        ┆ [true, true, true] │
│ [1, 2]    ┆ [1, 2]           ┆ [true, true]       │
│ [9, 10]   ┆ [9, 0]           ┆ [true, false]      │
└───────────┴──────────────────┴────────────────────┘

If your lists are not very long or if all the lists are the same length, you can also try pl.Expr.list.get():

m = df.select(pl.col.sets.list.len().max()).item()

df.with_columns(
    pl.concat_list(
        pl.col.optional_members.list.contains(pl.col.sets.list.get(i, null_on_oob=True))
        for i in range(m)
    ).list.head(pl.col.sets.list.len())
    .alias("result")
)
shape: (3, 3)
┌───────────┬──────────────────┬────────────────────┐
│ sets      ┆ optional_members ┆ result             │
│ ---       ┆ ---              ┆ ---                │
│ list[i64] ┆ list[i64]        ┆ list[bool]         │
╞═══════════╪══════════════════╪════════════════════╡
│ [1, 2, 3] ┆ [1, 2, 3]        ┆ [true, true, true] │
│ [1, 2]    ┆ [1, 2]           ┆ [true, true]       │
│ [9, 10]   ┆ [9, 0]           ┆ [true, false]      │
└───────────┴──────────────────┴────────────────────┘
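The list.get() variant above builds one flag per index up to the longest list and then trims each row back to its own length. A plain-Python sketch of that pad-then-trim idea follows; get_or_none and padded_membership are hypothetical helpers written only for this illustration, and None stands in for the null that an out-of-bounds lookup returns.

```python
def get_or_none(lst, i):
    """Return lst[i], or None when i is past the end of the list."""
    return lst[i] if i < len(lst) else None


def padded_membership(sets, optional_members):
    m = max(len(s) for s in sets)  # length of the longest list in `sets`
    out = []
    for s, o in zip(sets, optional_members):
        # Always produce m flags, one per index, even for shorter rows.
        padded = [get_or_none(s, i) in o for i in range(m)]
        # Drop the flags that came from padding positions.
        out.append(padded[: len(s)])
    return out


sets = [[1, 2, 3], [1, 2], [9, 10]]
optional_members = [[1, 2, 3], [1, 2], [9, 0]]
result = padded_membership(sets, optional_members)
```

The number of per-index lookups grows with the longest list, which is why this approach suits short lists or lists of equal length.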

One possibility would be to explode, compare, then group_by.agg:

(df.with_row_index() # to be able to group back to original rows
   .explode(['sets', 'optional_members'])
   .with_columns(pl.col('sets').eq(pl.col('optional_members')).alias('result'))
   .group_by('index').agg(pl.col(['sets', 'optional_members', 'result']))
   .drop('index')
)
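To make the explode, compare, regroup steps concrete, here is a plain-Python sketch of the same flow. The function name positional_equal is made up for illustration. Note that this comparison is positional, and that it assumes both lists in a row have the same length.

```python
from collections import defaultdict


def positional_equal(sets, optional_members):
    # "Explode": one (row index, left value, right value) record per element.
    exploded = [
        (idx, left, right)
        for idx, (s, o) in enumerate(zip(sets, optional_members))
        for left, right in zip(s, o)
    ]
    # Compare element-wise, then collect the flags back under their row index.
    grouped = defaultdict(list)
    for idx, left, right in exploded:
        grouped[idx].append(left == right)
    # Emit rows in their original order.
    return [grouped[idx] for idx in range(len(sets))]


sets = [[1, 2, 3], [1, 2], [9, 10]]
optional_members = [[1, 2, 3], [1, 2], [9, 0]]
result = positional_equal(sets, optional_members)
```

In Polars itself, group_by does not guarantee the order of groups unless maintain_order=True is passed, so that flag may be worth adding if row order matters.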

Another option could be to compute the difference of values and cast to booleans:

df.with_columns(pl.col('sets').sub(pl.col('optional_members'))
                  .cast(pl.List(pl.Boolean)).alias('result')
               )

This would invert the result, but you could tweak it using:

df.with_columns(pl.lit(1).sub(
                   pl.col('sets').sub(pl.col('optional_members'))
                     .cast(pl.List(pl.Boolean))
                   ).cast(pl.List(pl.Boolean)).alias('result')
               )

Output:

┌───────────┬──────────────────┬─────────────────────┐
│ sets      ┆ optional_members ┆ result              │
│ ---       ┆ ---              ┆ ---                 │
│ list[i64] ┆ list[i64]        ┆ list[bool]          │
╞═══════════╪══════════════════╪═════════════════════╡
│ [1, 4, 3] ┆ [1, 2, 3]        ┆ [true, false, true] │
│ [1, 0]    ┆ [1, 2]           ┆ [true, false]       │
│ [9, 10]   ┆ [9, 0]           ┆ [true, false]       │
└───────────┴──────────────────┴─────────────────────┘
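The subtract-and-cast idea relies on a zero difference being falsy and any nonzero difference being truthy, which is why the raw cast comes out inverted. A minimal plain-Python sketch of that reasoning is below; the function name is hypothetical, and like the subtraction itself it compares values position by position.

```python
def positional_match_via_difference(a, b):
    # Element-wise difference: 0 where the two values agree.
    diffs = [x - y for x, y in zip(a, b)]
    # Casting to bool marks mismatches (nonzero) as True.
    mismatch = [bool(d) for d in diffs]
    # Invert so that True means "same value at this position".
    return [not flag for flag in mismatch]


same_position = positional_match_via_difference([1, 4, 3], [1, 2, 3])
swapped = positional_match_via_difference([2, 1], [1, 2])
```

Here swapped contains only false flags even though both lists hold the same values, so this approach does not cover the position-independent check.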

Based on some of the feedback, and taking into account the guideline on structs, I think I found an interim solution that works regardless of the dtype (and the order), although it involves map_elements:

df = pl.DataFrame(
    {"sets": [[2, 1, 3], [2, 3], [9, 10]], "optional_members": [[1, 2, 3], [1, 2], [9, 0]]}
)

def test(m, n):
    return [element in n for element in m]


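df.with_columns(contains=
                pl.struct(["sets", "optional_members"])
                # return_dtype states the output type explicitly instead of leaving it to inference
                .map_elements(lambda x: test(x["sets"], x["optional_members"]),
                              return_dtype=pl.List(pl.Boolean)))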

┌───────────┬──────────────────┬────────────────────┐
│ sets      ┆ optional_members ┆ contains           │
│ ---       ┆ ---              ┆ ---                │
│ list[i64] ┆ list[i64]        ┆ list[bool]         │
╞═══════════╪══════════════════╪════════════════════╡
│ [2, 1, 3] ┆ [1, 2, 3]        ┆ [true, true, true] │
│ [2, 3]    ┆ [1, 2]           ┆ [true, false]      │
│ [9, 10]   ┆ [9, 0]           ┆ [true, false]      │
└───────────┴──────────────────┴────────────────────┘