python - Select the first and last row per group in Polars dataframe - Stack Overflow

I have a Polars DataFrame and would like to select the first and last row per group. Here is a simple example that selects the first row per group:

import polars as pl

df = pl.DataFrame(
    {
        "a": [1, 2, 2, 3, 4, 5],
        "b": [0.5, 0.5, 4, 10, 14, 13],
        "c": [True, True, True, False, False, True],
        "d": ["Apple", "Apple", "Apple", "Banana", "Banana", "Banana"],
    }
)
result = df.group_by("d", maintain_order=True).first()
print(result)

Output:

shape: (2, 4)
┌────────┬─────┬──────┬───────┐
│ d      ┆ a   ┆ b    ┆ c     │
│ ---    ┆ --- ┆ ---  ┆ ---   │
│ str    ┆ i64 ┆ f64  ┆ bool  │
╞════════╪═════╪══════╪═══════╡
│ Apple  ┆ 1   ┆ 0.5  ┆ true  │
│ Banana ┆ 3   ┆ 10.0 ┆ false │
└────────┴─────┴──────┴───────┘

This works well, and .last() does the same for the last row. But how can the two be combined in a single group_by?

asked Feb 11 at 9:57 by Quinten
  • I realize the question is ambiguous, since group_by.first is an aggregation (=1 value per group); do you want first/last as columns (possible with group_by) or rows (not directly possible with a single group_by)? – mozway Commented Feb 11 at 10:14
  • Hi @mozway, thank you for your answer. Your concat and int_range solutions give the desired output. – Quinten Commented Feb 11 at 10:16

3 Answers

The solutions by @mozway work well! For completeness, I also wanted to share two solutions relying on pl.Expr.gather.

In a select Context

df.select(
    pl.all().gather([0, -1]).over("d", mapping_strategy="explode")
)

In a group-by Context

(
    df
    .group_by("d", maintain_order=True)
    .agg(
        pl.all().gather([0, -1])
    )
    .explode(pl.exclude("d"))
)

Performance Considerations

I also ran preliminary timings of these methods (on the tiny example dataset).

Method               Timings (mean ± std. dev. of 7 runs, 1,000 loops each)
group_by + concat    452 μs ± 7.34 μs per loop
filter               396 μs ± 10.2 μs per loop
group_by + gather    255 μs ± 4.09 μs per loop
select + gather      172 μs ± 1.29 μs per loop

As columns

You could use agg; you will need to add a suffix (or prefix) to differentiate the column names:

result = (df.group_by('d', maintain_order=True)
            .agg(pl.all().first().name.suffix('_first'),
                 pl.all().last().name.suffix('_last'))
         )

Output:

┌────────┬─────────┬─────────┬─────────┬────────┬────────┬────────┐
│ d      ┆ a_first ┆ b_first ┆ c_first ┆ a_last ┆ b_last ┆ c_last │
│ ---    ┆ ---     ┆ ---     ┆ ---     ┆ ---    ┆ ---    ┆ ---    │
│ str    ┆ i64     ┆ f64     ┆ bool    ┆ i64    ┆ f64    ┆ bool   │
╞════════╪═════════╪═════════╪═════════╪════════╪════════╪════════╡
│ Apple  ┆ 1       ┆ 0.5     ┆ true    ┆ 2      ┆ 4.0    ┆ true   │
│ Banana ┆ 3       ┆ 10.0    ┆ false   ┆ 5      ┆ 13.0   ┆ true   │
└────────┴─────────┴─────────┴─────────┴────────┴────────┴────────┘

As rows

If you want multiple rows, then you would need to concat:

g = df.group_by('d', maintain_order=True)

result = pl.concat([g.first(), g.last()]).sort(by='d', maintain_order=True)

Output:

┌────────┬─────┬──────┬───────┐
│ d      ┆ a   ┆ b    ┆ c     │
│ ---    ┆ --- ┆ ---  ┆ ---   │
│ str    ┆ i64 ┆ f64  ┆ bool  │
╞════════╪═════╪══════╪═══════╡
│ Apple  ┆ 1   ┆ 0.5  ┆ true  │
│ Apple  ┆ 2   ┆ 4.0  ┆ true  │
│ Banana ┆ 3   ┆ 10.0 ┆ false │
│ Banana ┆ 5   ┆ 13.0 ┆ true  │
└────────┴─────┴──────┴───────┘

Or using filter with int_range+over:

result = df.filter((pl.int_range(pl.len()).over('d') == 0)
                  |(pl.int_range(pl.len(), 0, -1).over('d') == 1)
                  )

Output:

┌─────┬──────┬───────┬────────┐
│ a   ┆ b    ┆ c     ┆ d      │
│ --- ┆ ---  ┆ ---   ┆ ---    │
│ i64 ┆ f64  ┆ bool  ┆ str    │
╞═════╪══════╪═══════╪════════╡
│ 1   ┆ 0.5  ┆ true  ┆ Apple  │
│ 2   ┆ 4.0  ┆ true  ┆ Apple  │
│ 3   ┆ 10.0 ┆ false ┆ Banana │
│ 5   ┆ 13.0 ┆ true  ┆ Banana │
└─────┴──────┴───────┴────────┘

There are also dedicated methods that flag the first and last occurrence of each value:

  • .is_first_distinct()
  • .is_last_distinct()

df.filter(
    pl.any_horizontal(
        pl.col("d").is_first_distinct(),
        pl.col("d").is_last_distinct()
    )
)

Output:

shape: (4, 4)
┌─────┬──────┬───────┬────────┐
│ a   ┆ b    ┆ c     ┆ d      │
│ --- ┆ ---  ┆ ---   ┆ ---    │
│ i64 ┆ f64  ┆ bool  ┆ str    │
╞═════╪══════╪═══════╪════════╡
│ 1   ┆ 0.5  ┆ true  ┆ Apple  │
│ 2   ┆ 4.0  ┆ true  ┆ Apple  │
│ 3   ┆ 10.0 ┆ false ┆ Banana │
│ 5   ┆ 13.0 ┆ true  ┆ Banana │
└─────┴──────┴───────┴────────┘

If the group is identified by multiple columns, you can combine them into a struct:

pl.struct("c", "d").is_first_distinct()