I'm working with a polars DataFrame and would like to select the first and last row per group. Here is a simple example selecting the first row per group:
import polars as pl

df = pl.DataFrame(
    {
        "a": [1, 2, 2, 3, 4, 5],
        "b": [0.5, 0.5, 4, 10, 14, 13],
        "c": [True, True, True, False, False, True],
        "d": ["Apple", "Apple", "Apple", "Banana", "Banana", "Banana"],
    }
)
result = df.group_by("d", maintain_order=True).first()
print(result)
Output:
shape: (2, 4)
┌────────┬─────┬──────┬───────┐
│ d      ┆ a   ┆ b    ┆ c     │
│ ---    ┆ --- ┆ ---  ┆ ---   │
│ str    ┆ i64 ┆ f64  ┆ bool  │
╞════════╪═════╪══════╪═══════╡
│ Apple  ┆ 1   ┆ 0.5  ┆ true  │
│ Banana ┆ 3   ┆ 10.0 ┆ false │
└────────┴─────┴──────┴───────┘
This works well, and we can use .last to do the same for the last row. But how can we combine these in one group_by?
3 Answers
The solutions by @mozway work well! For completeness, I also wanted to share two solutions relying on pl.Expr.gather.
In a select Context
df.select(
    pl.all().gather([0, -1]).over("d", mapping_strategy="explode")
)
In a group-by Context
(
    df
    .group_by("d", maintain_order=True)
    .agg(
        pl.all().gather([0, -1])
    )
    .explode(pl.exclude("d"))
)
Performance Considerations
I also ran preliminary timings of these methods (on the tiny example dataset).
Method | Timings (mean ± std. dev. of 7 runs, 1,000 loops each) |
---|---|
group_by + concat | 452 μs ± 7.34 μs per loop |
filter | 396 μs ± 10.2 μs per loop |
group_by + gather | 255 μs ± 4.09 μs per loop |
select + gather | 172 μs ± 1.29 μs per loop |
As columns
You could use agg; you will have to add a suffix (or prefix) to differentiate the column names:
result = (df.group_by('d', maintain_order=True)
            .agg(pl.all().first().name.suffix('_first'),
                 pl.all().last().name.suffix('_last'))
         )
Output:
┌────────┬─────────┬─────────┬─────────┬────────┬────────┬────────┐
│ d      ┆ a_first ┆ b_first ┆ c_first ┆ a_last ┆ b_last ┆ c_last │
│ ---    ┆ ---     ┆ ---     ┆ ---     ┆ ---    ┆ ---    ┆ ---    │
│ str    ┆ i64     ┆ f64     ┆ bool    ┆ i64    ┆ f64    ┆ bool   │
╞════════╪═════════╪═════════╪═════════╪════════╪════════╪════════╡
│ Apple  ┆ 1       ┆ 0.5     ┆ true    ┆ 2      ┆ 4.0    ┆ true   │
│ Banana ┆ 3       ┆ 10.0    ┆ false   ┆ 5      ┆ 13.0   ┆ true   │
└────────┴─────────┴─────────┴─────────┴────────┴────────┴────────┘
As rows
If you want multiple rows, then you would need to concat:
g = df.group_by('d', maintain_order=True)
result = pl.concat([g.first(), g.last()]).sort(by='d', maintain_order=True)
Output:
┌────────┬─────┬──────┬───────┐
│ d      ┆ a   ┆ b    ┆ c     │
│ ---    ┆ --- ┆ ---  ┆ ---   │
│ str    ┆ i64 ┆ f64  ┆ bool  │
╞════════╪═════╪══════╪═══════╡
│ Apple  ┆ 1   ┆ 0.5  ┆ true  │
│ Apple  ┆ 2   ┆ 4.0  ┆ true  │
│ Banana ┆ 3   ┆ 10.0 ┆ false │
│ Banana ┆ 5   ┆ 13.0 ┆ true  │
└────────┴─────┴──────┴───────┘
Or using filter with int_range + over:
result = df.filter((pl.int_range(pl.len()).over('d') == 0)
                  |(pl.int_range(pl.len(), 0, -1).over('d') == 1)
                  )
Output:
┌─────┬──────┬───────┬────────┐
│ a   ┆ b    ┆ c     ┆ d      │
│ --- ┆ ---  ┆ ---   ┆ ---    │
│ i64 ┆ f64  ┆ bool  ┆ str    │
╞═════╪══════╪═══════╪════════╡
│ 1   ┆ 0.5  ┆ true  ┆ Apple  │
│ 2   ┆ 4.0  ┆ true  ┆ Apple  │
│ 3   ┆ 10.0 ┆ false ┆ Banana │
│ 5   ┆ 13.0 ┆ true  ┆ Banana │
└─────┴──────┴───────┴────────┘
There are dedicated first/last methods.
.is_first_distinct()
.is_last_distinct()
df.filter(
    pl.any_horizontal(
        pl.col("d").is_first_distinct(),
        pl.col("d").is_last_distinct()
    )
)
shape: (4, 4)
┌─────┬──────┬───────┬────────┐
│ a   ┆ b    ┆ c     ┆ d      │
│ --- ┆ ---  ┆ ---   ┆ ---    │
│ i64 ┆ f64  ┆ bool  ┆ str    │
╞═════╪══════╪═══════╪════════╡
│ 1   ┆ 0.5  ┆ true  ┆ Apple  │
│ 2   ┆ 4.0  ┆ true  ┆ Apple  │
│ 3   ┆ 10.0 ┆ false ┆ Banana │
│ 5   ┆ 13.0 ┆ true  ┆ Banana │
└─────┴──────┴───────┴────────┘
You can use a struct if the group identifier is multiple columns.
pl.struct("c", "d").is_first_distinct()
group_by.first is an aggregation (=1 value per group); do you want first/last as columns (possible with group_by) or rows (not directly possible with a single group_by)? – mozway Commented Feb 11 at 10:14
concat and int_range are the desired output. – Quinten Commented Feb 11 at 10:16