Say we have this dataframe:
import polars as pl
df = pl.DataFrame({'EU': {'size': 10, 'GDP': 80},
                   'US': {'size': 100, 'GDP': 800},
                   'AS': {'size': 80, 'GDP': 500}})
shape: (1, 3)
┌───────────┬───────────┬───────────┐
│ EU        ┆ US        ┆ AS        │
│ ---       ┆ ---       ┆ ---       │
│ struct[2] ┆ struct[2] ┆ struct[2] │
╞═══════════╪═══════════╪═══════════╡
│ {10,80}   ┆ {100,800} ┆ {80,500}  │
└───────────┴───────────┴───────────┘
I am looking for a function like df.expand_structs(column_name='metric')
that gives
shape: (2, 4)
┌────────┬─────┬─────┬─────┐
│ metric ┆ EU  ┆ US  ┆ AS  │
│ ---    ┆ --- ┆ --- ┆ --- │
│ str    ┆ i64 ┆ i64 ┆ i64 │
╞════════╪═════╪═════╪═════╡
│ size   ┆ 10  ┆ 100 ┆ 80  │
│ GDP    ┆ 80  ┆ 800 ┆ 500 │
└────────┴─────┴─────┴─────┘
I've tried other functions like unnest and explode, but no luck (a sketch of the failures is below). Any help appreciated!
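For example, both of these raise errors for me (I assume because unnesting everything would create duplicate 'size'/'GDP' column names, and explode expects list columns rather than structs):
df.unnest('EU', 'US', 'AS')  # duplicate column name error
df.explode('EU')             # invalid operation: explode on a struct column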
5 Answers
TL;DR
Performance comparison at the end.
Both @etrotta's method and @DeanMacGregor's adjustment perform well on a pl.LazyFrame (not collected) with small structs (e.g., struct[2]) and N <= 15 columns; the other methods fail lazily.
With bigger structs and/or N > 15 columns, both unpivot options below start to outperform; the other suggested methods are slower in general.
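For illustration, the lazy-friendly shape is essentially @etrotta's approach below run on a LazyFrame; a sketch (assuming all structs share the same field order):
lf = df.lazy()
schema = lf.collect_schema()  # the schema is known without collecting the frame
countries = schema.names()
metrics = [f.name for f in schema[countries[0]].fields]
out = (lf.select(
           pl.lit(metrics).alias('metric'),
           *(pl.concat_list(pl.col(c).struct.field(m) for m in metrics).alias(c)
             for c in countries),
       )
       .explode(pl.all())
       .collect())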
Option 1
out = (df.unpivot()
         .unnest('value')
         .select(pl.exclude('variable'))
         .transpose(include_header=True)
         .pipe(
             lambda x: x.rename(
                 dict(zip(x.columns, ['metric'] + df.columns))
             )
         )
       )
Output:
shape: (2, 4)
┌────────┬─────┬─────┬─────┐
│ metric ┆ EU  ┆ US  ┆ AS  │
│ ---    ┆ --- ┆ --- ┆ --- │
│ str    ┆ i64 ┆ i64 ┆ i64 │
╞════════╪═════╪═════╪═════╡
│ size   ┆ 10  ┆ 100 ┆ 80  │
│ GDP    ┆ 80  ┆ 800 ┆ 500 │
└────────┴─────┴─────┴─────┘
Explanation / Intermediates
- Use df.unpivot:
shape: (3, 2)
┌──────────┬───────────┐
│ variable ┆ value     │
│ ---      ┆ ---       │
│ str      ┆ struct[2] │
╞══════════╪═══════════╡
│ EU       ┆ {10,80}   │
│ US       ┆ {100,800} │
│ AS       ┆ {80,500}  │
└──────────┴───────────┘
- So that we can apply df.unnest on the new 'value' column:
shape: (3, 3)
┌──────────┬──────┬─────┐
│ variable ┆ size ┆ GDP │
│ ---      ┆ ---  ┆ --- │
│ str      ┆ i64  ┆ i64 │
╞══════════╪══════╪═════╡
│ EU       ┆ 10   ┆ 80  │
│ US       ┆ 100  ┆ 800 │
│ AS       ┆ 80   ┆ 500 │
└──────────┴──────┴─────┘
- Use df.select to exclude the 'variable' column (pl.exclude) and df.transpose with include_header=True:
shape: (2, 4)
┌────────┬──────────┬──────────┬──────────┐
│ column ┆ column_0 ┆ column_1 ┆ column_2 │
│ ---    ┆ ---      ┆ ---      ┆ ---      │
│ str    ┆ i64      ┆ i64      ┆ i64      │
╞════════╪══════════╪══════════╪══════════╡
│ size   ┆ 10       ┆ 100      ┆ 80       │
│ GDP    ┆ 80       ┆ 800      ┆ 500      │
└────────┴──────────┴──────────┴──────────┘
- Now, we just need to rename the columns. Here this is done via df.pipe + df.rename. Without the chained operation, that can also be:
out.columns = ['metric'] + df.columns
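If you want the df.expand_structs(column_name='metric') spelling from the question, option 1 can also be wrapped in a small helper; a sketch (expand_structs is not an actual polars method):
def expand_structs(df: pl.DataFrame, column_name: str = 'metric') -> pl.DataFrame:
    # option 1, with the rename done by plain column assignment
    out = (df.unpivot()
             .unnest('value')
             .select(pl.exclude('variable'))
             .transpose(include_header=True))
    out.columns = [column_name] + df.columns
    return out

df.pipe(expand_structs, column_name='metric')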
Option 2
out2 = (df.unpivot()
          .unnest('value')
          .unpivot(index='variable', variable_name='metric')
          .pivot(on='variable', index='metric')
        )
Equality check:
out.equals(out2)
# True
Explanation / Intermediates
- Same start as option 1, but followed by a second df.unpivot to get:
shape: (6, 3)
┌──────────┬────────┬───────┐
│ variable ┆ metric ┆ value │
│ ---      ┆ ---    ┆ ---   │
│ str      ┆ str    ┆ i64   │
╞══════════╪════════╪═══════╡
│ EU       ┆ size   ┆ 10    │
│ US       ┆ size   ┆ 100   │
│ AS       ┆ size   ┆ 80    │
│ EU       ┆ GDP    ┆ 80    │
│ US       ┆ GDP    ┆ 800   │
│ AS       ┆ GDP    ┆ 500   │
└──────────┴────────┴───────┘
- Followed by df.pivot on 'variable' with 'metric' as the index to get the desired shape, shown below.
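The final pivot returns the desired frame, identical to out from option 1 (hence the equality check above):
shape: (2, 4)
┌────────┬─────┬─────┬─────┐
│ metric ┆ EU  ┆ US  ┆ AS  │
│ ---    ┆ --- ┆ --- ┆ --- │
│ str    ┆ i64 ┆ i64 ┆ i64 │
╞════════╪═════╪═════╪═════╡
│ size   ┆ 10  ┆ 100 ┆ 80  │
│ GDP    ┆ 80  ┆ 800 ┆ 500 │
└────────┴─────┴─────┴─────┘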
Performance comparison (gist)
Columns: n_range = [2**k for k in range(12)]
Struct sizes: 2, 20, 100
Methods compared:
- unpivot_unnest_t (option 1), #@ouroboros1
- unpivot_unnest_t2 (option 1, adj)
- unpivot_pivot (option 2)
- concat_list_expl, #@etrotta
- concat_list_expl_lazy, #lazy
- concat_list_expl2, #@etrotta, #@DeanMacGregor
- concat_list_expl2_lazy, #lazy
- map_batches, #@DeanMacGregor
- loop, #@sammywemmy
Results: (benchmark plots not reproduced here; see the gist.)
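The full benchmark code is in the gist; a minimal sketch of the kind of harness used (the test frame here, 100 one-row struct[20] columns, is an assumption for illustration):
import timeit
import polars as pl

# build a test frame with N one-row struct columns of K fields each
N, K = 100, 20
df = pl.DataFrame({f'col{i}': [{f'f{j}': j for j in range(K)}] for i in range(N)})

def unpivot_pivot(df: pl.DataFrame) -> pl.DataFrame:
    # option 2 from above
    return (df.unpivot()
              .unnest('value')
              .unpivot(index='variable', variable_name='metric')
              .pivot(on='variable', index='metric'))

print(timeit.timeit(lambda: unpivot_pivot(df), number=100))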
Working with Structs typically gets a bit awkward when you have multiple columns with the same fields. I would first turn them into lists, then explode:
schema = df.collect_schema()
countries = schema.names()
# countries = ['EU', 'US', 'AS']
metrics = [field.name for field in schema[countries[0]].fields]
# metrics = ['size', 'GDP']
df.select(
    pl.lit(metrics).alias("metrics"),
    *(pl.concat_list(
        pl.col(country).struct.field(metric)
        for metric in metrics
    ).alias(country) for country in countries),
).explode(pl.all())
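This should output the desired frame (the first column is named 'metrics' here; change the alias if you want the exact 'metric' name from the question):
shape: (2, 4)
┌─────────┬─────┬─────┬─────┐
│ metrics ┆ EU  ┆ US  ┆ AS  │
│ ---     ┆ --- ┆ --- ┆ --- │
│ str     ┆ i64 ┆ i64 ┆ i64 │
╞═════════╪═════╪═════╪═════╡
│ size    ┆ 10  ┆ 100 ┆ 80  │
│ GDP     ┆ 80  ┆ 800 ┆ 500 │
└─────────┴─────┴─────┴─────┘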
A variation of the answer from @etrotta, without exploding.
schema = df.collect_schema()
countries = schema.names()
metrics = list(schema[countries[0]].to_schema())
metric = pl.concat(
    pl.repeat(metric, pl.len()).alias("metric")
    for metric in metrics
)
values = [
    pl.concat(pl.col(country).struct.field(m) for m in metrics).alias(country)
    for country in countries
]
df.select(metric, *values)
shape: (2, 4)
┌────────┬─────┬─────┬─────┐
│ metric ┆ EU  ┆ US  ┆ AS  │
│ ---    ┆ --- ┆ --- ┆ --- │
│ str    ┆ i64 ┆ i64 ┆ i64 │
╞════════╪═════╪═════╪═════╡
│ size   ┆ 10  ┆ 100 ┆ 80  │
│ GDP    ┆ 80  ┆ 800 ┆ 500 │
└────────┴─────┴─────┴─────┘
I think etrotta's method will be more efficient, but here's a way that is syntactically shorter:
df.select(
    pl.Series('metric', (metrics := [x.name for x in df.dtypes[0].fields])),
    pl.all().map_batches(lambda s: (
        s.to_frame().unnest(s.name)
        .select(pl.concat_list(metrics).explode())
        .to_series().alias(s.name)
    ))
)
Note the walrus operator in the Series and the reuse of metrics in concat_list. If you're confident that the fields are in the same order in each of your structs, you could forego the walrus and just use pl.all() inside the concat_list (a sketch of that variant follows).
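A sketch of that walrus-free variant, assuming a consistent field order across the structs:
# metrics computed up front instead of via the walrus
metrics = [x.name for x in df.dtypes[0].fields]
df.select(
    pl.Series('metric', metrics),
    pl.all().map_batches(lambda s: (
        s.to_frame().unnest(s.name)
        .select(pl.concat_list(pl.all()).explode())
        .to_series().alias(s.name)
    ))
)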
Alternatively, if you don't like referring to the df inside its own context, you can create the metrics column as follows; this likewise assumes all the structs' fields are in the same order.
df.select(
    pl.first().map_batches(lambda s: pl.Series(s.struct.fields)).alias('metrics'),
    pl.all().map_batches(lambda s: (
        s.to_frame().unnest(s.name)
        .select(pl.concat_list(pl.all()).explode())
        .to_series().alias(s.name)
    ))
)
shape: (2, 4)
┌─────────┬─────┬─────┬─────┐
│ metrics ┆ EU  ┆ US  ┆ AS  │
│ ---     ┆ --- ┆ --- ┆ --- │
│ str     ┆ i64 ┆ i64 ┆ i64 │
╞═════════╪═════╪═════╪═════╡
│ size    ┆ 10  ┆ 100 ┆ 80  │
│ GDP     ┆ 80  ┆ 800 ┆ 500 │
└─────────┴─────┴─────┴─────┘
Speed-wise (and maybe also for simplicity), I would suggest using a for loop to create the individual Series and then building a new DataFrame. This approach is faster than @etrotta's excellent work:
import polars as pl
# reusing @etrotta's work:
schema = df.collect_schema()
countries = schema.names()
# countries = ['EU', 'US', 'AS']
metrics = [field.name for field in schema[countries[0]].fields]
# metrics = ['size', 'GDP']
# build a dictionary of Series
# and subsequently create a new DataFrame
mapping = {}
for country in countries:
    array = []
    for metric in metrics:
        series = df.get_column(country).struct.field(metric)
        array.append(series)
    mapping[country] = pl.concat(array)

# if you are not opposed to using another library,
# numpy.repeat fits in nicely here
# and should offer good perf as well (see the sketch after the output below)
array = []
for metric in metrics:
    array.append(pl.repeat(metric, n=len(df), eager=True))
mapping['metrics'] = pl.concat(array)

pl.DataFrame(mapping)
shape: (2, 4)
┌─────┬─────┬─────┬─────────┐
│ EU  ┆ US  ┆ AS  ┆ metrics │
│ --- ┆ --- ┆ --- ┆ ---     │
│ i64 ┆ i64 ┆ i64 ┆ str     │
╞═════╪═════╪═════╪═════════╡
│ 10  ┆ 100 ┆ 80  ┆ size    │
│ 80  ┆ 800 ┆ 500 ┆ GDP     │
└─────┴─────┴─────┴─────────┘
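The numpy.repeat alternative mentioned in the comments would look something like this sketch (np.repeat repeats each metric name len(df) times, matching the order produced by the loop above):
import numpy as np

# replaces the second for loop
mapping['metrics'] = pl.Series('metrics', np.repeat(metrics, len(df)))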
Of course, the speed tests are based on your shared data; would it still be performant for a large number of columns (where width, not length, is now the controlling factor)?
NB: if the field names could be accessed directly within a context, that would probably offer even more performance, as everything would occur within the polars framework.