Say we have this dataframe:
import polars as pl
df = pl.DataFrame({'EU': {'size': 10, 'GDP': 80},
                   'US': {'size': 100, 'GDP': 800},
                   'AS': {'size': 80, 'GDP': 500}})
shape: (1, 3)
┌───────────┬───────────┬───────────┐
│ EU        ┆ US        ┆ AS        │
│ ---       ┆ ---       ┆ ---       │
│ struct[2] ┆ struct[2] ┆ struct[2] │
╞═══════════╪═══════════╪═══════════╡
│ {10,80}   ┆ {100,800} ┆ {80,500}  │
└───────────┴───────────┴───────────┘
I am looking for a function like df.expand_structs(column_name='metric')
that gives
shape: (2, 4)
┌────────┬─────┬─────┬─────┐
│ metric ┆ EU  ┆ US  ┆ AS  │
│ ---    ┆ --- ┆ --- ┆ --- │
│ str    ┆ i64 ┆ i64 ┆ i64 │
╞════════╪═════╪═════╪═════╡
│ size   ┆ 10  ┆ 100 ┆ 80  │
│ GDP    ┆ 80  ┆ 800 ┆ 500 │
└────────┴─────┴─────┴─────┘
I've tried other functions like unnest and explode, but no luck (a sketch of the failures is below). Any help appreciated!
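For example, both of these raise errors for me (I assume because unnesting everything would create duplicate 'size'/'GDP' column names, and explode expects list columns rather than structs):
df.unnest('EU', 'US', 'AS')  # duplicate column name error
df.explode('EU')             # invalid operation: explode on a struct column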
5 Answers
TL;DR
Performance comparison at the end.
Both @etrotta's method and @DeanMacGregor's adjustment perform well on a pl.LazyFrame (not collected) with small structs (e.g., struct[2]) and N <= 15 columns; the other methods fail lazily.
With bigger structs and/or N > 15 columns, both unpivot options below start to outperform; the other suggested methods are slower in general.
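For illustration, the lazy-friendly shape is essentially @etrotta's approach below run on a LazyFrame; a sketch (assuming all structs share the same field order):
lf = df.lazy()
schema = lf.collect_schema()  # the schema is known without collecting the frame
countries = schema.names()
metrics = [f.name for f in schema[countries[0]].fields]
out = (lf.select(
           pl.lit(metrics).alias('metric'),
           *(pl.concat_list(pl.col(c).struct.field(m) for m in metrics).alias(c)
             for c in countries),
       )
       .explode(pl.all())
       .collect())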
Option 1
out = (df.unpivot()
         .unnest('value')
         .select(pl.exclude('variable'))
         .transpose(include_header=True)
         .pipe(
             lambda x: x.rename(
                 dict(zip(x.columns, ['metric'] + df.columns))
             )
         )
       )
Output:
shape: (2, 4)
┌────────┬─────┬─────┬─────┐
│ metric ┆ EU  ┆ US  ┆ AS  │
│ ---    ┆ --- ┆ --- ┆ --- │
│ str    ┆ i64 ┆ i64 ┆ i64 │
╞════════╪═════╪═════╪═════╡
│ size   ┆ 10  ┆ 100 ┆ 80  │
│ GDP    ┆ 80  ┆ 800 ┆ 500 │
└────────┴─────┴─────┴─────┘
Explanation / Intermediates
- Use df.unpivot:
shape: (3, 2)
┌──────────┬───────────┐
│ variable ┆ value     │
│ ---      ┆ ---       │
│ str      ┆ struct[2] │
╞══════════╪═══════════╡
│ EU       ┆ {10,80}   │
│ US       ┆ {100,800} │
│ AS       ┆ {80,500}  │
└──────────┴───────────┘
- So that we can apply df.unnest on the new 'value' column:
shape: (3, 3)
┌──────────┬──────┬─────┐
│ variable ┆ size ┆ GDP │
│ ---      ┆ ---  ┆ --- │
│ str      ┆ i64  ┆ i64 │
╞══════════╪══════╪═════╡
│ EU       ┆ 10   ┆ 80  │
│ US       ┆ 100  ┆ 800 │
│ AS       ┆ 80   ┆ 500 │
└──────────┴──────┴─────┘
- Use df.select to exclude the 'variable' column (pl.exclude) and df.transpose with include_header=True:
shape: (2, 4)
┌────────┬──────────┬──────────┬──────────┐
│ column ┆ column_0 ┆ column_1 ┆ column_2 │
│ ---    ┆ ---      ┆ ---      ┆ ---      │
│ str    ┆ i64      ┆ i64      ┆ i64      │
╞════════╪══════════╪══════════╪══════════╡
│ size   ┆ 10       ┆ 100      ┆ 80       │
│ GDP    ┆ 80       ┆ 800      ┆ 500      │
└────────┴──────────┴──────────┴──────────┘
- Now, we just need to rename the columns. Here this is done via df.pipe + df.rename. Without the chained operation, that can also be:
out.columns = ['metric'] + df.columns
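If you want the df.expand_structs(column_name='metric') spelling from the question, option 1 can also be wrapped in a small helper; a sketch (expand_structs is not an actual polars method):
def expand_structs(df: pl.DataFrame, column_name: str = 'metric') -> pl.DataFrame:
    # option 1, with the rename done by plain column assignment
    out = (df.unpivot()
             .unnest('value')
             .select(pl.exclude('variable'))
             .transpose(include_header=True))
    out.columns = [column_name] + df.columns
    return out

df.pipe(expand_structs, column_name='metric')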
Option 2
out2 = (df.unpivot()
          .unnest('value')
          .unpivot(index='variable', variable_name='metric')
          .pivot(on='variable', index='metric')
        )
Equality check:
out.equals(out2)
# True
Explanation / Intermediates
- Same start as option 1, but followed by a second df.unpivot to get:
shape: (6, 3)
┌──────────┬────────┬───────┐
│ variable ┆ metric ┆ value │
│ ---      ┆ ---    ┆ ---   │
│ str      ┆ str    ┆ i64   │
╞══════════╪════════╪═══════╡
│ EU       ┆ size   ┆ 10    │
│ US       ┆ size   ┆ 100   │
│ AS       ┆ size   ┆ 80    │
│ EU       ┆ GDP    ┆ 80    │
│ US       ┆ GDP    ┆ 800   │
│ AS       ┆ GDP    ┆ 500   │
└──────────┴────────┴───────┘
- Followed by df.pivot on 'variable' with 'metric' as the index to get the desired shape, shown below.
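The final pivot returns the desired frame, identical to out from option 1 (hence the equality check above):
shape: (2, 4)
┌────────┬─────┬─────┬─────┐
│ metric ┆ EU  ┆ US  ┆ AS  │
│ ---    ┆ --- ┆ --- ┆ --- │
│ str    ┆ i64 ┆ i64 ┆ i64 │
╞════════╪═════╪═════╪═════╡
│ size   ┆ 10  ┆ 100 ┆ 80  │
│ GDP    ┆ 80  ┆ 800 ┆ 500 │
└────────┴─────┴─────┴─────┘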
Performance comparison (gist)
Columns: n_range = [2**k for k in range(12)]
Struct sizes: 2, 20, 100
Methods compared:
- unpivot_unnest_t (option 1), #@ouroboros1
- unpivot_unnest_t2 (option 1, adj)
- unpivot_pivot (option 2)
- concat_list_expl, #@etrotta
- concat_list_expl_lazy, #lazy
- concat_list_expl2, #@etrotta, #@DeanMacGregor
- concat_list_expl2_lazy, #lazy
- map_batches, #@DeanMacGregor
- loop, #@sammywemmy
Results: (benchmark plots not reproduced here; see the gist.)
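The full benchmark code is in the gist; a minimal sketch of the kind of harness used (the test frame here, 100 one-row struct[20] columns, is an assumption for illustration):
import timeit
import polars as pl

# build a test frame with N one-row struct columns of K fields each
N, K = 100, 20
df = pl.DataFrame({f'col{i}': [{f'f{j}': j for j in range(K)}] for i in range(N)})

def unpivot_pivot(df: pl.DataFrame) -> pl.DataFrame:
    # option 2 from above
    return (df.unpivot()
              .unnest('value')
              .unpivot(index='variable', variable_name='metric')
              .pivot(on='variable', index='metric'))

print(timeit.timeit(lambda: unpivot_pivot(df), number=100))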
Working with Structs typically gets a bit awkward when you have multiple columns with the same fields. I would first turn them into lists, then explode:
schema = df.collect_schema()
countries = schema.names()
# countries = ['EU', 'US', 'AS']
metrics = [field.name for field in schema[countries[0]].fields]
# metrics = ['size', 'GDP']
df.select(
    pl.lit(metrics).alias("metrics"),
    *(pl.concat_list(
        pl.col(country).struct.field(metric)
        for metric in metrics
    ).alias(country) for country in countries),
).explode(pl.all())
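This should output the desired frame (the first column is named 'metrics' here; change the alias if you want the exact 'metric' name from the question):
shape: (2, 4)
┌─────────┬─────┬─────┬─────┐
│ metrics ┆ EU  ┆ US  ┆ AS  │
│ ---     ┆ --- ┆ --- ┆ --- │
│ str     ┆ i64 ┆ i64 ┆ i64 │
╞═════════╪═════╪═════╪═════╡
│ size    ┆ 10  ┆ 100 ┆ 80  │
│ GDP     ┆ 80  ┆ 800 ┆ 500 │
└─────────┴─────┴─────┴─────┘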
A variation of the answer from @etrotta, without exploding.
schema = df.collect_schema()
countries = schema.names()
metrics = list(schema[countries[0]].to_schema())
metric = pl.concat(
    pl.repeat(metric, pl.len()).alias("metric")
    for metric in metrics
)
values = [
    pl.concat(pl.col(country).struct.field(m) for m in metrics).alias(country)
    for country in countries
]
df.select(metric, *values)
shape: (2, 4)
┌────────┬─────┬─────┬─────┐
│ metric ┆ EU  ┆ US  ┆ AS  │
│ ---    ┆ --- ┆ --- ┆ --- │
│ str    ┆ i64 ┆ i64 ┆ i64 │
╞════════╪═════╪═════╪═════╡
│ size   ┆ 10  ┆ 100 ┆ 80  │
│ GDP    ┆ 80  ┆ 800 ┆ 500 │
└────────┴─────┴─────┴─────┘
I think etrotta's method will be more efficient, but here's a way that is syntactically shorter:
df.select(
    pl.Series('metric', (metrics := [x.name for x in df.dtypes[0].fields])),
    pl.all().map_batches(lambda s: (
        s.to_frame().unnest(s.name)
        .select(pl.concat_list(metrics).explode())
        .to_series().alias(s.name)
    ))
)
Note the walrus operator in the Series and the reuse of metrics in concat_list. If you're confident that the fields are in the same order in each of your structs, you could forego the walrus and just use pl.all() inside the concat_list (a sketch of that variant follows).
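A sketch of that walrus-free variant, assuming a consistent field order across the structs:
# metrics computed up front instead of via the walrus
metrics = [x.name for x in df.dtypes[0].fields]
df.select(
    pl.Series('metric', metrics),
    pl.all().map_batches(lambda s: (
        s.to_frame().unnest(s.name)
        .select(pl.concat_list(pl.all()).explode())
        .to_series().alias(s.name)
    ))
)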
Alternatively, if you don't like referring to the df inside its own context, you can create the metrics column as follows; this likewise assumes all the structs' fields are in the same order.
df.select(
    pl.first().map_batches(lambda s: pl.Series(s.struct.fields)).alias('metrics'),
    pl.all().map_batches(lambda s: (
        s.to_frame().unnest(s.name)
        .select(pl.concat_list(pl.all()).explode())
        .to_series().alias(s.name)
    ))
)
shape: (2, 4)
┌─────────┬─────┬─────┬─────┐
│ metrics ┆ EU  ┆ US  ┆ AS  │
│ ---     ┆ --- ┆ --- ┆ --- │
│ str     ┆ i64 ┆ i64 ┆ i64 │
╞═════════╪═════╪═════╪═════╡
│ size    ┆ 10  ┆ 100 ┆ 80  │
│ GDP     ┆ 80  ┆ 800 ┆ 500 │
└─────────┴─────┴─────┴─────┘
Speed-wise (and maybe also for simplicity), I would suggest using a for loop to create the individual Series and then building a new DataFrame. This approach is faster than @etrotta's excellent work:
import polars as pl
# reusing @etrotta's work:
schema = df.collect_schema()
countries = schema.names()
# countries = ['EU', 'US', 'AS']
metrics = [field.name for field in schema[countries[0]].fields]
# metrics = ['size', 'GDP']
# build a dictionary of Series
# and subsequently create a new DataFrame
mapping = {}
for country in countries:
    array = []
    for metric in metrics:
        series = df.get_column(country).struct.field(metric)
        array.append(series)
    mapping[country] = pl.concat(array)

# if you are not opposed to using another library,
# numpy.repeat fits in nicely here
# and should offer good perf as well (see the sketch after the output below)
array = []
for metric in metrics:
    array.append(pl.repeat(metric, n=len(df), eager=True))
mapping['metrics'] = pl.concat(array)

pl.DataFrame(mapping)
shape: (2, 4)
┌─────┬─────┬─────┬─────────┐
│ EU  ┆ US  ┆ AS  ┆ metrics │
│ --- ┆ --- ┆ --- ┆ ---     │
│ i64 ┆ i64 ┆ i64 ┆ str     │
╞═════╪═════╪═════╪═════════╡
│ 10  ┆ 100 ┆ 80  ┆ size    │
│ 80  ┆ 800 ┆ 500 ┆ GDP     │
└─────┴─────┴─────┴─────────┘
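The numpy.repeat alternative mentioned in the comments would look something like this sketch (np.repeat repeats each metric name len(df) times, matching the order produced by the loop above):
import numpy as np

# replaces the second for loop
mapping['metrics'] = pl.Series('metrics', np.repeat(metrics, len(df)))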
Of course, the speed tests are based on your shared data; would it still be performant for a large number of columns (where width, not length, is now the controlling factor)?
NB: if the field names could be accessed directly within a context, that would probably offer even more performance, as everything would occur within the polars framework.