
Expand Columns of Structs to Rows in polars


Say we have this dataframe:

import polars as pl
df = pl.DataFrame({'EU': {'size': 10, 'GDP': 80},
                   'US': {'size': 100, 'GDP': 800},
                   'AS': {'size': 80, 'GDP': 500}})

shape: (1, 3)
┌───────────┬───────────┬───────────┐
│ EU        ┆ US        ┆ AS        │
│ ---       ┆ ---       ┆ ---       │
│ struct[2] ┆ struct[2] ┆ struct[2] │
╞═══════════╪═══════════╪═══════════╡
│ {10,80}   ┆ {100,800} ┆ {80,500}  │
└───────────┴───────────┴───────────┘

I am looking for a function like df.expand_structs(column_name='metric') that gives

shape: (2, 4)
┌────────┬─────┬─────┬─────┐
│ metric ┆ EU  ┆ US  ┆ AS  │
│ ---    ┆ --- ┆ --- ┆ --- │
│ str    ┆ i64 ┆ i64 ┆ i64 │
╞════════╪═════╪═════╪═════╡
│ size   ┆ 10  ┆ 100 ┆ 80  │
│ GDP    ┆ 80  ┆ 800 ┆ 500 │
└────────┴─────┴─────┴─────┘

I've tried other functions like unnest, explode but no luck. Any help appreciated!
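For context, a direct df.unnest on all the columns fails precisely because every struct carries the same field names, which is the issue the answers below work around (a minimal sketch; the exact error type may vary by polars version):

import polars as pl

df = pl.DataFrame({'EU': {'size': 10, 'GDP': 80},
                   'US': {'size': 100, 'GDP': 800},
                   'AS': {'size': 80, 'GDP': 500}})

# each struct contributes the fields 'size' and 'GDP', so unnesting
# them all at once produces duplicate column names
df.unnest('EU', 'US', 'AS')  # raises a duplicate column name error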


5 Answers


TL;DR

Performance comparison at the end.

Both @etrotta's method and @DeanMacGregor's adjustment perform well on a pl.LazyFrame with small structs (e.g., struct[2]) and N <= 15 columns (not collected). The other methods cannot run lazily at all.

With bigger structs and/or N > 15 columns, both unpivot options below start to outperform them. The other suggested methods are slower in general.


Option 1

out = (df.unpivot()
       .unnest('value')
       .select(pl.exclude('variable'))
       .transpose(include_header=True)
       .pipe(
           lambda x: x.rename(
               dict(zip(x.columns, ['metric'] + df.columns))
               )
           )
       )

Output:

shape: (2, 4)
┌────────┬─────┬─────┬─────┐
│ metric ┆ EU  ┆ US  ┆ AS  │
│ ---    ┆ --- ┆ --- ┆ --- │
│ str    ┆ i64 ┆ i64 ┆ i64 │
╞════════╪═════╪═════╪═════╡
│ size   ┆ 10  ┆ 100 ┆ 80  │
│ GDP    ┆ 80  ┆ 800 ┆ 500 │
└────────┴─────┴─────┴─────┘

Explanation / Intermediates

  • Use df.unpivot:
shape: (3, 2)
┌──────────┬───────────┐
│ variable ┆ value     │
│ ---      ┆ ---       │
│ str      ┆ struct[2] │
╞══════════╪═══════════╡
│ EU       ┆ {10,80}   │
│ US       ┆ {100,800} │
│ AS       ┆ {80,500}  │
└──────────┴───────────┘
  • So that we can apply df.unnest on new 'value' column:
shape: (3, 3)
┌──────────┬──────┬─────┐
│ variable ┆ size ┆ GDP │
│ ---      ┆ ---  ┆ --- │
│ str      ┆ i64  ┆ i64 │
╞══════════╪══════╪═════╡
│ EU       ┆ 10   ┆ 80  │
│ US       ┆ 100  ┆ 800 │
│ AS       ┆ 80   ┆ 500 │
└──────────┴──────┴─────┘
  • Use df.select to exclude 'variable' column (pl.exclude) and df.transpose with include_header=True:
shape: (2, 4)
┌────────┬──────────┬──────────┬──────────┐
│ column ┆ column_0 ┆ column_1 ┆ column_2 │
│ ---    ┆ ---      ┆ ---      ┆ ---      │
│ str    ┆ i64      ┆ i64      ┆ i64      │
╞════════╪══════════╪══════════╪══════════╡
│ size   ┆ 10       ┆ 100      ┆ 80       │
│ GDP    ┆ 80       ┆ 800      ┆ 500      │
└────────┴──────────┴──────────┴──────────┘
  • Now, we just need to rename the columns. Here done via df.pipe + df.rename. Without the chained operation, that can also be:
out.columns = ['metric'] + df.columns
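To get the df.expand_structs(column_name=...) ergonomics from the question, option 1 can also be packaged as a helper and applied via df.pipe (a hypothetical helper, not a polars built-in):

def expand_structs(df: pl.DataFrame, column_name: str = 'metric') -> pl.DataFrame:
    # option 1, wrapped as a reusable function
    out = (df.unpivot()
           .unnest('value')
           .select(pl.exclude('variable'))
           .transpose(include_header=True))
    out.columns = [column_name] + df.columns
    return out

df.pipe(expand_structs, column_name='metric')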

Option 2

out2 = (df.unpivot()
        .unnest('value')
        .unpivot(index='variable', variable_name='metric')
        .pivot(on='variable', index='metric')
        )

Equality check:

out.equals(out2)
# True

Explanation / Intermediates

  • Same start as option 1, but followed by a second df.unpivot to get:
shape: (6, 3)
┌──────────┬────────┬───────┐
│ variable ┆ metric ┆ value │
│ ---      ┆ ---    ┆ ---   │
│ str      ┆ str    ┆ i64   │
╞══════════╪════════╪═══════╡
│ EU       ┆ size   ┆ 10    │
│ US       ┆ size   ┆ 100   │
│ AS       ┆ size   ┆ 80    │
│ EU       ┆ GDP    ┆ 80    │
│ US       ┆ GDP    ┆ 800   │
│ AS       ┆ GDP    ┆ 500   │
└──────────┴────────┴───────┘
  • Followed by df.pivot on 'variable' with 'metric' as the index to get the desired shape.
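Note that both options rely on eager-only operations: df.transpose and df.pivot exist on pl.DataFrame but not on pl.LazyFrame, which is why the TL;DR notes that the other methods cannot run lazily. A quick illustration (as of recent polars versions):

lf = df.lazy()
lf.unpivot().unnest('value')  # fine: both exist on LazyFrame
# lf.pivot(...)               # AttributeError: 'LazyFrame' object has no attribute 'pivot'
# lf.transpose(...)           # AttributeError: 'LazyFrame' object has no attribute 'transpose'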

Performance comparison (gist)

Columns: n_range=[2**k for k in range(12)]

Struct sizes: 2, 20, 100 fields

Methods compared:

  • unpivot_unnest_t (option 1), #@ouroboros1
  • unpivot_unnest_t2 (option 1, adj)
  • unpivot_pivot (option 2)
  • concat_list_expl, #@etrotta
  • concat_list_expl_lazy, #lazy
  • concat_list_expl2, #@etrotta, #@DeanMacGregor
  • concat_list_expl2_lazy, #lazy
  • map_batches, #@DeanMacGregor
  • loop, #@sammywemmy
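
The n_range parameter above matches perfplot's API, so the benchmark was presumably set up along these lines (a sketch with a hypothetical data generator; the actual code is in the linked gist):

import perfplot
import polars as pl

def make_df(n_cols: int, struct_size: int = 2) -> pl.DataFrame:
    # hypothetical generator: n_cols struct columns with struct_size fields each
    return pl.DataFrame(
        {f"c{i}": {f"f{j}": j for j in range(struct_size)} for i in range(n_cols)}
    )

def unpivot_unnest_t(df: pl.DataFrame) -> pl.DataFrame:
    # option 1
    out = (df.unpivot().unnest('value')
           .select(pl.exclude('variable'))
           .transpose(include_header=True))
    out.columns = ['metric'] + df.columns
    return out

def unpivot_pivot(df: pl.DataFrame) -> pl.DataFrame:
    # option 2
    return (df.unpivot().unnest('value')
            .unpivot(index='variable', variable_name='metric')
            .pivot(on='variable', index='metric'))

perfplot.show(
    setup=make_df,
    kernels=[unpivot_unnest_t, unpivot_pivot],  # plus the other methods
    n_range=[2**k for k in range(12)],
    equality_check=None,  # column order differs between methods
)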

Results: see the benchmark plots in the linked gist.

Working with Structs typically gets a bit awkward when you have multiple columns with the same fields; I would first turn them into lists, then explode:

schema = df.collect_schema()
countries = schema.names()
# countries = ['EU', 'US', 'AS']
metrics = [field.name for field in schema[countries[0]].fields]
# metrics = ['size', 'GDP']

df.select(
    pl.lit(metrics).alias("metrics"),
    *(pl.concat_list(
        pl.col(country).struct.field(metric)
        for metric in metrics
    ).alias(country) for country in countries),
).explode(pl.all())
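
Since this is purely expression-based, the same approach also runs on a LazyFrame unchanged, presumably corresponding to the concat_list_expl_lazy entry in the benchmark above (a sketch):

lf = df.lazy()
schema = lf.collect_schema()
countries = schema.names()
metrics = [field.name for field in schema[countries[0]].fields]

out = lf.select(
    pl.lit(metrics).alias("metrics"),
    *(pl.concat_list(
        pl.col(country).struct.field(metric)
        for metric in metrics
    ).alias(country) for country in countries),
).explode(pl.all()).collect()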

A variation of the answer from @etrotta, without exploding.

schema = df.collect_schema()
countries = schema.names()
metrics = list(schema[countries[0]].to_schema())

metric = pl.concat(
    pl.repeat(metric, pl.len()).alias("metric")
    for metric in metrics
)

values = [
    pl.concat([pl.col(country).struct.field(metric) for metric in metrics]).alias(country)
    for country in countries
]

df.select(metric, *values)

Output:

shape: (2, 4)
┌────────┬─────┬─────┬─────┐
│ metric ┆ EU  ┆ US  ┆ AS  │
│ ---    ┆ --- ┆ --- ┆ --- │
│ str    ┆ i64 ┆ i64 ┆ i64 │
╞════════╪═════╪═════╪═════╡
│ size   ┆ 10  ┆ 100 ┆ 80  │
│ GDP    ┆ 80  ┆ 800 ┆ 500 │
└────────┴─────┴─────┴─────┘
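
This variation is likewise expression-based, so the same metric and values expressions should also work on a LazyFrame (presumably the concat_list_expl2_lazy entry in the benchmark):

# reuse the metric / values expressions from above, evaluated lazily
df.lazy().select(metric, *values).collect()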

I think etrotta's method will be more efficient, but here's a way that is syntactically shorter:

df.select(
    pl.Series('metric', (metrics := [x.name for x in df.dtypes[0].fields])),
    pl.all().map_batches(lambda s: (
        s.to_frame().unnest(s.name)
        .select(pl.concat_list(metrics).explode())
        .to_series().alias(s.name)
    ))
)

Note the walrus operator in the Series and the reuse of metrics in concat_list. If you're confident that the fields will be in the same order in each of your structs, then you could forgo the walrus and just use pl.all() inside the concat_list.

Alternatively, if you don't like referring to the df inside of its own context, you could create the metrics column this way, which assumes all the structs' fields are in the same order.

df.select(
    pl.first().map_batches(lambda s: pl.Series(s.struct.fields)).alias('metrics'),
    pl.all().map_batches(lambda s: (
        s.to_frame().unnest(s.name)
        .select(pl.concat_list(pl.all()).explode())
        .to_series().alias(s.name)
    ))
)

Output:
shape: (2, 4)
┌─────────┬─────┬─────┬─────┐
│ metrics ┆ EU  ┆ US  ┆ AS  │
│ ---     ┆ --- ┆ --- ┆ --- │
│ str     ┆ i64 ┆ i64 ┆ i64 │
╞═════════╪═════╪═════╪═════╡
│ size    ┆ 10  ┆ 100 ┆ 80  │
│ GDP     ┆ 80  ┆ 800 ┆ 500 │
└─────────┴─────┴─────┴─────┘

Speed-wise (and maybe simplicity-wise), I would suggest using a for loop to create the individual Series and then creating a new DataFrame. This approach is faster than @etrotta's excellent work:

import polars as pl

# reusing @etrotta's work:
schema = df.collect_schema()
countries = schema.names()
# countries = ['EU', 'US', 'AS']
metrics = [field.name for field in schema[countries[0]].fields]
# metrics = ['size', 'GDP']


# build a dictionary of Series
# and subsequently create a new DataFrame
mapping = {}
for country in countries:
    array = []
    for metric in metrics:
        series = df.get_column(country).struct.field(metric)
        array.append(series)
    mapping[country] = pl.concat(array)

# build the repeated metric labels
# (if you are not opposed to using another library, numpy.repeat
# fits in nicely here and should offer good perf as well)
array = []
for metric in metrics:
    array.append(pl.repeat(metric, n=len(df), eager=True))
mapping['metrics'] = pl.concat(array)
pl.DataFrame(mapping)

shape: (2, 4)
┌─────┬─────┬─────┬─────────┐
│ EU  ┆ US  ┆ AS  ┆ metrics │
│ --- ┆ --- ┆ --- ┆ ---     │
│ i64 ┆ i64 ┆ i64 ┆ str     │
╞═════╪═════╪═════╪═════════╡
│ 10  ┆ 100 ┆ 80  ┆ size    │
│ 80  ┆ 800 ┆ 500 ┆ GDP     │
└─────┴─────┴─────┴─────────┘
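
The numpy.repeat alternative mentioned in the comment above could look like this (a sketch, assuming numpy as an extra dependency; np.repeat(['size', 'GDP'], n) yields ['size'] * n + ['GDP'] * n, matching the loop's order):

import numpy as np

# replace the metric-label loop with a single numpy call
mapping['metrics'] = pl.Series('metrics', np.repeat(metrics, len(df)))
pl.DataFrame(mapping)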

Of course, the speed tests are based on your shared data; whether it would still be performant for a large number of columns (width, not length, now being the controlling factor) remains to be checked.

NB: if the field names could be accessed directly within a context, that would probably offer even more performance, as everything would occur within the polars framework.
