
python - normalise a list column in DuckDB SQL - Stack Overflow


Say I have:

import duckdb
import polars as pl

df = pl.DataFrame({'a':[1,1,2], 'b': [4,5,6]}).with_columns(c=pl.concat_list('a', 'b'))

print(df)
shape: (3, 3)
┌─────┬─────┬───────────┐
│ a   ┆ b   ┆ c         │
│ --- ┆ --- ┆ ---       │
│ i64 ┆ i64 ┆ list[i64] │
╞═════╪═════╪═══════════╡
│ 1   ┆ 4   ┆ [1, 4]    │
│ 1   ┆ 5   ┆ [1, 5]    │
│ 2   ┆ 6   ┆ [2, 6]    │
└─────┴─────┴───────────┘

I can normalise column 'c' by doing:

In [15]: df.with_columns(c_normalised = pl.col('c') / pl.col('c').list.sum())
Out[15]:
shape: (3, 4)
┌─────┬─────┬───────────┬──────────────────────┐
│ a   ┆ b   ┆ c         ┆ c_normalised         │
│ --- ┆ --- ┆ ---       ┆ ---                  │
│ i64 ┆ i64 ┆ list[i64] ┆ list[f64]            │
╞═════╪═════╪═══════════╪══════════════════════╡
│ 1   ┆ 4   ┆ [1, 4]    ┆ [0.2, 0.8]           │
│ 1   ┆ 5   ┆ [1, 5]    ┆ [0.166667, 0.833333] │
│ 2   ┆ 6   ┆ [2, 6]    ┆ [0.25, 0.75]         │
└─────┴─────┴───────────┴──────────────────────┘

How can I do this in DuckDB? I've tried:

In [17]: duckdb.sql("""
    ...: from df
    ...: select c / list_sum(c)
    ...: """)
---------------------------------------------------------------------------
BinderException                           Traceback (most recent call last)
Cell In[17], line 1
----> 1 duckdb.sql("""
      2 from df
      3 select c / list_sum(c)
      4 """)

BinderException: Binder Error: No function matches the given name and argument types '/(BIGINT[], HUGEINT)'. You might need to add explicit type casts.
        Candidate functions:
        /(FLOAT, FLOAT) -> FLOAT
        /(DOUBLE, DOUBLE) -> DOUBLE
        /(INTERVAL, BIGINT) -> INTERVAL

Asked Mar 17 at 11:50 by ignoring_gravity, edited Mar 17 at 14:03

1 Answer

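Found it. As the binder error shows, the / operator isn't defined between a list and a scalar, so the division has to be applied to each element instead, here with list_transform and a lambda: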

In [26]: duckdb.sql("""
    ...: from df
    ...: select *, list_transform(c, x -> x / list_sum(c)) as c_normalised
    ...: """)
Out[26]:
┌───────┬───────┬─────────┬───────────────────────────────────────────┐
│   a   │   b   │    c    │               c_normalised                │
│ int64 │ int64 │ int64[] │                 double[]                  │
├───────┼───────┼─────────┼───────────────────────────────────────────┤
│     1 │     4 │ [1, 4]  │ [0.2, 0.8]                                │
│     1 │     5 │ [1, 5]  │ [0.16666666666666666, 0.8333333333333334] │
│     2 │     6 │ [2, 6]  │ [0.25, 0.75]                              │
└───────┴───────┴─────────┴───────────────────────────────────────────┘


Or, even nicer, with a list comprehension:


In [39]: duckdb.sql("""
    ...: from df
    ...: select *, [x / list_sum(c) for x in c] as c_normalised
    ...: """)
Out[39]:
┌───────┬───────┬─────────┬───────────────────────────────────────────┐
│   a   │   b   │    c    │               c_normalised                │
│ int64 │ int64 │ int64[] │                 double[]                  │
├───────┼───────┼─────────┼───────────────────────────────────────────┤
│     1 │     4 │ [1, 4]  │ [0.2, 0.8]                                │
│     1 │     5 │ [1, 5]  │ [0.16666666666666666, 0.8333333333333334] │
│     2 │     6 │ [2, 6]  │ [0.25, 0.75]                              │
└───────┴───────┴─────────┴───────────────────────────────────────────┘
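
For anyone more at home in Python than SQL, here is a rough plain-Python sketch of what both queries compute for each row. It only illustrates the semantics and says nothing about how DuckDB actually executes them; the helper name `normalise` and the `rows` list are made up for this example.

```python
def normalise(values):
    """Divide every element of a list by the sum of that list."""
    total = sum(values)  # plays the role of list_sum(c)
    # Same shape as the SQL comprehension [x / list_sum(c) for x in c]
    return [x / total for x in values]


# The three values of column 'c' from the question
rows = [[1, 4], [1, 5], [2, 6]]

result = [normalise(c) for c in rows]

for c, c_normalised in zip(rows, result):
    print(c, c_normalised)
# [1, 4] [0.2, 0.8]
# [1, 5] [0.16666666666666666, 0.8333333333333334]
# [2, 6] [0.25, 0.75]
```

The list_transform version is the same idea spelled with a lambda, roughly `list(map(lambda x: x / total, values))` in Python terms. Note that this sketch raises ZeroDivisionError when a list sums to zero; I haven't checked what the DuckDB queries return in that case.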