Python Pandas: Groupby multiple columns and linearly interpolate values of column Y based on another X column - Stack Overflow

Consider the following pandas dataframe

    reference   sicovam     label       id      date        TTM price
0   SCOM_WTI    68801903    WTI Nymex   BBG:CL  2015-01-02  18  52.69
1   SCOM_WTI    68801903    WTI Nymex   BBG:CL  2015-01-02  30  NaN
2   SCOM_WTI    68801903    WTI Nymex   BBG:CL  2015-01-02  49  53.11
3   SCOM_WTI    68801903    WTI Nymex   BBG:CL  2015-01-02  60  NaN
4   SCOM_WTI    68801903    WTI Nymex   BBG:CL  2015-01-02  77  53.69
5   SCOM_WTI    68801903    WTI Nymex   BBG:CL  2015-01-02  90  NaN
6   SCOM_WTI    68801903    WTI Nymex   BBG:CL  2015-01-02  109 54.42
7   SCOM_WTI    68801903    WTI Nymex   BBG:CL  2015-01-02  137 55.15
8   SCOM_WTI    68801903    WTI Nymex   BBG:CL  2015-01-02  171 55.80
9   SCOM_WTI    68801903    WTI Nymex   BBG:CL  2015-01-02  180 NaN
10  SCOM_WTI    68801903    WTI Nymex   BBG:CL  2015-01-05  15  50.04
11  SCOM_WTI    68801903    WTI Nymex   BBG:CL  2015-01-05  30  NaN
12  SCOM_WTI    68801903    WTI Nymex   BBG:CL  2015-01-05  46  50.52
13  SCOM_WTI    68801903    WTI Nymex   BBG:CL  2015-01-05  60  NaN
14  SCOM_WTI    68801903    WTI Nymex   BBG:CL  2015-01-05  74  51.17
15  SCOM_WTI    68801903    WTI Nymex   BBG:CL  2015-01-05  90  NaN
16  SCOM_WTI    68801903    WTI Nymex   BBG:CL  2015-01-05  106 51.95
17  SCOM_WTI    68801903    WTI Nymex   BBG:CL  2015-01-05  134 52.73
18  SCOM_WTI    68801903    WTI Nymex   BBG:CL  2015-01-05  168 53.46
19  SCOM_WTI    68801903    WTI Nymex   BBG:CL  2015-01-05  180 NaN

After grouping by the reference, sicovam, label, id and date columns, I would like to fill the NaN values of the price column by linear interpolation over TTM. In terms of the linear interpolation formula, price is the y variable and TTM is the x variable.

So far, I have written the following:

def intepolate_group(group):
    group["price"] = group["price"].interpolate(method='linear', limit_direction='both', axis=0)
    return group

new_df = df.groupby(["reference","sicovam","label","id","date"])[["TTM","price"]].apply(intepolate_group)

However, the result I get is a linear interpolation over the row positions within each group, not over TTM. For example, for the rows shown below, I get 54.06 instead of 53.99. What am I missing in order to interpolate over the TTM variable?

PS: I want to avoid masking in a loop (instead of grouping) and setting TTM as the index, because the dataframe is quite big and that takes a considerable amount of time.

4   SCOM_WTI    68801903    WTI Nymex   BBG:CL  2015-01-02  77  53.69
5   SCOM_WTI    68801903    WTI Nymex   BBG:CL  2015-01-02  90  NaN
6   SCOM_WTI    68801903    WTI Nymex   BBG:CL  2015-01-02  109 54.42
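
The gap between the two numbers can be reproduced by hand. The following is a minimal sketch, assuming only NumPy; the variable names are illustrative and not part of the original code. It compares the midpoint of the two neighbouring prices (what equally spaced interpolation gives) with interpolation weighted by TTM.

```python
import numpy as np

# Known points on either side of the missing price at TTM = 90 (rows 4 and 6).
ttm_known = [77, 109]
price_known = [53.69, 54.42]

# Equal spacing: row 5 is treated as the midpoint between rows 4 and 6.
by_position = (price_known[0] + price_known[1]) / 2

# TTM spacing: the weight is (90 - 77) / (109 - 77) = 13 / 32.
by_ttm = np.interp(90, ttm_known, price_known)

print(by_position)  # about 54.055, the value reported as 54.06
print(by_ttm)       # about 53.9866, the desired 53.99
```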

Asked Feb 7 at 16:15 by Whitebeard13; edited Feb 7 at 17:01 by ouroboros1.

1 Answer

Here's one approach:

cols = ['reference', 'sicovam', 'label', 'id', 'date']

df['price'] = (
    df.set_index('TTM')
    .groupby(cols)['price']
    .transform(lambda x: x.interpolate(method='index'))
    .values
    )

Output:

df.iloc[4:7]

   reference   sicovam      label      id        date  TTM      price
4   SCOM_WTI  68801903  WTI Nymex  BBG:CL  2015-01-02   77  53.690000
5   SCOM_WTI  68801903  WTI Nymex  BBG:CL  2015-01-02   90  53.986562
6   SCOM_WTI  68801903  WTI Nymex  BBG:CL  2015-01-02  109  54.420000
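
To check the approach end to end, here is a self-contained sketch on a trimmed copy of the question's data. The six-row frame is an assumption made for brevity (three TTM points per date); the grouping and interpolation calls are the same as in the snippet above.

```python
import numpy as np
import pandas as pd

# Trimmed copy of the question's data: two dates, three TTM points each.
df = pd.DataFrame({
    "reference": "SCOM_WTI",
    "sicovam": 68801903,
    "label": "WTI Nymex",
    "id": "BBG:CL",
    "date": ["2015-01-02"] * 3 + ["2015-01-05"] * 3,
    "TTM": [77, 90, 109, 74, 90, 106],
    "price": [53.69, np.nan, 54.42, 51.17, np.nan, 51.95],
})

cols = ["reference", "sicovam", "label", "id", "date"]

# With TTM as the index, method='index' uses the TTM values as x.
df["price"] = (
    df.set_index("TTM")
    .groupby(cols)["price"]
    .transform(lambda x: x.interpolate(method="index"))
    .values
)

print(df[["date", "TTM", "price"]])
```

Each date is interpolated separately, so the filled price at TTM 90 differs between the two groups.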

Explanation

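  • Set column 'TTM' as the index with df.set_index, then group with df.groupby.
  • Use groupby.transform with pd.Series.interpolate and method='index', which interpolates against the index values (here TTM) instead of treating the rows as equally spaced.
  • Finally, chain Series.values to assign the result back to df['price']. The transformed series is indexed by TTM, so assigning the raw values avoids index alignment; transform keeps the original row order.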
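
The steps above do not say what happens to NaN rows at the edge of a group, such as the rows at TTM 180, which have no known price beyond them. Below is a minimal sketch of how interpolate handles such rows, using a hypothetical four-point series (the leading NaN at TTM 100 is invented for illustration). The behaviour noted in the comments is what I would expect from current pandas releases, so it is worth verifying on your version.

```python
import numpy as np
import pandas as pd

# Hypothetical series indexed by TTM, with a NaN at each end.
s = pd.Series([np.nan, 55.15, 55.80, np.nan], index=[100, 137, 171, 180])

# Default: the trailing NaN takes the last known price; the leading NaN stays.
default = s.interpolate(method="index")

# limit_direction='both' also fills the leading NaN with the first known price.
both = s.interpolate(method="index", limit_direction="both")

# limit_area='inside' fills only NaNs surrounded by known values.
inside = s.interpolate(method="index", limit_area="inside")

print(pd.DataFrame({"default": default, "both": both, "inside": inside}))
```

None of these options extrapolates the curve linearly; edge rows are either left as NaN or filled with the nearest known price.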
