
python - Subtle mistake in pandas .apply(lambda g: g.shift(1, fill_value=0).cumsum()) - Stack Overflow


I have a dataframe that records the performance of F1-drivers and it looks like

Driver_ID   Date         Place
1          2025-02-13   1
1          2024-12-31   1
1          2024-11-03   2
1          2023-01-01   1
2          2025-01-13   5
2          2024-12-02   1
2          2024-11-12   2
2          2023-11-12   1
2          2023-05-12   1

and I want to create a new column Total_wins that counts the number of wins the driver had before today's race, so the desired column looks like

Driver_ID   Date         Place  Total_wins
1          2025-02-13   1      2
1          2024-12-31   1      1
1          2024-11-03   2      1
1          2023-01-01   1      0
2          2025-01-13   5      3
2          2024-12-02   1      2
2          2024-11-12   2      2
2          2023-11-12   1      1
2          2023-05-12   1      0

And here is my code:

win = (df.assign(Date=Date)
         .sort_values(['Driver_ID','Date'], ascending=[True,True])
         ['Place'].eq(1))

df['Total_wins']=(win.groupby(df['Driver_ID'], group_keys=False).apply(lambda g: g.shift(1, fill_value=0).cumsum()))

So the code works (mostly) fine. I say "mostly" because I checked the results manually and most of them are correct, but for a few rows it gives wrong values, like

Driver_ID   Date         Place  Total_wins
1          2025-02-13   1      2
1          2024-12-31   1      4
1          2024-11-03   2      1
1          2023-01-01   1      0

I tried to debug it but couldn't find anything wrong. Is there a subtle mistake in my code that might have caused this, or what is the possible reason? My original dataframe is huge (~150,000 rows).

Thank you so much in advance


asked Feb 15 at 19:13 by Ishigami; edited Feb 15 at 19:27
  • I can't replicate your code, but shouldn't you be grouping by win['Driver_ID'] instead of df['Driver_ID']? Also looks like problems can arise by grouping on only wins dataframe with fewer rows but adding output back to df dataframe with original number of rows. – Jonathan Leon Commented Feb 16 at 4:43

1 Answer


I'm not sure what was wrong with your original code, but it seems you were quite close. Here's a modified version that works:

import pandas as pd

df = (
    pd.DataFrame(
        {
            "Driver_ID": [1, 1, 1, 1, 2, 2, 2, 2, 2],
            "Date": [
                "2025-02-13",
                "2024-12-31",
                "2024-11-03",
                "2023-01-01",
                "2025-01-13",
                "2024-12-02",
                "2024-11-12",
                "2023-11-12",
                "2023-05-12",
            ],
            "Place": [1, 1, 2, 1, 5, 1, 2, 1, 1],
        }
    )
    .sort_values(by=["Driver_ID", "Date"])
    .reset_index(drop=True)
)

# Flag the winning races
df["Win?"] = df["Place"] == 1

# shift(1) excludes the current race; cumsum then gives the
# running total of wins strictly before each race, per driver
df["Total_wins"] = df.groupby("Driver_ID", group_keys=False)["Win?"].apply(
    lambda g: g.shift(1, fill_value=0).cumsum()
)

# Drop the helper column
df.drop(columns="Win?", inplace=True)

print(df)

This produces the following:

   Driver_ID        Date  Place Total_wins
0          1  2023-01-01      1          0
1          1  2024-11-03      2          1
2          1  2024-12-31      1          1
3          1  2025-02-13      1          2
4          2  2023-05-12      1          0
5          2  2023-11-12      1          1
6          2  2024-11-12      2          2
7          2  2024-12-02      1          2
8          2  2025-01-13      5          3

If you want to keep the original row order, drop the .reset_index(drop=True) step and call df.sort_index() once the computation is done.
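For what it's worth, once the frame is sorted by Driver_ID and Date, the same running count can be computed without apply at all: take a per-driver cumulative sum of the win flags and subtract the current row's flag so that today's race is excluded. A minimal sketch (assuming the frame is already sorted):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Driver_ID": [1, 1, 2, 2],
        "Date": ["2023-01-01", "2024-11-03", "2023-05-12", "2023-11-12"],
        "Place": [1, 2, 1, 1],
    }
)

# Running total of wins per driver, minus the current race's win,
# i.e. the number of wins strictly before each race
is_win = df["Place"].eq(1)
df["Total_wins"] = is_win.groupby(df["Driver_ID"]).cumsum() - is_win
```

Avoiding the lambda keeps everything vectorized, which should also be noticeably faster on ~150,000 rows.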
