I have a dataframe that records the performance of F1-drivers and it looks like
Driver_ID Date Place
1 2025-02-13 1
1 2024-12-31 1
1 2024-11-03 2
1 2023-01-01 1
2 2025-01-13 5
2 2024-12-02 1
2 2024-11-12 2
2 2023-11-12 1
2 2023-05-12 1
and I want to create a new columns Total_wins
which counts the number of wins of the driver before today's race, so the desired column looks like
Driver_ID Date Place Total_wins
1 2025-02-13 1 2
1 2024-12-31 1 1
1 2024-11-03 2 1
1 2023-01-01 1 0
2 2025-01-13 5 3
2 2024-12-02 1 2
2 2024-11-12 2 2
2 2023-11-12 1 1
2 2023-05-12 1 0
And here is my code:
win = (df.assign(Date=Date)
.sort_values(['Driver_ID','Date'], ascending=[True,True])
['Place'].eq(1))
df['Total_wins']=(win.groupby(df['Driver_ID'], group_keys=False).apply(lambda g: g.shift(1, fill_value=0).cumsum()))
So the code works (mostly) fine. I used mostly because I checked the result manually and most of the results are correct, but for a few rows, it gives wrong results like
Driver_ID Date Place Total_wins
1 2025-02-13 1 2
1 2024-12-31 1 4
1 2024-11-03 2 1
1 2023-01-01 1 0
I tried to debug it but I couldn't find anything wrong. Is there any subtle mistake in my code that might have caused the mistake? Or what is the possible reason for this? My original dataframe is huge (~150000 rows)
Thank you so much in advance
I have a dataframe that records the performance of F1-drivers and it looks like
Driver_ID Date Place
1 2025-02-13 1
1 2024-12-31 1
1 2024-11-03 2
1 2023-01-01 1
2 2025-01-13 5
2 2024-12-02 1
2 2024-11-12 2
2 2023-11-12 1
2 2023-05-12 1
and I want to create a new columns Total_wins
which counts the number of wins of the driver before today's race, so the desired column looks like
Driver_ID Date Place Total_wins
1 2025-02-13 1 2
1 2024-12-31 1 1
1 2024-11-03 2 1
1 2023-01-01 1 0
2 2025-01-13 5 3
2 2024-12-02 1 2
2 2024-11-12 2 2
2 2023-11-12 1 1
2 2023-05-12 1 0
And here is my code:
win = (df.assign(Date=Date)
.sort_values(['Driver_ID','Date'], ascending=[True,True])
['Place'].eq(1))
df['Total_wins']=(win.groupby(df['Driver_ID'], group_keys=False).apply(lambda g: g.shift(1, fill_value=0).cumsum()))
So the code works (mostly) fine. I used mostly because I checked the result manually and most of the results are correct, but for a few rows, it gives wrong results like
Driver_ID Date Place Total_wins
1 2025-02-13 1 2
1 2024-12-31 1 4
1 2024-11-03 2 1
1 2023-01-01 1 0
I tried to debug it but I couldn't find anything wrong. Is there any subtle mistake in my code that might have caused the mistake? Or what is the possible reason for this? My original dataframe is huge (~150000 rows)
Thank you so much in advance
Share Improve this question edited Feb 15 at 19:27 Ishigami asked Feb 15 at 19:13 IshigamiIshigami 5313 silver badges11 bronze badges 1- I can't replicate your code, but shouldn't you be grouping by win['Driver_ID'] instead of df['Driver_ID']? Also looks like problems can arise by grouping on only wins dataframe with fewer rows but adding output back to df dataframe with original number of rows. – Jonathan Leon Commented Feb 16 at 4:43
1 Answer
Reset to default 1I'm not sure what was wrong with your original code, but it seems you were quite close. Here's a modified version which works:
import pandas as pd
df = (
pd.DataFrame(
{
"Driver_ID": [1, 1, 1, 1, 2, 2, 2, 2, 2],
"Date": [
"2025-02-13",
"2024-12-31",
"2024-11-03",
"2023-01-01",
"2025-01-13",
"2024-12-02",
"2024-11-12",
"2023-11-12",
"2023-05-12",
],
"Place": [1, 1, 2, 1, 5, 1, 2, 1, 1],
}
)
.sort_values(by=["Driver_ID", "Date"])
.reset_index(drop=True)
)
df["Win?"] = df["Place"] == 1
df["Total_wins"] = df.groupby("Driver_ID", group_keys=False)["Win?"].apply(
lambda g: g.shift(1, fill_value=0).cumsum()
)
df.drop(columns="Win?", inplace=True)
print(df)
This produces the following:
Driver_ID Date Place Total_wins
0 1 2023-01-01 1 0
1 1 2024-11-03 2 1
2 1 2024-12-31 1 1
3 1 2025-02-13 1 2
4 2 2023-05-12 1 0
5 2 2023-11-12 1 1
6 2 2024-11-12 2 2
7 2 2024-12-02 1 2
8 2 2025-01-13 5 3
If you want to keep the original order, you can remove the reset_index(drop=True)
part and then re-sort after the computation is done.