
python - Pandas performance while iterating a state vector - Stack Overflow


I want to make a pandas dataframe that describes the state of a system at different times:

  • I have the initial state, which describes the first row
  • Each row corresponds to a time
  • I have reserved the first two columns for "household" / statistics
  • The following columns are state parameters
  • At each iteration/row a number of parameters change - this could be just one or many

I have created a somewhat simplified version that simulates my change data: df_change

Question 1

Can you think of a more efficient way of generating the matrix than what I do in this code? I have a state that I update in a loop and then insert.

Question 2

This is what I discovered while trying to write the sample code for this discussion. I see a 20-fold boost in loop iteration performance if I do the assignment to the "household" columns after the loop. Why is this? I am using Python 3.12.4 and pandas 2.2.2. The assignment in question:

df["product"] ="some_product"
#%%

import numpy as np
import pandas as pd
from tqdm import tqdm

num_cols = 600
n_changes = 40000

# simulate changes

extra_colums = ["n_changes", "product"]
columns = [chr(i + 65) for i in range(num_cols)]  # 600 distinct single-character column names

state = {icol: np.random.random() for icol in columns}

change_index = np.random.randint(0, 4, n_changes).cumsum()
change_col = [columns[np.random.randint(0, num_cols)] for i in range(n_changes)]
change_val = np.random.normal(size=n_changes)

# create change matrix
df_change = pd.DataFrame(index=change_index)
df_change['col'] = change_col
df_change['val'] = change_val
index = np.unique(change_index)


# %%
# Slow approach gives ~5 iterations/s
df = pd.DataFrame(index=index, columns=extra_colums + columns)
df["product"] = "some_product"
for i in tqdm(index):
    state.update(zip(df_change.loc[[i], "col"], df_change.loc[[i], "val"]))
    df.loc[i, columns] = pd.Series(state)

# %%
# Fast approach gives ~1000 iterations/s
df2 = pd.DataFrame(index=index, columns=extra_colums + columns)
for i in tqdm(index):
    state.update(zip(df_change.loc[[i], "col"], df_change.loc[[i], "val"]))
    df2.loc[i, columns] = pd.Series(state)
df2["product"] = "some_product"

Edit

I marked the answer by ouroboros1 as the accepted solution - it works really well and answered Question 1.

I am still curious about Question 2: the difference in pandas performance between the two methods where I iterate through the rows. I found that I can also get performance similar to the original "df2" method, depending on how I assign the value before the loop.

The interesting point here is that the pre-assignment changes the performance of the loop that follows.

# Fast approach gives 1000 iterations/sec
df3 = pd.DataFrame(index=index, columns=extra_colums + columns)

#df3.loc[index,"product"] = "some_product" # Fast  
#df3["product"] = "some_product"           # Slow  
df3.product = "some_product"               # Fast

for i in tqdm(index):
    state.update(zip(df_change.loc[[i], "col"], df_change.loc[[i], "val"]))
    df3.loc[i, columns] = np.array(list(state.values())) 
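
One detail worth flagging about the third variant (an observation about pandas 2.x attribute access, not something stated in the original post): since product already exists as a DataFrame method, df3.product = "some_product" is plain attribute assignment. It shadows the method on the instance and never writes to the "product" column, so the frame's internal block layout is untouched, which would explain why it behaves like the no-pre-assignment case. The private _mgr attribute (internal API, liable to change between versions) can be used to check how each style of assignment affects the block count:

# Sketch using pandas internals (private API, may change between versions):
# count the blocks the BlockManager holds after each style of assignment.
probe = pd.DataFrame(index=index, columns=extra_colums + columns)
print(probe._mgr.nblocks)           # 1: a single all-object block

probe["product"] = "some_product"   # real column assignment
print(probe._mgr.nblocks)           # 3 per the comments below: the block is split

probe2 = pd.DataFrame(index=index, columns=extra_colums + columns)
probe2.product = "some_product"     # attribute assignment only
print(probe2._mgr.nblocks)          # still 1
print(probe2["product"].isna().all())  # True: the column was never written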



  • Almost certainly, it would be faster not to update a pandas dataframe row by row. All at once would be best, or even just large chunks if you cannot spare the memory – juanpa.arrivillaga Commented Feb 4 at 16:08
  • As for why moving the df["product"] = "some_product" changes things so much, I suspect it's because in the latter case you are working with a single block because you have a single dtype, but in the first case you are working with two blocks, although the internals of pandas have changed quite a lot since I last dug into them. But you can inspect df._data to see what's going on if you want to dig into the guts of it. I would also try using a pyarrow-backed dataframe and see how it compares – juanpa.arrivillaga Commented Feb 4 at 16:12
  • Note, btw, it looks like you are creating an all object-dtype dataframe, but it's still split up into three blocks when you do df["product"] = "some_product" first! – juanpa.arrivillaga Commented Feb 4 at 16:19
  • @juanpa.arrivillaga: I tried to look through your clues, but couldn't get closer to an answer for Question 2. I have edited the question with some new knowledge, but I am still puzzled – user1573820 Commented Feb 5 at 13:33
  • Thanks for taking time to answer. What baffled me was that the way I assign column "product" before the loop has an impact on how the loop performs, so maybe that should have its own question. Your solution showed that I don't need a loop. – user1573820 Commented Feb 5 at 15:10
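As a footnote to the comments above: the suggested pyarrow-backed comparison could be set up roughly like this (a sketch, not from the original thread; it assumes the pyarrow package is installed):

# Sketch only (assumes pyarrow is installed): an Arrow-backed float frame
# for the state columns, to compare against the NumPy-backed frames above
# by rerunning the same timing loop on df_pa.
df_pa = pd.DataFrame(np.nan, index=index, columns=columns).astype("double[pyarrow]")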

1 Answer

Here's one approach that should be much faster:

Data sample

num_cols = 4
n_changes = 6
np.random.seed(0) # reproducibility

# setup ...
df_change

   col       val
1    C  0.144044
4    A  1.454274
5    A  0.761038
7    A  0.121675
7    C  0.443863
10   B  0.333674
state

{'A': 0.5488135039273248,
 'B': 0.7151893663724195,
 'C': 0.6027633760716439,
 'D': 0.5448831829968969}

Code

out = (df_change.reset_index()
       .pivot_table(index='index', 
                    columns='col', 
                    values='val',
                    aggfunc='last')
       .rename_axis(index=None, columns=None)
       .assign(product='some_product')
       .reindex(columns=extra_colums + columns)
       .fillna(pd.DataFrame(state, index=[index[0]]))
       .ffill()
       )

Output

    n_changes       product         A         B         C         D
1         NaN  some_product  0.548814  0.715189  0.144044  0.544883
4         NaN  some_product  1.454274  0.715189  0.144044  0.544883
5         NaN  some_product  0.761038  0.715189  0.144044  0.544883
7         NaN  some_product  0.121675  0.715189  0.443863  0.544883
10        NaN  some_product  0.121675  0.333674  0.443863  0.544883

# note:
# A updated in 4, 5, 7
# B updated in 10
# C updated in 1, 7

Explanation / Intermediates

  • Use df.reset_index to access 'index' inside df.pivot_table. For the aggfunc use 'last'. I.e., we only need to propagate the last value in case of duplicate 'col' values per index value.
  • Cosmetic: use df.rename_axis to reset index and columns names to None.
# df_change.reset_index().pivot_table(...).rename_axis(...)

           A         B         C
1        NaN       NaN  0.144044
4   1.454274       NaN       NaN
5   0.761038       NaN       NaN
7   0.121675       NaN  0.443863
10       NaN  0.333674       NaN
  • Use df.assign to add column 'product' with a scalar ('some_product').
  • Use df.reindex to get the columns in the desired order (with extra_colums up front). The not-yet-existing column 'n_changes' will be added with NaN values.
  • Now, apply df.fillna and use a pd.DataFrame with state for only the first index value (index[0]) to fill the first row (alternatively, use df.combine_first).
# after .fillna(...)

    n_changes       product         A         B         C         D
1         NaN  some_product  0.548814  0.715189  0.144044  0.544883
4         NaN  some_product  1.454274       NaN       NaN       NaN
5         NaN  some_product  0.761038       NaN       NaN       NaN
7         NaN  some_product  0.121675       NaN  0.443863       NaN
10        NaN  some_product       NaN  0.333674       NaN       NaN
  • Finally, we want to forward fill: df.ffill.
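
Note that the 'n_changes' housekeeping column is left entirely NaN by this approach. If it is meant to hold the number of changes applied at each time step (an assumption; the question reserves the column but never defines it), it could be filled from df_change directly:

# Assumption: "n_changes" counts the changes applied per time step.
# df_change has one row per change, indexed by time step, so the group
# size per index value gives the count; assignment aligns on the index.
out["n_changes"] = df_change.groupby(level=0).size()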

Performance comparison:

num_cols = 100
n_changes = 100
np.random.seed(0) # reproducibility

# out:
7.01 ms ± 435 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# df2 (running this *afterwards*, as you are updating `state`)
93.7 ms ± 3.79 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
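
For reference, a minimal IPython harness that could produce such numbers (a sketch; the exact benchmark code isn't shown in the answer):

# Sketch: time the vectorized pipeline in IPython. Wrapping it in a
# function lets %timeit repeat it cleanly without mutating state.
def build_out():
    return (df_change.reset_index()
            .pivot_table(index='index', columns='col',
                         values='val', aggfunc='last')
            .rename_axis(index=None, columns=None)
            .assign(product='some_product')
            .reindex(columns=extra_colums + columns)
            .fillna(pd.DataFrame(state, index=[index[0]]))
            .ffill())

%timeit build_out()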

Equality check:

df2.astype(out.dtypes).equals(out)
# True