I have a Pandas dataframe with columns that need to be sequentially calculated. Column A helps calculate Column B which helps calculate Column F, etc. Processing can be slow as Python is using only 1 thread because of the GIL.
I'm trying to have:
- First block of code run and finish normally.
- Functions 1-3 run after first block of code, use multiprocessing, reference code above and can be used in code below.
- 2nd block of code runs after Function 3 and references all lines above.
Tried putting resource heavy blocks of code in functions and using multiprocessing on them only, but ran into 2 big issues:
- New columns created in functions couldn’t reference code above and couldn’t be referenced in code below.
- When creating local variables in each function to reference the global dataframe, multiprocessing took much longer than normal processing.
Part of my code so far. Newish to Python but trying.
import pandas as pd
import multiprocessing as mp
## I pull some data
df['Date'] = df['timestamp'].apply(lambda x: pd.to_datetime(x*1000000))
df['volume'] = df['volume_og']
...
def functionone():
df = pd.DataFrame()
df['market_9'] = df.apply(lambda x : "9" if x['Date'] >= x['market_9_start'] and x['Date'] < x['market_9_end'] else None, axis=1)
...
def functiontwo():
df = pd.DataFrame()
...
def functionthree():
df = pd.DataFrame()
df['nine_score'] = df.apply(lambda x : x['strength'] if x['market_9'] == "9" else None, axis=1)
...
fig = make_subplots(specs=[[{"secondary_y": True}]])
fig.add_trace(
go.Bar(
x=df['Date'],
y=df['volume'],
...
), secondary_y=True,
)
if __name__ == '__main__':
p1 = mp.Process(target=functionone)
p2 = mp.Process(target=functontwo)
p3 = mp.Process(target=functionthree)
p1.start()
p2.start()
p3.start()
...
I have a Pandas dataframe with columns that need to be sequentially calculated. Column A helps calculate Column B which helps calculate Column F, etc. Processing can be slow as Python is using only 1 thread because of the GIL.
I'm trying to have:
- First block of code run and finish normally.
- Functions 1-3 run after first block of code, use multiprocessing, reference code above and can be used in code below.
- 2nd block of code runs after Function 3 and references all lines above.
Tried putting resource heavy blocks of code in functions and using multiprocessing on them only, but ran into 2 big issues:
- New columns created in functions couldn’t reference code above and couldn’t be referenced in code below.
- When creating local variables in each function to reference the global dataframe, multiprocessing took much longer than normal processing.
Part of my code so far. Newish to Python but trying.
import pandas as pd
import multiprocessing as mp
## I pull some data
df['Date'] = df['timestamp'].apply(lambda x: pd.to_datetime(x*1000000))
df['volume'] = df['volume_og']
...
def functionone():
df = pd.DataFrame()
df['market_9'] = df.apply(lambda x : "9" if x['Date'] >= x['market_9_start'] and x['Date'] < x['market_9_end'] else None, axis=1)
...
def functiontwo():
df = pd.DataFrame()
...
def functionthree():
df = pd.DataFrame()
df['nine_score'] = df.apply(lambda x : x['strength'] if x['market_9'] == "9" else None, axis=1)
...
fig = make_subplots(specs=[[{"secondary_y": True}]])
fig.add_trace(
go.Bar(
x=df['Date'],
y=df['volume'],
...
), secondary_y=True,
)
if __name__ == '__main__':
p1 = mp.Process(target=functionone)
p2 = mp.Process(target=functontwo)
p3 = mp.Process(target=functionthree)
p1.start()
p2.start()
p3.start()
...
Share
Improve this question
edited Apr 3 at 7:12
jottbe
4,5312 gold badges18 silver badges35 bronze badges
asked Mar 30 at 17:44
beezy4deuxbeezy4deux
111 bronze badge
8
|
Show 3 more comments
1 Answer
Reset to default 1The underlying functions of Pandas/Numpy are mostly using C libraries; but when you use df.apply
, you throw those out the window. Generally, there's a better way if you look into their documentation.
For example, what you have here:
df['market_9'] = df.apply(
lambda x: "9" if x['Date'] >= x['market_9_start'] and x['Date'] < x['market_9_end'] else None,
axis=1,
)
Could be re-written as:
df.loc[df.Date.ge(df.market_9_start) & df.Date.lt(df.market_9_end), "market_9"] = "9"
For a far greater speed increase than trying to use multiprocessing.
Another example - Change this:
df['Date'] = df['timestamp'].apply(lambda x: pd.to_datetime(x*1000000))
Into this:
df["Date"] = pd.to_datetime(df["timestamp"], unit="s")
df[ ['market_9', 'nine_score'] ] = df.apply(lambda x: ["9", x["strength] ] if ... else [None, None], axis=1, result_type='expand')
– furas Commented Mar 30 at 20:23polars
instead ofpandas
- maybe it will work faster. – furas Commented Mar 30 at 20:24df = pd.DataFrame()
within each function creates a new empty dataframe each time which therefore does not have any values which can be used. Please show a minimal reproducible example. – user19077881 Commented Mar 30 at 22:41