
python - Dataframe can't multiprocess and reference in functions - Stack Overflow


I have a Pandas dataframe with columns that need to be sequentially calculated. Column A helps calculate Column B which helps calculate Column F, etc. Processing can be slow as Python is using only 1 thread because of the GIL.

I'm trying to have:

  1. First block of code run and finish normally.
  2. Functions 1-3 run after first block of code, use multiprocessing, reference code above and can be used in code below.
  3. 2nd block of code runs after Function 3 and references all lines above.

I tried putting resource-heavy blocks of code into functions and using multiprocessing on them only, but ran into 2 big issues:

  1. New columns created in functions couldn’t reference code above and couldn’t be referenced in code below.
  2. When creating local variables in each function to reference the global dataframe, multiprocessing took much longer than normal processing.

Part of my code so far. Newish to Python but trying.

import pandas as pd
import multiprocessing as mp
import plotly.graph_objects as go
from plotly.subplots import make_subplots

## I pull some data

df['Date'] = df['timestamp'].apply(lambda x: pd.to_datetime(x*1000000))
df['volume'] = df['volume_og']
...

def functionone():
    df = pd.DataFrame()
    df['market_9'] = df.apply(lambda x : "9" if x['Date'] >= x['market_9_start'] and x['Date'] < x['market_9_end'] else None, axis=1)
    ...

def functiontwo():
    df = pd.DataFrame()
    ...

def functionthree():
    df = pd.DataFrame()
    df['nine_score'] = df.apply(lambda x : x['strength'] if x['market_9'] == "9" else None, axis=1)
    ...

fig = make_subplots(specs=[[{"secondary_y": True}]])

fig.add_trace( 
    go.Bar( 
        x=df['Date'],
        y=df['volume'],
        ...
    ), secondary_y=True,
)


if __name__ == '__main__':

    p1 = mp.Process(target=functionone)
    p2 = mp.Process(target=functiontwo)
    p3 = mp.Process(target=functionthree)

    p1.start()
    p2.start()
    p3.start()
    ...

asked Mar 30 at 17:44 by beezy4deux; edited Apr 3 at 7:12 by jottbe
  • If column F depends on column B, which depends on column A, then using multiprocessing seems useless - they have to be calculated one after another. Or maybe you should apply a function which creates both values at the same time and assigns them to two columns. – furas Commented Mar 30 at 20:09
  • multiprocessing doesn't share memory and it has to send all data from main process to other processes - and it can take time. – furas Commented Mar 30 at 20:12
  • df[ ['market_9', 'nine_score'] ] = df.apply(lambda x: ["9", x["strength"] ] if ... else [None, None], axis=1, result_type='expand') – furas Commented Mar 30 at 20:23
  • you may also try to use polars instead of pandas - maybe it will work faster. – furas Commented Mar 30 at 20:24
  • df = pd.DataFrame() within each function creates a new empty dataframe each time which therefore does not have any values which can be used. Please show a minimal reproducible example. – user19077881 Commented Mar 30 at 22:41
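The `result_type='expand'` comment can be made concrete. A minimal sketch, with invented values standing in for the question's columns, filling both `market_9` and `nine_score` in one `apply` pass:

```python
import pandas as pd

# Toy frame with the question's column names; the values are invented.
df = pd.DataFrame({
    "Date":           pd.to_datetime(["2024-01-01 09:30", "2024-01-01 08:00"]),
    "market_9_start": pd.to_datetime(["2024-01-01 09:00", "2024-01-01 09:00"]),
    "market_9_end":   pd.to_datetime(["2024-01-01 10:00", "2024-01-01 10:00"]),
    "strength":       [1.5, 2.0],
})

# One apply pass that returns both values per row; result_type='expand'
# spreads the returned list across the two target columns.
df[["market_9", "nine_score"]] = df.apply(
    lambda x: ["9", x["strength"]]
    if x["market_9_start"] <= x["Date"] < x["market_9_end"]
    else [None, None],
    axis=1,
    result_type="expand",
)

print(df[["market_9", "nine_score"]])
```

This avoids two separate row-by-row passes, though it is still a Python-level loop; the vectorized approach in the answer below the comments is faster still.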

1 Answer


The underlying functions of Pandas/NumPy are mostly implemented in C; but when you use df.apply with a Python lambda, you throw that speed out the window. There is generally a faster, vectorized way if you look through their documentation.

For example, what you have here:

df['market_9'] = df.apply(
    lambda x: "9" if x['Date'] >= x['market_9_start'] and x['Date'] < x['market_9_end'] else None,
    axis=1,
)

Could be re-written as:

df.loc[df.Date.ge(df.market_9_start) & df.Date.lt(df.market_9_end), "market_9"] = "9"

This typically gives a far greater speed increase than trying to use multiprocessing.
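As a quick sanity check that the mask produces the same flags as the `apply` version, here is a toy frame (column names from the question, values invented):

```python
import pandas as pd

# Toy frame with the question's column names; the values are invented.
df = pd.DataFrame({
    "Date":           pd.to_datetime(["2024-01-01 09:30", "2024-01-01 11:00"]),
    "market_9_start": pd.to_datetime(["2024-01-01 09:00", "2024-01-01 09:00"]),
    "market_9_end":   pd.to_datetime(["2024-01-01 10:00", "2024-01-01 10:00"]),
})

# Row-by-row apply, as in the question (one Python call per row).
slow = df.apply(
    lambda x: "9" if x["market_9_start"] <= x["Date"] < x["market_9_end"] else None,
    axis=1,
)

# Vectorized mask from the answer: the comparisons run over whole columns
# in C. Unmatched rows are left as NaN rather than None.
df.loc[df.Date.ge(df.market_9_start) & df.Date.lt(df.market_9_end), "market_9"] = "9"

print(df["market_9"].tolist())
```

The only difference is the missing-value marker (NaN vs. None), which rarely matters downstream.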

Another example - Change this:

df['Date'] = df['timestamp'].apply(lambda x: pd.to_datetime(x*1000000))

Into this:

df["Date"] = pd.to_datetime(df["timestamp"], unit="ms")
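A quick check that the two forms agree (assuming, from the ×1,000,000 factor, that `timestamp` holds millisecond epochs: pandas interprets bare integers as nanoseconds, and milliseconds × 1,000,000 = nanoseconds; the sample values below are invented):

```python
import pandas as pd

# Invented millisecond epoch timestamps, standing in for df['timestamp'].
ts = pd.Series([1_700_000_000_000, 1_700_000_060_000])

# The question's row-by-row version: the default integer unit is
# nanoseconds, so multiplying milliseconds by 1_000_000 lands on the
# same instant.
via_apply = ts.apply(lambda x: pd.to_datetime(x * 1_000_000))

# The vectorized version converts the whole Series in one C-level call.
vectorized = pd.to_datetime(ts, unit="ms")

print(vectorized.tolist())
```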