I'm using the multiprocess
library to speed up a CPU-bound task (a method inside a user-defined class).
The method processes one page of a document; in my example, a 500-page document takes around 20 seconds sequentially (about 0.04 seconds per page). To simulate the per-page work, I increment a counter up to 2,000,000.
dummy.py
from multiprocess import Pool

class DummyClass:
    def __init__(self, workers=1):
        self.workers = workers

    # Simulate CPU-intensive task
    def _process_one(self, page):
        count = 0
        while count < 2_000_000:
            count += 1
        return page

    # Process with "multiprocess"
    def multiprocess(self, pages):
        with Pool(processes=self.workers) as pool:
            async_results = pool.map_async(self._process_one, pages)
            extraction = async_results.get()
        return extraction

    # Process sequentially
    def sequential(self, pages):
        extraction = []
        for page in pages:
            extraction.append(self._process_one(page))
        return extraction
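(As an aside on the pool call: as far as I understand, map_async(...).get() blocks until all results are back, so it behaves like a plain map here. One variation I've considered, but not benchmarked, is a blocking map with an explicit chunksize to reduce per-task dispatch overhead; the chunksize value below is a guess, not something I've tuned.)

    # Sketch: same pool call as a blocking map with an explicit chunksize
    def multiprocess_chunked(self, pages):
        with Pool(processes=self.workers) as pool:
            # chunksize=32 is a guess; larger chunks mean fewer dispatches
            extraction = pool.map(self._process_one, pages, chunksize=32)
        return extraction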
test.py
import time

from dummy import DummyClass

# Sequential with dummy method
def dummy_sequential():
    dummy_extractor = DummyClass()
    extraction = dummy_extractor.sequential(range(500))
    return extraction

# Multiprocessing with dummy method
def dummy_multiprocess(workers):
    dummy_extractor = DummyClass(workers=workers)
    extraction = dummy_extractor.multiprocess(range(500))
    return extraction
Testing sequential:
if __name__ == "__main__":
    ini = time.time()
    extraction = dummy_sequential()
    fin = time.time()
    print("Time: ", fin - ini, "seconds")
Prints out:
Time: 19.12088394165039 seconds
Testing multiprocess with different numbers of workers:
if __name__ == "__main__":
    for i in range(2, 9):
        ini = time.time()
        extraction = dummy_multiprocess(workers=i)
        fin = time.time()
        print(f"Time with {i} workers", fin - ini, "seconds")
Prints out:
Time with 2 workers 13.7001051902771 seconds
Time with 3 workers 11.189585208892822 seconds
Time with 4 workers 11.595974683761597 seconds
Time with 5 workers 12.016109228134155 seconds
Time with 6 workers 12.690005540847778 seconds
Time with 7 workers 13.012137651443481 seconds
Time with 8 workers 13.412734508514404 seconds
So 3 workers appears to be the optimal number here; with more workers, the time slowly climbs back up.
However, this process needs to be as fast as possible. If a 500-page document takes 20 seconds sequentially, I would like to get it under 2 seconds (my computer has 16 CPU cores), but the fastest I can get right now is about 11 seconds.
I understand multiprocessing carries some overhead, but this seems like too much. Is there another way to make it faster?
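To quantify what I mean by overhead, this is the kind of measurement I have in mind: mapping a no-op task over the same 500 "pages", so any elapsed time is pure pool startup plus pickling/dispatch cost (the _noop name is just for illustration):

import time
from multiprocess import Pool

# Does no work, so any measured time is pure pool overhead
def _noop(page):
    return page

if __name__ == "__main__":
    ini = time.time()
    with Pool(processes=8) as pool:
        pool.map(_noop, range(500))
    fin = time.time()
    print("Pure pool overhead:", fin - ini, "seconds")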
Thank you