Reducing execution time when generating huge lists

Problem description

I'm fairly new to Python, and I'm trying to write some huge lists (with random letters inside). Currently it takes around 75-80 seconds on my machine for 2,000,000 lines.

    import timeit
    import random, string

    global_tab = []
    global_nb_loop = 2000000

    print("Generate %d lines" % global_nb_loop)
    global_tab = []
    for x in range(global_nb_loop):
        global_tab.append(("".join( [random.choice(string.ascii_letters) for i in range(15)] ), "".join( [random.choice(string.digits) for i in range(2)])))
    print("%d lines generated" % len(global_tab))

And the result with the Linux time command:

    $ time python3 DEV/PyETL/generateList.py
    Generate 2000000 lines
    2000000 lines generated

    real    1m16.844s
    user    1m16.609s
    sys     0m0.203s

I was surprised when monitoring system resources that only 1 core was at 100%, instead of 4 like on a Windows machine on which I've tested this too.

Of course I've tried to apply some threads, but I'm facing a problem: it takes more time than running on a single core. Maybe threads are not the solution or I'm probably using them wrong.

Here is the new code:

    import random, string
    import threading

    global_tab = []
    global_nb_threads = 4
    global_nb_loop = 2000000

    threadLock = threading.Lock()

    class generateList(threading.Thread):
        def __init__(self, name):
            threading.Thread.__init__(self)
            self.name = name

        def run(self):
            global global_tab
            self.tab = []

            print("[%s] Generate %d lines" % (self.name, int(global_nb_loop/global_nb_threads)))
            # divide the desired number of lines by the number of threads
            for x in range(int(global_nb_loop/global_nb_threads)):
                self.tab.append(("".join( [random.choice(string.ascii_letters) for i in range(15)] ), "".join( [random.choice(string.digits) for i in range(2)])))

            threadLock.acquire()
            global_tab += self.tab
            threadLock.release()
            del self.tab
            print("[%s] %d lines in list" % (self.name, len(global_tab)))

    for i in range(global_nb_threads):
        # Create threads
        t = generateList("Thread-" + str(i))
        # Start
        t.start()

    for i in range(global_nb_threads):
        # Wait for threads end
        t.join()

Execution:

    $ time python3 DEV/PyETL/generateListThreads.py
    [Thread-0] Generate 500000 lines
    [Thread-1] Generate 500000 lines
    [Thread-2] Generate 500000 lines
    [Thread-3] Generate 500000 lines
    [Thread-3] 500000 lines in list
    [Thread-0] 1000000 lines in list
    [Thread-2] 1500000 lines in list
    [Thread-1] 2000000 lines in list

    real    1m40.858s
    user    1m41.208s
    sys     0m0.916s

That is about 24 seconds slower than the single-core run at 100% (1m40.858s versus 1m16.844s), yet monitoring shows that the 8 cores were at 20-40% load at the same time.

Since all threads are working at the same time, generating fewer rows and synchronizing only for updating a global variable, shouldn't the execution time be lower than a single core?

Recommended answer

I am pretty sure your lock is not necessary and is slowing you down. (edit: actually, I just noticed the lock is used after the majority of the work is done, so isn't really relevant.)

global_tab += self.tab is (I think) atomic through the Python GIL. (Actually, the reference I found only claims this for list.extend(), so use that instead. Here's another reference: "Are lists thread safe?")
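As a rough illustration of the extend() approach, here is a minimal sketch in which each thread fills a private list and merges it into a shared list with one list.extend() call and no explicit lock. The names extend_from_threads, worker, and shared are made up for this example, and the lock-free merge relies on how CPython handles built-in list operations, so treat it as an assumption rather than a language guarantee.

```python
import threading


def extend_from_threads(n_threads=4, items_per_thread=10000):
    """Each thread builds a private list, then merges it with one extend() call."""
    shared = []

    def worker(thread_id):
        # All the expensive work happens on a thread-local list.
        local = [(thread_id, i) for i in range(items_per_thread)]
        # Single merge call, no explicit lock (CPython assumption).
        shared.extend(local)

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(n_threads)]
    for t in threads:
        t.start()
    # Join every thread so the result is complete before it is returned.
    for t in threads:
        t.join()
    return shared


if __name__ == "__main__":
    merged = extend_from_threads()
    print(len(merged))
```

If the merge step ever becomes more than a single call on a built-in list (for example, a read followed by a write), a threading.Lock is the safer choice.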

Alternatively, I would try multiprocessing.Pool.imap_unordered with a large chunksize. The downside is that the results have to be streamed back from the worker processes to the parent, but the cost of your random string processing will probably dwarf that.

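    import multiprocessing
    import random
    import string

    def randomword(x):
        # x is the task index supplied by the pool; it is not used.
        return ''.join(random.choice(string.ascii_letters) for i in range(15))

    # The guard keeps worker processes from re-running this block on
    # platforms that start workers by importing the main module.
    if __name__ == "__main__":
        pool = multiprocessing.Pool(8)
        results = pool.imap_unordered(randomword, range(100))
        print([r for r in results])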

For 2 million strings (I changed it to print the length):

    $ time python r.py
    2000000

    real    0m38.305s
    user    1m31.717s
    sys     0m25.853s

I also tried cleaning up your version a bit and got:

    $ time python rr.py
    [Thread-0] Generate 250000 lines
    [Thread-1] Generate 250000 lines
    [Thread-2] Generate 250000 lines
    [Thread-3] Generate 250000 lines
    [Thread-4] Generate 250000 lines
    [Thread-5] Generate 250000 lines
    [Thread-6] Generate 250000 lines
    [Thread-7] Generate 250000 lines
    [Thread-4] 250000 lines in list
    [Thread-1] 500000 lines in list
    [Thread-7] 750000 lines in list
    [Thread-0] 1000000 lines in list
    [Thread-6] 1250000 lines in list
    [Thread-2] 1500000 lines in list
    [Thread-3] 1750000 lines in list
    [Thread-5] 2000000 lines in list

    real    0m22.113s
    user    0m24.969s
    sys     0m5.537s

A few important changes:

- use xrange() on the large ranges (ah, Python 3's range() already does this.)
- remove the thread lock
- use extend() on the global list.

(my results were about the same when just appending to the global_tab, btw, and leaving out the temporary list.)

    import random, string
    import threading

    global_tab = []
    global_nb_threads = 8
    global_nb_loop = 2000000

    class generateList(threading.Thread):
        def __init__(self, name):
            threading.Thread.__init__(self)
            self.name = name

        def run(self):
            global global_tab
            self.tab = []

            print("[%s] Generate %d lines" % (self.name, int(global_nb_loop/global_nb_threads)))
            for x in range(int(global_nb_loop/global_nb_threads)):
                self.tab.append(("".join( [random.choice(string.ascii_letters) for i in range(15)] ), "".join( [random.choice(string.digits) for i in range(2)])))

            # Merge with a single extend() call; no explicit lock.
            global_tab.extend(self.tab)
            print("[%s] %d lines in list" % (self.name, len(global_tab)))

    threads = []
    for i in range(global_nb_threads):
        t = generateList("Thread-" + str(i))
        t.start()
        threads.append(t)

    # Join every thread, not just the last one created.
    for t in threads:
        t.join()

...but, single threaded is still slightly faster at 16 seconds.
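The string generation here is pure-Python, CPU-bound work, and in standard CPython the GIL lets only one thread execute Python bytecode at a time, which is consistent with threads not helping. A small harness along these lines can be used to compare the two on your own machine. The function names (make_rows, timed_single, timed_threaded) and the reduced row count are made up for this sketch, and no particular timing is claimed.

```python
import random
import string
import threading
import time


def make_rows(count):
    """Build `count` (letters, digits) tuples like the question's loop."""
    return [
        ("".join(random.choice(string.ascii_letters) for _ in range(15)),
         "".join(random.choice(string.digits) for _ in range(2)))
        for _ in range(count)
    ]


def timed_single(total):
    """Generate all rows in the calling thread and return (rows, seconds)."""
    start = time.perf_counter()
    rows = make_rows(total)
    return rows, time.perf_counter() - start


def timed_threaded(total, n_threads):
    """Split the work across threads and return (rows, seconds)."""
    rows = []

    def worker(count):
        rows.extend(make_rows(count))

    per_thread = total // n_threads  # assumes total divides evenly
    threads = [threading.Thread(target=worker, args=(per_thread,))
               for _ in range(n_threads)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return rows, time.perf_counter() - start


if __name__ == "__main__":
    total = 200000  # smaller than the question's 2,000,000 to keep the run short
    _, single = timed_single(total)
    _, threaded = timed_threaded(total, 8)
    print("single: %.2fs  threaded: %.2fs" % (single, threaded))
```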

If I tune multiprocessing, I can get it down to 6 seconds:

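    # Reuses randomword() and the imports from the earlier example.
    if __name__ == "__main__":
        size = 2000000
        processes = 8
        pool = multiprocessing.Pool(processes)
        # One large chunk per worker keeps inter-process traffic low.
        results = [r for r in pool.imap_unordered(randomword, range(size),
                                                  chunksize=int(size/processes))]
        print(len(results))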

Output:

    $ time python r.py
    2000000

    real    0m5.713s
    user    0m35.594s
    sys     0m0.546s

...so I think that's my final answer: Use multiprocessing.
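For completeness, here is a hedged sketch of how the multiprocessing approach might be applied to the question's actual rows (15 random letters plus 2 random digits). Instead of one task per row with a large chunksize, it hands each worker a single batch and uses Pool.map, which is a variation on the answer's code rather than the same thing. The helper names split_counts, make_batch, and generate are invented for this example.

```python
import multiprocessing
import random
import string


def split_counts(total, parts):
    """Split `total` into `parts` near-equal counts that sum to `total`."""
    base, extra = divmod(total, parts)
    return [base + 1 if i < extra else base for i in range(parts)]


def make_batch(count):
    """Runs in a worker process: build `count` (letters, digits) tuples."""
    return [
        ("".join(random.choice(string.ascii_letters) for _ in range(15)),
         "".join(random.choice(string.digits) for _ in range(2)))
        for _ in range(count)
    ]


def generate(total, processes):
    """Generate `total` rows using one batch per worker process."""
    with multiprocessing.Pool(processes) as pool:
        batches = pool.map(make_batch, split_counts(total, processes))
    # Flatten the per-worker batches into one list.
    return [row for batch in batches for row in batch]


if __name__ == "__main__":
    # Raise the total to 2000000 to match the question's workload.
    rows = generate(200000, 8)
    print(len(rows))
```

Each worker still has to send its whole batch back to the parent, so the memory and transfer cost grows with the row count; the sketch only reduces the number of round trips.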
