Here is the context for what I am trying to do:
- I have several data blocks that each consist of either 6 items or 24 items, and each item is analyzed separately. The analysis code is not mine. For reasons beyond my control, each item needs to be processed single-threaded.
- But I have made a script/function that processes one such data block at a time, by spawning 6 or 24 processes as needed, one process per item, using `joblib.Parallel()`. It works great (a simplified sketch follows after this list).
- I can know in advance if a block has 6 or 24 items.
- The computation of a single item lasts approximately 4 hours. A block of items running in parallel lasts pretty much the same. So it's an ideal case for parallelization.
- I have a single workstation with a total of 40 threads available to use. So I cannot run more than one 24-item block, but it leaves enough room for two 6-item blocks to run as well. And if the big blocks finish early, there would be room for 6 of the small blocks to run at once.
- The number of 24-item and 6-item data blocks in the pool is not necessarily equal or fixed.
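For reference, here is a simplified sketch of the block-processing function mentioned above (the names are placeholders, not my actual code):

```python
from joblib import Parallel, delayed

def analyze_item(item):
    # stand-in for the existing single-threaded analysis (~4 hours per item)
    ...

def process_block(block_items):
    # one worker process per item, so n_jobs is 6 or 24 depending on the block
    return Parallel(n_jobs=len(block_items))(
        delayed(analyze_item)(item) for item in block_items
    )
```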
Here is what I'd like to do:
I would like to run a whole pool of mixed-size data blocks from a wrapper script, while minimizing the overall time it takes, by not having too many idle cores.
My initial half-baked idea was to split the data blocks into two pools by size and have two calls to `Parallel` run my block-processing function: one submitting single jobs from the pool of large blocks, the other submitting pairs of jobs from the pool of small blocks. But then I realized that `Parallel` waits for its tasks to complete, so the second pool would only run after the first pool is done (sketched below).

I know cluster computing schedulers handle this kind of stuff, but I am on a single workstation and don't have a scheduler. The data files are too big for our network bandwidth, so buying some cloud computing and scheduling the jobs there is not practical at all.
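For concreteness, that half-baked plan would have been something like this (reusing the placeholder `process_block` from the sketch above, with `large_blocks` and `small_blocks` as hypothetical lists of blocks, and setting aside the nesting question from the PS below). The first `Parallel` call blocks until every large block is finished, which is exactly the problem:

```python
from joblib import Parallel, delayed

# Runs the large blocks one at a time, but blocks here until ALL of them are done...
Parallel(n_jobs=1)(delayed(process_block)(b) for b in large_blocks)

# ...so the small blocks, two at a time, only start once the line above returns.
Parallel(n_jobs=2)(delayed(process_block)(b) for b in small_blocks)
```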
Dissolving the data blocks and creating one massive pool of single items from across all the data blocks would probably be possible, and it would then be the easiest and most effective thing to parallelize, but it would take a non-negligible amount of effort on my part to rethink and refactor my processing code to accommodate it (conceptually, something like the sketch below). I may do it in the long term, but in the shorter term I'd like another way.
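In case it helps to picture it, the flattened version would conceptually look something like this (placeholder names again); the refactoring cost is mostly in making the analysis callable per item from the wrapper and regrouping the results per block afterwards:

```python
from joblib import Parallel, delayed

# Flatten every item from every block into one big task list, then let
# joblib keep up to 40 worker processes busy until the list is exhausted.
all_items = [item for block in all_blocks for item in block]
results = Parallel(n_jobs=40)(
    delayed(analyze_item)(item) for item in all_items
)
# The results would then need to be regrouped per block, which is part
# of the refactoring I'd rather not do right now.
```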
The last option I can think of is... to have two wrapper-script instances, one instance for the large blocks sending single tasks, and one instance for the small blocks sending pairs of tasks, and to rely on bash syntax to have them run at the same time. But that feels... unsatisfactory.
Is there a tidy way to do this, without over-complicating my setup?
PS: Actually, I don't even know if I can nest calls to `Parallel` with the innermost one spawning more processes than the outermost one's `n_jobs`. I haven't tried it yet, as I realized my initial plan wasn't going to work and I haven't come up with a better one yet. (And I am aware it is probably bad programming design.)
System
Python 3.12.8, on an old HP workstation with Ubuntu 22.04 LTS. I'm using the default `Parallel` backend; I am not IT-savvy enough to make an informed choice there.