I am trying to set up jobs with multiple steps, essentially running many independent copies of the same program, each on a single core. I decided to use this approach instead of job arrays because the cluster I am using limits each user to 20 jobs, while the maximum number of steps per job is set to the default of 40000. An example batch script looks like:
#!/bin/sh
#SBATCH --partition parallel
#SBATCH --ntasks=100
#SBATCH --cpus-per-task=1
#SBATCH --mem=1000M
#SBATCH --job-name test
#SBATCH --output test.out
for ((i=0; i<$SLURM_NTASKS; i++))
do
    # launch each copy as its own single-task job step, in the background
    srun -N1 -n1 --mem-per-cpu=10M --exact -u sleep 1200 &
done
wait
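(For reference, the limits mentioned above can be checked on the cluster itself. The commands below are my assumption about where those limits live: the per-job step limit is the slurm.conf parameter MaxStepCount, while a per-user job cap is typically a QOS or association limit.)
# per-job step limit in slurm.conf (default 40000)
scontrol show config | grep -i MaxStepCount
# per-user job cap, assuming it is enforced through a QOS
sacctmgr show qos format=Name,MaxJobsPU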
From my understanding, the expected behavior of the above script is that, once the resources are allocated, 100 job steps will be launched in parallel, each taking up a single CPU. I have also included an explicit memory request and the --exact flag, as suggested in a similar post, to prevent the first srun from taking up the entire allocated memory: parallel but different Slurm srun job step invocations not working.
Nevertheless, I still end up getting the message "Step creation temporarily disabled, retrying (Requested nodes are busy)" after a few (10-20) job steps start running. Since all necessary resources are being allocated and the memory is being evenly distributed (verified by checking with sacct), what could be preventing all the steps from running at the same time?
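(While the job is running, the number of steps that actually made it past step creation can be watched with squeue; a minimal check, assuming the job ID 624245 from the output below:)
# step-level view of the queue, filtered to this job
squeue --steps | grep 624245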
PS. I am adding below a typical sacct output. Everything seems to follow the expected behavior. The only things that are not clear to me are:
- the .batch step seems to take up many CPUs and more memory than is actually allocated to it;
- even though the steps are allocated on 5 nodes, almost all of them are concentrated on only one node. This is the typical behavior of all my tests so far, so I cannot consider it a coincidence.
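(The table below comes from sacct; an invocation along these lines, with the format string being my reconstruction to match the columns shown, produces it:)
# sacct call matching the columns below; format string is an assumption
sacct -j 624245 --format=JobID%15,MaxRSS,AllocTRES%40,NodeList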
JobID MaxRSS AllocTRES NodeList
--------------- ---------- ---------------------------------------- --------------------
624245 billing=441,cpu=100,mem=1500M,node=5 ibiscohpc-wn[26-30]
624245.batch 408612K cpu=22,mem=330M,node=1 ibiscohpc-wn26
624245.0 632K cpu=1,mem=10M,node=1 ibiscohpc-wn26
624245.1 632K cpu=1,mem=10M,node=1 ibiscohpc-wn27
624245.2 628K cpu=1,mem=10M,node=1 ibiscohpc-wn28
624245.3 632K cpu=1,mem=10M,node=1 ibiscohpc-wn29
624245.4 632K cpu=1,mem=10M,node=1 ibiscohpc-wn30
624245.5 632K cpu=1,mem=10M,node=1 ibiscohpc-wn26
624245.6 636K cpu=1,mem=10M,node=1 ibiscohpc-wn26
624245.7 628K cpu=1,mem=10M,node=1 ibiscohpc-wn26
624245.8 628K cpu=1,mem=10M,node=1 ibiscohpc-wn26
624245.9 636K cpu=1,mem=10M,node=1 ibiscohpc-wn26
624245.10 632K cpu=1,mem=10M,node=1 ibiscohpc-wn26
624245.11 632K cpu=1,mem=10M,node=1 ibiscohpc-wn26
624245.12 628K cpu=1,mem=10M,node=1 ibiscohpc-wn26
624245.13 632K cpu=1,mem=10M,node=1 ibiscohpc-wn26
624245.14 632K cpu=1,mem=10M,node=1 ibiscohpc-wn26
624245.15 632K cpu=1,mem=10M,node=1 ibiscohpc-wn26
624245.16 628K cpu=1,mem=10M,node=1 ibiscohpc-wn26
624245.17 632K cpu=1,mem=10M,node=1 ibiscohpc-wn26
624245.18 628K cpu=1,mem=10M,node=1 ibiscohpc-wn26
624245.19 632K cpu=1,mem=10M,node=1 ibiscohpc-wn26
624245.20 628K cpu=1,mem=10M,node=1 ibiscohpc-wn26
624245.21 632K cpu=1,mem=10M,node=1 ibiscohpc-wn26
624245.22 636K cpu=1,mem=10M,node=1 ibiscohpc-wn26
624245.23 628K cpu=1,mem=10M,node=1 ibiscohpc-wn26
624245.24 628K cpu=1,mem=10M,node=1 ibiscohpc-wn26
624245.25 628K cpu=1,mem=10M,node=1 ibiscohpc-wn26
1 Answer
I have not been able to find a satisfying answer to submitting tasks on multiple nodes using job steps. However, I found that in my case (multiple identical runs) what works really well is to submit only one job step, split into many tasks. The batch script then looks like:
#!/bin/sh
#SBATCH --partition parallel
#SBATCH --ntasks=100
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=100M
#SBATCH --job-name test
#SBATCH --output test.out
srun -n100 -u exec.sh
with the executable script exec.sh containing expressions that use the variable $SLURM_PROCID to differentiate between the tasks. For example:
#!/bin/sh
echo $SLURM_PROCID
sleep 1200
This results in the desired behavior, but from what I understand it has some drawbacks compared to submitting separate job steps when it comes to independently controlling each task. However, until a better alternative is found, this is the only approach that seems to work for this use case.
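As a sketch of what exec.sh could look like for real work, using $SLURM_PROCID to pick a per-task input file (the program name and file naming below are hypothetical, not part of the setup above):
#!/bin/sh
# Hypothetical per-task workload: task i reads input_i.dat and writes output_i.log
INPUT="input_${SLURM_PROCID}.dat"
./my_program "$INPUT" > "output_${SLURM_PROCID}.log"
Each task then runs its own copy of the program on its own data, so the placeholder sleep is no longer needed.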
Comments:
– Have you tried replacing --mem=1000M with --mem-per-cpu=15M in the header of the submission script? The --mem option is per job and does not necessarily ensure that each task has access to the same portion of allocated memory on each node. Or, if the nodes have more than 100 CPUs, use --nodes=1 to run all tasks on the same node, in which case --mem does not have that limitation. – damienfrancois, Nov 21, 2024 at 10:39
– What about --mem-per-cpu=1G in the submission script? I assume the cluster has more than 1G per CPU available on the nodes? – damienfrancois, Nov 22, 2024 at 8:51
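A minimal sketch of the header change suggested in the first comment, keeping the rest of the original script unchanged (the 15M value is taken from the comment, not tested):
#!/bin/sh
#SBATCH --partition parallel
#SBATCH --ntasks=100
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=15M
#SBATCH --job-name test
#SBATCH --output test.out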