I am trying to set up jobs with multiple steps, essentially running many independent copies of the same program, each on a single core. I decided to use this approach instead of job arrays because the cluster I am using limits each user to 20 jobs, while the maximum number of steps per job is set to the default of 40000. An example batch script looks like:
#!/bin/sh
#SBATCH --partition parallel
#SBATCH --ntasks=100
#SBATCH --cpus-per-task=1
#SBATCH --mem=1000M
#SBATCH --job-name test
#SBATCH --output test.out
for ((i=0; i<$SLURM_NTASKS; i++))
do
    # launch each copy as its own single-task job step, in the background
    srun -N1 -n1 --mem-per-cpu=10M --exact -u sleep 1200 &
done
wait
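(For reference, the limits mentioned above can be checked on the cluster itself. The commands below are my assumption about where those limits live: the per-job step limit is the slurm.conf parameter MaxStepCount, while a per-user job cap is typically a QOS or association limit.)
# per-job step limit in slurm.conf (default 40000)
scontrol show config | grep -i MaxStepCount
# per-user job cap, assuming it is enforced through a QOS
sacctmgr show qos format=Name,MaxJobsPU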
From my understanding, the expected behavior of the above script is that, once the resources are allocated, 100 job steps will be launched in parallel, each taking up a single CPU. I have also included an explicit memory request and the --exact flag, as suggested in a similar post, to prevent the first srun from taking up the entire allocated memory: parallel but different Slurm srun job step invocations not working.
Nevertheless, I still end up getting the message "Step creation temporarily disabled, retrying (Requested nodes are busy)" after a few (10-20) job steps start running. Since all necessary resources are being allocated and the memory is being evenly distributed (verified by checking with sacct), what could be preventing all the steps from running at the same time?
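(While the job is running, the number of steps that actually made it past step creation can be watched with squeue; a minimal check, assuming the job ID 624245 from the output below:)
# step-level view of the queue, filtered to this job
squeue --steps | grep 624245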
PS. I am adding below a typical sacct output. Everything seems to follow the expected behavior. The only things that are not clear to me are:
- the .batch step seems to take up many CPUs and more memory than is actually allocated to it;
- even though the steps are allocated on 5 nodes, almost all of them are concentrated on only one node. This is the typical behavior of all my tests so far, so I cannot consider it a coincidence.
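(The table below comes from sacct; an invocation along these lines, with the format string being my reconstruction to match the columns shown, produces it:)
# sacct call matching the columns below; format string is an assumption
sacct -j 624245 --format=JobID%15,MaxRSS,AllocTRES%40,NodeList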
JobID MaxRSS AllocTRES NodeList
--------------- ---------- ---------------------------------------- --------------------
624245 billing=441,cpu=100,mem=1500M,node=5 ibiscohpc-wn[26-30]
624245.batch 408612K cpu=22,mem=330M,node=1 ibiscohpc-wn26
624245.0 632K cpu=1,mem=10M,node=1 ibiscohpc-wn26
624245.1 632K cpu=1,mem=10M,node=1 ibiscohpc-wn27
624245.2 628K cpu=1,mem=10M,node=1 ibiscohpc-wn28
624245.3 632K cpu=1,mem=10M,node=1 ibiscohpc-wn29
624245.4 632K cpu=1,mem=10M,node=1 ibiscohpc-wn30
624245.5 632K cpu=1,mem=10M,node=1 ibiscohpc-wn26
624245.6 636K cpu=1,mem=10M,node=1 ibiscohpc-wn26
624245.7 628K cpu=1,mem=10M,node=1 ibiscohpc-wn26
624245.8 628K cpu=1,mem=10M,node=1 ibiscohpc-wn26
624245.9 636K cpu=1,mem=10M,node=1 ibiscohpc-wn26
624245.10 632K cpu=1,mem=10M,node=1 ibiscohpc-wn26
624245.11 632K cpu=1,mem=10M,node=1 ibiscohpc-wn26
624245.12 628K cpu=1,mem=10M,node=1 ibiscohpc-wn26
624245.13 632K cpu=1,mem=10M,node=1 ibiscohpc-wn26
624245.14 632K cpu=1,mem=10M,node=1 ibiscohpc-wn26
624245.15 632K cpu=1,mem=10M,node=1 ibiscohpc-wn26
624245.16 628K cpu=1,mem=10M,node=1 ibiscohpc-wn26
624245.17 632K cpu=1,mem=10M,node=1 ibiscohpc-wn26
624245.18 628K cpu=1,mem=10M,node=1 ibiscohpc-wn26
624245.19 632K cpu=1,mem=10M,node=1 ibiscohpc-wn26
624245.20 628K cpu=1,mem=10M,node=1 ibiscohpc-wn26
624245.21 632K cpu=1,mem=10M,node=1 ibiscohpc-wn26
624245.22 636K cpu=1,mem=10M,node=1 ibiscohpc-wn26
624245.23 628K cpu=1,mem=10M,node=1 ibiscohpc-wn26
624245.24 628K cpu=1,mem=10M,node=1 ibiscohpc-wn26
624245.25 628K cpu=1,mem=10M,node=1 ibiscohpc-wn26
1 Answer
I have not been able to find a satisfying answer to submitting tasks on multiple nodes using job steps. However, I found that in my case (multiple identical runs) what works really well is to submit only one job step, split into many tasks. The batch script then looks like:
#!/bin/sh
#SBATCH --partition parallel
#SBATCH --ntasks=100
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=100M
#SBATCH --job-name test
#SBATCH --output test.out
srun -n100 -u exec.sh
with the executable script exec.sh containing expressions that use the variable $SLURM_PROCID to differentiate between the tasks. For example:
#!/bin/sh
echo $SLURM_PROCID
sleep 1200
This results in the desired behavior, but from what I understand it has some drawbacks compared to submitting separate job steps when it comes to independently controlling each task. However, until a better alternative is found, this is the only approach that seems to work for this use case.
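As a sketch of what exec.sh could look like for real work, using $SLURM_PROCID to pick a per-task input file (the program name and file naming below are hypothetical, not part of the setup above):
#!/bin/sh
# Hypothetical per-task workload: task i reads input_i.dat and writes output_i.log
INPUT="input_${SLURM_PROCID}.dat"
./my_program "$INPUT" > "output_${SLURM_PROCID}.log"
Each task then runs its own copy of the program on its own data, so the placeholder sleep is no longer needed.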
Comments:
– Have you tried replacing --mem=1000M with --mem-per-cpu=15M in the header of the submission script? The --mem option is per job and does not necessarily ensure that each task has access to the same portion of allocated memory on each node. Or, if the nodes have more than 100 CPUs, use --nodes=1 to run all tasks on the same node, in which case --mem does not have that limitation. – damienfrancois, Nov 21, 2024 at 10:39
– What about --mem-per-cpu=1G in the submission script? I assume the cluster has more than 1G per CPU available on the nodes? – damienfrancois, Nov 22, 2024 at 8:51
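A minimal sketch of the header change suggested in the first comment, keeping the rest of the original script unchanged (the 15M value is taken from the comment, not tested):
#!/bin/sh
#SBATCH --partition parallel
#SBATCH --ntasks=100
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=15M
#SBATCH --job-name test
#SBATCH --output test.out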