Given that cloud nodes can easily be destroyed and rebuilt when they encounter an issue serious enough for Slurm to set them to DOWN, is it possible to trigger a script whenever a node is set to DOWN?
We would like to avoid manual intervention by the cluster users as much as possible, and I feel that automatically resolving DOWN by terminating and restarting the server [1] would be a simple fix for that. We do not really have big issues with failing nodes, but it feels like a good precaution.
I am aware of ReturnToService, but I don't think it fits my issue: it allows nodes to return to service without being shut down first, so the underlying problem might still be there because the node is still the same server.
[1] Restarting in this context means triggering the SuspendProgram, which deletes the server and sets the node to RESUME. That allows Slurm to include the node in the scheduling process again, and once a job is scheduled onto it, the ResumeProgram creates the server and installs the required packages.
Asked Jan 21 at 11:12 by Natan

1 Answer
Unfortunately, there is no direct technique within (or provided by) Slurm for this. I believe this is intentional: a node can be DOWN for many different reasons, and admins know best how to deal with down nodes depending on what caused the issue. In my opinion, this logic is deliberately kept separate from Slurm to make life easier for admins.
Nevertheless, you can write your own simple bash script that runs periodically to achieve your goal:
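#!/usr/bin/env bash
# List responding nodes as "name state" and keep those whose state is exactly "down".
# sort -u drops duplicates, since -N lists a node once per partition it belongs to.
DOWN_NODES=$(sinfo -h -r -N -o "%N %T" | awk '$2=="down" {print $1}' | sort -u)

# Node names contain no whitespace, so word splitting on the list is intended here.
for node in $DOWN_NODES; do
    echo "Node $node is DOWN; setting it to POWER_DOWN..."
    scontrol update NodeName="$node" State=POWER_DOWN Reason="auto-down"
done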
This script checks for DOWN nodes and, if there are any, calls the scontrol command to update their state to POWER_DOWN. If power saving is configured correctly in the configuration file (slurm.conf), this triggers the SuspendProgram script.
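If you want to try the logic before letting it touch real nodes, one option is to split the script into two small functions and add a dry-run switch. This is only a sketch: the function names, the DRY_RUN variable and the --run flag are my own assumptions, not something Slurm provides. Note that the exact match on "down" skips suffixed states such as "down*".

```shell
#!/usr/bin/env bash

# Print the unique names of nodes whose state is exactly "down".
# Reads "name state" lines on stdin, the format produced by:
#   sinfo -h -r -N -o "%N %T"
# Suffixed states such as "down*" do not match this exact comparison.
filter_down_nodes() {
    awk '$2 == "down" { print $1 }' | sort -u
}

# Set every node named on stdin to POWER_DOWN.
# With DRY_RUN=1 (hypothetical switch) the commands are only printed.
power_down_nodes() {
    local node
    while read -r node; do
        [[ -n "$node" ]] || continue
        if [[ "${DRY_RUN:-0}" == "1" ]]; then
            echo "scontrol update NodeName=$node State=POWER_DOWN Reason=auto-down"
        else
            # </dev/null keeps scontrol from consuming the loop's stdin.
            scontrol update NodeName="$node" State=POWER_DOWN Reason="auto-down" </dev/null
        fi
    done
}

# Only touch the cluster when explicitly asked to (hypothetical --run flag).
if [[ "${1:-}" == "--run" ]]; then
    sinfo -h -r -N -o "%N %T" | filter_down_nodes | power_down_nodes
fi
```

Running it as `DRY_RUN=1 ./script.sh --run` would print the scontrol commands instead of executing them, which makes it easier to check what the polling job is going to do.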
To call your script periodically (polling), you can use either cron or a systemd timer.
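For the cron route, a rough sketch of registering the script could look like the following. The script path, the five-minute interval and the --install flag are assumptions for illustration; adjust them to your setup, and install the entry for a user that is allowed to run scontrol update (typically root or the SlurmUser).

```shell
#!/usr/bin/env bash

# Hypothetical location of the polling script; adjust to your setup.
SCRIPT_PATH="/usr/local/sbin/slurm-power-down-nodes.sh"

# Build a crontab line that runs the given script every 5 minutes.
cron_line() {
    printf '*/5 * * * * %s\n' "$1"
}

# With --install, add the line to the current user's crontab.
# Any existing line mentioning the script is dropped first, so
# running this twice does not create duplicate entries.
if [[ "${1:-}" == "--install" ]]; then
    ( crontab -l 2>/dev/null | grep -vF "$SCRIPT_PATH"; cron_line "$SCRIPT_PATH" ) | crontab -
fi
```

A systemd timer gives you the same periodic behaviour, with the run history available in the journal.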