slurm - How to set cloud nodes that become DOWN to POWER_DOWN - Stack Overflow

Given that cloud nodes can easily be destroyed and rebuilt when they hit an issue serious enough for Slurm to set them DOWN, is it possible to trigger a script whenever a node is set to DOWN?

We would like to avoid manual intervention by the cluster users as much as possible, and I feel that automatically resolving DOWN by terminating and restarting the server [1] would be a simple fix for that. We do not really have big issues with failing nodes, but it feels like a good precaution.

I am aware of ReturnToService, but I don't think it fits my case: it allows nodes to return to service without being shut down first, so the issue might still be there because the node is still the same server.

[1] Restarting in this context means: Triggering the SuspendProgram which deletes the server and sets the node to RESUME which then allows Slurm to include it in the scheduling process again and once a job is scheduled onto the node, the ResumeProgram will create the server and install required packages.

Asked Jan 21 at 11:12 by Natan

1 Answer

Unfortunately, there is no direct technique within (or provided by) Slurm. I believe this is intentional: a node can go DOWN for many reasons, and admins know best how to deal with down nodes depending on what caused the issue. In my opinion, this logic is kept separate from Slurm to make life easier for admins.

Nevertheless, you can write your own simple bash script (which runs periodically) to achieve your goals.

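#!/usr/bin/env bash
# Set every node that Slurm reports as DOWN to POWER_DOWN.

# No -r here: it limits sinfo to responding nodes, and failed cloud nodes often do not respond.
# %T prints suffixes such as "down*", so match the prefix; sort -u drops per-partition duplicates.
DOWN_NODES=$(sinfo -h -N -o "%N %T" | awk '$2 ~ /^down/ {print $1}' | sort -u)

for node in $DOWN_NODES; do
    echo "Node $node is DOWN; setting it to POWER_DOWN..."
    scontrol update NodeName="$node" State=POWER_DOWN Reason="auto-down"
done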

This script checks for nodes in the DOWN state and, for each one it finds, calls scontrol to set the node state to POWER_DOWN. If power saving is configured correctly in slurm.conf, this triggers the SuspendProgram script.
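
Whether POWER_DOWN actually reaches the SuspendProgram depends on the power-saving settings in slurm.conf. Here is a minimal sketch for checking them on a running cluster, assuming scontrol is on the PATH. The function name power_saving_settings is made up for this example.

```shell
#!/usr/bin/env bash
# Sketch: list the power-saving settings Slurm is currently running with.

# Filter `scontrol show config` output (read from stdin) down to the
# parameters that control suspending and resuming nodes.
power_saving_settings() {
    grep -E '^(SuspendProgram|ResumeProgram|SuspendTime|SuspendTimeout|ResumeTimeout)[[:space:]]*='
}

# Only query the controller when scontrol is actually installed.
if command -v scontrol >/dev/null 2>&1; then
    scontrol show config | power_saving_settings
fi
```

If SuspendProgram or ResumeProgram is not set to your own scripts in that output, setting a node to POWER_DOWN will most likely not delete the server.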

To periodically call (poll) your script, you can either use cron or systemd timer.
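
As a rough sketch of the cron variant, assume the script is saved as /usr/local/sbin/slurm-power-down.sh and should run every five minutes. The path, the file name and the interval are placeholders. Updating node state requires root or the SlurmUser, so the entry runs as root here.

```text
# /etc/cron.d/slurm-power-down (hypothetical file name)
# Run the polling script every 5 minutes as root.
*/5 * * * * root /usr/local/sbin/slurm-power-down.sh
```

A shorter interval gets DOWN nodes replaced sooner at the cost of more sinfo calls against the controller.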
