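Suspending and Resuming Compute Jobs
Slurm uses cgroups to let a running job be suspended (frozen) and its processes resumed later when needed. A suspended job does not lose the progress it has already made, so this is well suited to taking a compute node offline temporarily: suspend the jobs running on that node, and once the node is back online they continue from where they stopped.
Note: suspending and resuming jobs requires administrator privileges!
Usage
The root user can suspend running jobs with the following command: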
shell
scontrol suspend <jobid_list>
To resume the jobs, run the following command:
shell
scontrol resume <jobid_list>
The jobid_list argument is a list of job IDs. It can be a single job ID or a set of job ID ranges, for example:
shell
scontrol suspend 1 # suspend the job with ID 1
scontrol suspend 1-100 # suspend the jobs with IDs 1 through 100
scontrol suspend 1-20,30-100,101 # suspend the jobs with IDs 1 through 20, 30 through 100, and 101
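To make the list syntax concrete, the sketch below expands a jobid_list into the individual job IDs it covers. The function name `expand_jobid_list` is hypothetical and is not part of Slurm; `scontrol` parses the list itself, so this is only an illustration of how ranges and commas combine, and it assumes a shell that supports `local` (such as bash).

```shell
# Hypothetical helper (not part of Slurm): print every job ID covered by a
# jobid_list such as "1-20,30-100,101", one ID per line.
expand_jobid_list() {
    local list=$1 part start end
    local IFS=,                       # split the list on commas
    for part in $list; do
        case $part in
            *-*)                      # a range such as 30-100
                start=${part%-*}
                end=${part#*-}
                seq "$start" "$end"
                ;;
            *)                        # a single job ID
                printf '%s\n' "$part"
                ;;
        esac
    done
}
```

For example, `expand_jobid_list 1-3,7` prints 1, 2, 3 and 7 on separate lines.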
Example
Submit a LAMMPS stability test with the following commands:
shell
sonmi-run-test-suite select compute-0-[0-1]
sonmi-run-test-suite submit stability lammps
The job is now running:
[sonmi@sonmi lammps]$ sonmictl job info
+------------------------------------------------------------------------------------------+
|                                  LOCAL CLUSTER JOB LIST                                  |
+-------+-----------+------------------+-------+---------+-------+-------+-----------------+
| JOBID | PARTITION | NAME             | USER  | STATE   | NODES | TIME  | NODELIST        |
+-------+-----------+------------------+-------+---------+-------+-------+-----------------+
| 5     | sonmi     | stability-lammps | sonmi | Running | 2     | 8m42s | compute-0-[0-1] |
+-------+-----------+------------------+-------+---------+-------+-------+-----------------+
Suspend (freeze) the job with the following command:
shell
scontrol suspend 5
The job is now suspended:
[sonmi@sonmi lammps]$ sonmictl job info
+-------------------------------------------------------------------------------------------+
|                                  LOCAL CLUSTER JOB LIST                                   |
+-------+-----------+------------------+-------+-----------+-------+------+-----------------+
| JOBID | PARTITION | NAME             | USER  | STATE     | NODES | TIME | NODELIST        |
+-------+-----------+------------------+-------+-----------+-------+------+-----------------+
| 5     | sonmi     | stability-lammps | sonmi | Suspended | 2     | --   | compute-0-[0-1] |
+-------+-----------+------------------+-------+-----------+-------+------+-----------------+
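The same check can be scripted with Slurm's own `squeue` instead of reading the table by eye. The following is a minimal sketch, assuming `squeue` is on the PATH and a shell that supports `local`; the function names `job_state` and `wait_for_state` are hypothetical helpers, not Slurm commands.

```shell
# Hypothetical helper: print the extended state name of one job,
# for example RUNNING or SUSPENDED (squeue format field %T).
job_state() {
    squeue --noheader --jobs="$1" --format=%T
}

# Hypothetical helper: poll once per second until the job reaches the wanted
# state. Returns 0 on success, 1 if the state is not reached within the
# given number of tries (default 30).
wait_for_state() {
    local jobid=$1 want=$2 tries=${3:-30}
    while [ "$tries" -gt 0 ]; do
        [ "$(job_state "$jobid")" = "$want" ] && return 0
        tries=$((tries - 1))
        sleep 1
    done
    return 1
}
```

After `scontrol suspend 5`, a call such as `wait_for_state 5 SUSPENDED` could confirm the state change before a node is taken offline.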
As the root user, resume the job with the following command:
shell
scontrol resume 5
The job is running again:
[sonmi@sonmi lammps]$ sonmictl job info
+-------------------------------------------------------------------------------------------+
|                                  LOCAL CLUSTER JOB LIST                                   |
+-------+-----------+------------------+-------+---------+-------+--------+-----------------+
| JOBID | PARTITION | NAME             | USER  | STATE   | NODES | TIME   | NODELIST        |
+-------+-----------+------------------+-------+---------+-------+--------+-----------------+
| 5     | sonmi     | stability-lammps | sonmi | Running | 2     | 11m59s | compute-0-[0-1] |
+-------+-----------+------------------+-------+---------+-------+--------+-----------------+
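The example above handles a single job. For the node-maintenance case, one possible approach is to collect every running job on a node and suspend them in one call. This is a sketch under stated assumptions: the function names `running_jobs_on_node` and `suspend_node_jobs` are hypothetical, `squeue` and `scontrol` are assumed to be on the PATH, the shell is assumed to support `local`, and the caller is assumed to have administrator privileges.

```shell
# Hypothetical helper: print the IDs of all running jobs on a node as a
# comma-separated list suitable for use as a jobid_list.
running_jobs_on_node() {
    squeue --noheader --nodelist="$1" --states=RUNNING --format=%A |
        sort -un | paste -sd, -
}

# Hypothetical helper: suspend every running job on the node and print the
# list that was suspended, so the same list can be resumed later.
suspend_node_jobs() {
    local ids
    ids=$(running_jobs_on_node "$1")
    if [ -z "$ids" ]; then
        echo "no running jobs on $1" >&2
        return 0
    fi
    scontrol suspend "$ids" && echo "$ids"
}
```

A maintenance script might save the printed list, for instance `ids=$(suspend_node_jobs compute-0-0)`, and pass it to `scontrol resume "$ids"` once the node is back online.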