Compute Job Resource Monitoring
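Starting with version 0.7.2, SonmiHPC provides a module for monitoring the resources that a single job consumes on each of the nodes it runs on. With this module, users can check how the job uses system resources over the course of the computation.
The following example submits a LAMMPS molecular dynamics calculation to show how the feature is used.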
- First, submit a LAMMPS job that spans 3 nodes, then run the following command to view the job queue details.
[sonmi@sonmi lammps]$ sonmictl job info
+-------------------------------------------------------------------------------------+
| LOCAL CLUSTER JOB LIST |
+-------+-----------+--------+-------+---------+-------+------+-----------------------+
| JOBID | PARTITION | NAME | USER | STATE | NODES | TIME | NODELIST |
+-------+-----------+--------+-------+---------+-------+------+-----------------------+
| 18 | sonmi | lammps | sonmi | Running | 3 | 1m0s | compute-0-[0-1],sonmi |
+-------+-----------+--------+-------+---------+-------+------+-----------------------+
- Run the following command to view the job's system resource usage on each node.
[sonmi@sonmi lammps]$ sonmictl job detail 18
+----------------------------------------------------------------+
| JOB RESOURCE USAGE DETAIL: 18 |
+-------------+-------------+--------+------+---------+----------+
| NODE | CPU PERCENT | MEMORY | SWAP | IO READ | IO WRITE |
+-------------+-------------+--------+------+---------+----------+
| sonmi | 719 | 928MB | 0B | 25MB | 0B |
| compute-0-0 | 693 | 958MB | 0B | 288KB | 84KB |
| compute-0-1 | 715 | 921MB | 0B | 39MB | 0B |
+-------------+-------------+--------+------+---------+----------+
| TOTAL | 2127 | 2.7GB | 0B | 65MB | 84KB |
+-------------+-------------+--------+------+---------+----------+
- The following command shows the system resource usage of all jobs in the cluster.
sonmictl job detail
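To follow a job over time rather than at a single moment, the `sonmictl job detail <JOBID>` command shown earlier could be polled at a fixed interval. The sketch below is only an illustration under that assumption: the helper names `fetch_detail` and `sample_job` are hypothetical and not part of SonmiHPC, and the only thing taken from the documentation is the command line itself.

```python
import subprocess
import time


def fetch_detail(job_id):
    """Run `sonmictl job detail <job_id>` and return its text output.

    Assumes `sonmictl` is on PATH and exits with status 0 on success.
    """
    result = subprocess.run(
        ["sonmictl", "job", "detail", str(job_id)],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def sample_job(job_id, samples, interval_s, fetch=fetch_detail, sleep=time.sleep):
    """Collect `samples` snapshots, pausing `interval_s` seconds between them.

    `fetch` and `sleep` are injectable so the loop can be exercised
    without a real cluster. Returns a list of (unix_timestamp, text) pairs.
    """
    snapshots = []
    for i in range(samples):
        snapshots.append((time.time(), fetch(job_id)))
        if i < samples - 1:  # no pause needed after the last snapshot
            sleep(interval_s)
    return snapshots
```

On a node where `sonmictl` is available, `sample_job(18, samples=3, interval_s=10)` would return three timestamped copies of the table for job 18, which can then be written to a log or compared with each other.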
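The columns in the printed output have the following meanings:
- CPU PERCENT: total CPU utilization of all processes/threads of the job on that node (summed across processes/threads, so the value can exceed 100, as in the sample output above)
- MEMORY: total virtual memory occupied by all processes/threads of the job on that node
- SWAP: total swap space used by all processes/threads of the job on that node
- IO READ: total number of bytes read at the storage layer by all processes/threads of the job on that node
- IO WRITE: total number of bytes written at the storage layer by all processes/threads of the job on that node
The final TOTAL row gives the job's total usage of each resource across the entire cluster.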
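The relationship between the per-node rows and the TOTAL row can be checked with a short script. This is a minimal sketch that parses the table text shown above. The function names are hypothetical, and the unit conversion assumes 1024-based units (1KB = 1024B), which the documentation does not state. Because the displayed sizes are rounded, only approximate agreement should be expected for the memory and I/O columns.

```python
import re

# Sample output copied from the `sonmictl job detail 18` example above.
SAMPLE = """
+-------------+-------------+--------+------+---------+----------+
| NODE        | CPU PERCENT | MEMORY | SWAP | IO READ | IO WRITE |
+-------------+-------------+--------+------+---------+----------+
| sonmi       | 719         | 928MB  | 0B   | 25MB    | 0B       |
| compute-0-0 | 693         | 958MB  | 0B   | 288KB   | 84KB     |
| compute-0-1 | 715         | 921MB  | 0B   | 39MB    | 0B       |
+-------------+-------------+--------+------+---------+----------+
| TOTAL       | 2127        | 2.7GB  | 0B   | 65MB    | 84KB     |
+-------------+-------------+--------+------+---------+----------+
"""

# Assumption: units are 1024-based; the documentation does not say.
UNITS = {"B": 1, "KB": 1024, "MB": 1024**2, "GB": 1024**3}
FIELDS = ("cpu_percent", "memory", "swap", "io_read", "io_write")


def to_bytes(text):
    """Convert a size such as '928MB' or '2.7GB' to a number of bytes."""
    match = re.fullmatch(r"([0-9.]+)\s*(B|KB|MB|GB)", text.strip())
    if match is None:
        raise ValueError("unrecognized size: %r" % text)
    return float(match.group(1)) * UNITS[match.group(2)]


def parse_detail(text):
    """Return {row name: {field: value}} for each data row, including TOTAL."""
    rows = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # border lines and blank lines
        cells = [c.strip() for c in line.strip("|").split("|")]
        if len(cells) != 6 or cells[0] == "NODE":
            continue  # title line or column header
        values = [float(cells[1])] + [to_bytes(c) for c in cells[2:]]
        rows[cells[0]] = dict(zip(FIELDS, values))
    return rows


def sum_nodes(rows, field):
    """Sum one field over every row except TOTAL."""
    return sum(v[field] for name, v in rows.items() if name != "TOTAL")


if __name__ == "__main__":
    rows = parse_detail(SAMPLE)
    print("CPU sum:", sum_nodes(rows, "cpu_percent"),
          "reported:", rows["TOTAL"]["cpu_percent"])
    print("Memory sum (GB): %.2f" % (sum_nodes(rows, "memory") / UNITS["GB"]))
```

For the sample table, the CPU PERCENT values of the three nodes (719, 693 and 715) add up to the 2127 shown in the TOTAL row, and the summed memory comes to roughly 2.7GB.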