bmod
fsched (version fsched-10.61 and later) supports changing the resource request of a running job. The parameters that can currently be changed are:
-R "rusage[mem=xxx]"
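As a minimal sketch of how this parameter is passed, the helper below assembles a bmod command line and only prints it, so the command can be reviewed before it is run by hand. The function name build_bmod_cmd is hypothetical and not part of fsched; the command form is the one used in the example in this section.

```shell
# Hypothetical helper, not part of fsched: prints a bmod command line
# without executing it.
build_bmod_cmd() {
    job_id=$1
    mem=$2
    # Accept only non-empty strings of digits for both arguments.
    case $job_id in
        ''|*[!0-9]*) echo "invalid job id: $job_id" >&2; return 1 ;;
    esac
    case $mem in
        ''|*[!0-9]*) echo "invalid mem value: $mem" >&2; return 1 ;;
    esac
    printf 'bmod -R "rusage[mem=%s]" %s\n' "$mem" "$job_id"
}

# Prints: bmod -R "rusage[mem=1000]" 303
build_bmod_cmd 303 1000
```

Printing instead of executing keeps the sketch free of side effects; the printed line can be pasted into a shell on the head node once it looks right.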
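Purpose
Lowering the resource request of a running job releases resources that the job over-requested, so that other jobs pending on the node for lack of resources can start.
Note
- Lowering the request of a running job does not reduce the resources the job actually uses, so the node may become overloaded once other jobs start running on it.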
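Cluster configuration
Set the following option to enable select/cons_tres_ex, the plugin that supports changing the resource request of a running job: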
SelectType=select/cons_tres_ex
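A quick way to confirm the setting is present is to search the configuration file for it. The sketch below is an assumption-laden illustration: the function name has_cons_tres_ex is hypothetical, and because this page does not state where the configuration file lives, the path is passed in by the caller. The demo runs against a temporary file rather than a real cluster configuration.

```shell
# Hypothetical helper: succeeds if FILE contains an uncommented line
# SelectType=select/cons_tres_ex (leading/trailing whitespace allowed).
has_cons_tres_ex() {
    grep -Eq '^[[:space:]]*SelectType=select/cons_tres_ex[[:space:]]*$' "$1"
}

# Demo on a temporary file standing in for the real configuration file.
tmp=$(mktemp)
printf 'SelectType=select/cons_tres_ex\n' > "$tmp"
if has_cons_tres_ex "$tmp"; then
    echo "plugin configured"
else
    echo "plugin missing"
fi
rm -f "$tmp"
```

A commented-out line such as `#SelectType=select/cons_tres_ex` does not match, since the pattern is anchored at the start of the line.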
Example
[root@head-1 ~]# bsub -R "rusage[mem=8000]" sleep 300
Job <303> is submitted to default queue.
[root@head-1 ~]# bsub -R "rusage[mem=8000]" sleep 300
Job <304> is submitted to default queue.
[root@head-1 ~]# bjobs
JOBID USER STAT QUEUE      FROM_HOST EXEC_HOST JOB_NAME  SUBMIT_TIME
303   root RUN  partition- head-1    compute-1 sleep 300 Dec 5 14:52
304   root PEND partition- head-1              sleep 300 Dec 5 14:52
[root@head-1 ~]# bmod -R "rusage[mem=1000]" 303
Parameters of job <303> are being changed
[root@head-1 ~]# bjobs -l
Job <303>, User <root>, Project <*>, Status <RUN>, Queue <partition-9C3RA>, Command <sleep
300>
Dec 5 14:52: Submitted from host <head-1>, CWD </root>, Output File </dev/null>, Error File
</dev/null>, Requested Resources <rusage[mem=1000]>;
Dec 5 14:52: Started 1 Task(s) on Host(s) <compute-1>, Allocated 1 Slot(s) on Host(s)
<compute-1>, Execution Home </root>, Execution CWD </root>
Dec 5 14:55: Resource usage collected.
MEM: 0 Mbytes; NTHREAD: 3
PGID: 16165; PIDS: 16165
PGID: 16172; PIDS: 16172 16174
MEMORY USAGE:
MAX MEM: 0 Mbytes
------------------------------------------------------------------------------
Job <304>, User <root>, Project <*>, Status <RUN>, Queue <partition-9C3RA>, Command <sleep
300>
Dec 5 14:52: Submitted from host <head-1>, CWD </root>, Output File </dev/null>, Error File
</dev/null>, Requested Resources <rusage[mem=8000]>;
Dec 5 14:53: Started 1 Task(s) on Host(s) <compute-1>, Allocated 1 Slot(s) on Host(s)
<compute-1>, Execution Home </root>, Execution CWD </root>
Dec 5 14:55: Resource usage collected.
MEM: 0 Mbytes; NTHREAD: 3
PGID: 16192; PIDS: 16192
PGID: 16199; PIDS: 16199 16201
MEMORY USAGE:
MAX MEM: 0 Mbytes
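Since lowering a request does not lower actual usage, it can help to compare the two before choosing a new value. The following is a rough sketch, assuming `bjobs -l` output has the same layout as the transcript above; the function name mem_report is hypothetical. It reads the text on standard input and prints one line per job with the requested and the reported memory figures. If the rusage string is wrapped across two lines, the requested field is left empty.

```shell
# Hypothetical helper: summarizes "bjobs -l"-style text read from stdin.
mem_report() {
    awk '
        /^Job </ {
            # Job id is the text between the first "<" and ">".
            id = $0
            sub(/^Job </, "", id)
            sub(/>.*/, "", id)
        }
        match($0, /rusage\[mem=[0-9]+\]/) {
            # Skip the 11 characters of "rusage[mem=" and drop the final "]".
            req[id] = substr($0, RSTART + 11, RLENGTH - 12)
        }
        /^[ \t]*MEM: [0-9]+ Mbytes/ {
            # Current usage line; "MAX MEM:" lines do not match this pattern.
            used = $0
            sub(/^[ \t]*MEM: /, "", used)
            sub(/ .*/, "", used)
            printf "%s requested=%s used=%s\n", id, req[id], used
        }
    '
}

# Demo on an excerpt of the transcript above.
# Prints: 303 requested=1000 used=0
mem_report <<'EOF'
Job <303>, User <root>, Project <*>, Status <RUN>, Queue <partition-9C3RA>, Command <sleep
300>
</dev/null>, Requested Resources <rusage[mem=1000]>;
MEM: 0 Mbytes; NTHREAD: 3
MAX MEM: 0 Mbytes
EOF
```

On a live cluster the same function could be fed with `bjobs -l | mem_report`.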