Elastic and Fault-Tolerant Training with the CCE AI Training Operator
Last updated: 2022-12-01
This guide shows how to use the AI Training Operator together with the Horovod training framework on CCE to add elasticity and fault tolerance to distributed training jobs.
Model training is a central stage of deep learning; training jobs for complex models run for a long time and demand large amounts of compute. In a traditional distributed deep-learning job, the number of Workers cannot be adjusted once the job has been submitted. Elastic training adds the ability to change the number of Workers of a running training job dynamically. Fault tolerance, in turn, ensures that when a Worker fails (for example, its Pod is evicted after a node failure), a new node is scheduled for the failed Worker so the task continues, instead of one faulty Worker aborting the whole training job.
Prerequisites
- The AI Training Operator component is installed in CCE.
- Horovod or PaddlePaddle is used as the distributed training framework.
Installing the Component
- Install the AI Training Operator component from the CCE console.
- Check CCE Training and confirm the installation.
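After installation, a quick way to confirm from the command line that the operator's custom resource type is registered; the exact CRD name is an assumption inferred from the apiVersion and kind used in the job manifest below:

# Look for the AITrainingJob custom resource (name is an assumption):
kubectl api-resources | grep -i aitrainingjob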
Submitting a Job
In the CCE cluster console, go to Cloud Native AI → Job Management to submit a job, and select AITrainingJob as the framework. To make the job fault-tolerant, check the fault-tolerance option (elastic training also requires fault tolerance to be enabled).
The generated YAML template for an elastic, fault-tolerant training job:
apiVersion: kongming.cce.baidubce.com/v1
kind: AITrainingJob
metadata:
  name: test-horovod-elastic
  namespace: default
spec:
  cleanPodPolicy: None
  completePolicy: Any
  failPolicy: Any
  frameworkType: horovod
  faultTolerant: true
  plugin:
    ssh:
    - ""
    discovery:
    - ""
  priority: normal
  replicaSpecs:
    launcher:
      completePolicy: All
      failPolicy: Any
      faultTolerantPolicy:
      - exitCodes: "129,101"
        restartPolicy: ExitCode
        restartScope: Pod
      - exceptionalEvent: nodeNotReady
        restartPolicy: OnNodeFail
        restartScope: Pod
      maxReplicas: 1
      minReplicas: 1
      replicaType: master
      replicas: 1
      restartLimit: 100
      restartPolicy: OnNodeFailWithExitCode
      restartScope: Pod
      restartTimeLimit: 60
      restartTimeout: 864000
      template:
        metadata:
          creationTimestamp: null
        spec:
          initContainers:
          - args:
            - --barrier_roles=trainer
            - --incluster
            - --name=$(TRAININGJOB_NAME)
            - --namespace=$(TRAININGJOB_NAMESPACE)
            - --dns_check_svc=kube-dns
            image: registry.baidubce.com/cce-plugin-dev/jobbarrier:v0.9-1
            imagePullPolicy: IfNotPresent
            name: job-barrier
            terminationMessagePath: /dev/termination-log
            terminationMessagePolicy: File
          restartPolicy: Never
          schedulerName: volcano
          securityContext: {}
          containers:
          - command:
            - /bin/bash
            - -c
            - export HOROVOD_GLOO_TIMEOUT_SECONDS=300 && horovodrun -np 3 --min-np=1 --max-np=5 --verbose --log-level=DEBUG --host-discovery-script /etc/edl/discover_hosts.sh python /horovod/examples/elastic/pytorch/pytorch_synthetic_benchmark_elastic.py --num-iters=1000
            image: registry.baidubce.com/cce-plugin-dev/horovod:master-0.2.0
            imagePullPolicy: Always
            name: aitj-0
            resources:
              limits:
                cpu: "1"
                memory: 1Gi
              requests:
                cpu: "1"
                memory: 1Gi
            volumeMounts:
            - mountPath: /dev/shm
              name: cache-volume
          dnsPolicy: ClusterFirstWithHostNet
          terminationGracePeriodSeconds: 30
          volumes:
          - emptyDir:
              medium: Memory
              sizeLimit: 1Gi
            name: cache-volume
    trainer:
      completePolicy: None
      failPolicy: None
      faultTolerantPolicy:
      - exceptionalEvent: "nodeNotReady,PodForceDeleted"
        restartPolicy: OnNodeFail
        restartScope: Pod
      maxReplicas: 5
      minReplicas: 1
      replicaType: worker
      replicas: 3
      restartLimit: 100
      restartPolicy: OnNodeFailWithExitCode
      restartScope: Pod
      restartTimeLimit: 60
      restartTimeout: 864000
      template:
        metadata:
          creationTimestamp: null
        spec:
          containers:
          - command:
            - /bin/bash
            - -c
            - /usr/sbin/sshd && sleep infinity
            image: registry.baidubce.com/cce-plugin-dev/horovod:master-0.2.0
            imagePullPolicy: Always
            name: aitj-0
            env:
            - name: NVIDIA_DISABLE_REQUIRE
              value: "true"
            - name: NVIDIA_VISIBLE_DEVICES
              value: "all"
            - name: NVIDIA_DRIVER_CAPABILITIES
              value: "all"
            resources:
              limits:
                baidu.com/v100_32g_cgpu: "1"
                baidu.com/v100_32g_cgpu_core: "20"
                baidu.com/v100_32g_cgpu_memory: "4"
              requests:
                baidu.com/v100_32g_cgpu: "1"
                baidu.com/v100_32g_cgpu_core: "20"
                baidu.com/v100_32g_cgpu_memory: "4"
            volumeMounts:
            - mountPath: /dev/shm
              name: cache-volume
          dnsPolicy: ClusterFirstWithHostNet
          terminationGracePeriodSeconds: 300
          volumes:
          - emptyDir:
              medium: Memory
              sizeLimit: 1Gi
            name: cache-volume
          schedulerName: volcano
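In this manifest, spec.replicaSpecs.trainer.minReplicas/maxReplicas bound the elastic Worker count, and the --min-np/--max-np flags passed to horovodrun in the launcher command match those bounds. Assuming the manifest is saved as test-horovod-elastic.yaml, it can be submitted with kubectl; the aitrainingjob resource name is the lowercase form of the CRD kind above:

kubectl apply -f test-horovod-elastic.yaml
kubectl get pods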
After the job is submitted with 3 Workers specified, the Pods start up:
NAME READY STATUS RESTARTS AGE
test-horovod-elastic-launcher-vwvb8-0 0/1 Init:0/1 0 6s
test-horovod-elastic-trainer-q7gmp-0 1/1 Running 0 7s
test-horovod-elastic-trainer-spkb8-1 1/1 Running 0 7s
test-horovod-elastic-trainer-sxf6s-2 1/1 Running 0 7s
Elastic Scaling
For a running training job, you can dynamically change the number of Workers in the CCE console and specify a scale-out timeout. Alternatively, edit the CR's YAML directly in the cluster and set spec.replicaSpecs.trainer.replicas to the desired number of Workers, as in the sketch below.
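A minimal sketch of scaling out to 4 Workers with kubectl patch, assuming aitrainingjob resolves to the AITrainingJob CRD:

# Raise the desired trainer replica count from 3 to 4:
kubectl patch aitrainingjob test-horovod-elastic --type=merge \
  -p '{"spec":{"replicaSpecs":{"trainer":{"replicas":4}}}}'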
Corresponding scale-out events are then visible, and a newly created Worker Pod joins the running job.
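The scaling progress is recorded in the job's status conditions, which can be inspected with, for example:

kubectl get aitrainingjob test-horovod-elastic -o yaml
kubectl describe aitrainingjob test-horovod-elastic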
status:
  RestartCount:
    trainer: 0
  conditions:
  - lastProbeTime: "2022-01-14T09:01:52Z"
    lastTransitionTime: "2022-01-14T09:01:52Z"
    message: all pods are waiting for scheduling
    reason: TrainingJobPending
    status: "False"
    type: Pending
  - lastProbeTime: "2022-01-14T09:01:53Z"
    lastTransitionTime: "2022-01-14T09:01:53Z"
    message: pods [test-horovod-elastic-launcher-vk9c2-0] creating containers
    reason: TrainingJobCreating
    status: "False"
    type: Creating
  - lastProbeTime: "2022-01-14T09:02:27Z"
    lastTransitionTime: "2022-01-14T09:02:27Z"
    message: all pods are running
    reason: TrainingJobRunning
    status: "False"
    type: Running
  - lastProbeTime: "2022-01-14T09:06:16Z"
    lastTransitionTime: "2022-01-14T09:06:16Z"
    message: trainingJob default/test-horovod-elastic scaleout Operation scaleout
      scale num 1 scale pods [test-horovod-elastic-trainer-vdkk6-3], replicas name
      trainer job version 1
    status: "False"
    type: Scaling
  - lastProbeTime: "2022-01-14T09:06:20Z"
    lastTransitionTime: "2022-01-14T09:06:20Z"
    message: all pods are running
    reason: TrainingJobRunning
    status: "True"
    type: Running
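The new Worker Pod (index 3) has joined the running set: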
NAME READY STATUS RESTARTS AGE
test-horovod-elastic-launcher-vk9c2-0 1/1 Running 0 7m4s
test-horovod-elastic-trainer-4zzk4-0 1/1 Running 0 7m5s
test-horovod-elastic-trainer-b5rc2-2 1/1 Running 0 7m5s
test-horovod-elastic-trainer-kdjq2-1 1/1 Running 0 7m5s
test-horovod-elastic-trainer-vdkk6-3 1/1 Running 0 2m40s
Fault Tolerance
When a training job is created in CCE with fault tolerance enabled, the fault-tolerance strategy is written into the faultTolerantPolicy field of the submitted YAML, for example:
faultTolerantPolicy:
- exceptionalEvent: nodeNotReady,PodForceDeleted
  restartPolicy: OnNodeFail
  restartScope: Pod
When a Pod exits abnormally with one of the specified exit codes, is evicted because its node becomes NotReady, or is force-deleted, the Operator automatically starts a new training Pod to replace the failed one so the training job can continue.
For example, after a Pod is force-deleted, a new Pod is eventually created to restore the original 4 training instances:
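To reproduce this, force-delete one of the trainer Pods (the flags below skip graceful termination; the Pod name matches the listing that follows):

kubectl delete pod test-horovod-elastic-trainer-4zzk4-0 --grace-period=0 --force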
➜ kubectl get pods -w
NAME READY STATUS RESTARTS AGE
test-horovod-elastic-launcher-vk9c2-0 1/1 Running 0 7m59s
test-horovod-elastic-trainer-4zzk4-0 1/1 Terminating 0 8m
test-horovod-elastic-trainer-b5rc2-2 1/1 Running 0 8m
test-horovod-elastic-trainer-kdjq2-1 1/1 Running 0 8m
test-horovod-elastic-trainer-vdkk6-3 1/1 Running 0 3m35s
test-horovod-elastic-trainer-4zzk4-0 0/1 Terminating 0 8m7s
test-horovod-elastic-trainer-4zzk4-0 0/1 Terminating 0 8m8s
test-horovod-elastic-trainer-4zzk4-0 0/1 Terminating 0 8m8s
test-horovod-elastic-trainer-htbz4-0 0/1 Pending 0 0s
test-horovod-elastic-trainer-htbz4-0 0/1 Pending 0 1s
test-horovod-elastic-trainer-htbz4-0 0/1 Pending 0 1s
test-horovod-elastic-trainer-htbz4-0 0/1 Pending 0 1s
test-horovod-elastic-trainer-htbz4-0 0/1 ContainerCreating 0 1s
test-horovod-elastic-trainer-htbz4-0 1/1 Running 0 3s