Developing Task Templates
This document explains how to develop your own business task templates on top of distributed training jobs.
Before orchestrating a workflow, develop each task on its own and validate it by submitting it as a distributed training job. Once a task runs cleanly, package it as a task template so it can be reused in workflows and accumulated as an AI asset.
Task template development
A workflow CustomTask corresponds to a Baige distributed training job, so the parameters supported by a Baige CustomTask template are exactly those of a distributed training job.
Run the task as a distributed training job
- After the task code is ready, open the Baige console, click Distributed Training in the left sidebar, then click Create Job and fill in the form to validate the task. For detailed steps, see Create a Training Job.
- Once the job runs successfully, organize its parameter information into a workflow task base template, i.e. express the form fields filled in at creation time as API parameters. The job parameters can be viewed on the job details page.
- The templates below use a data-processing scenario for demonstration.
- For a complete parameter template, see Task Template Examples / Fixed-parameter task template.
```yaml
version: v1
kind: PipelineTemplate
taskTemplates:
  - name: custom-task-demo
    type: CustomTask
    spec:
      queue: aihcq-h1plvpzb5gh0
      jobType: PyTorchJob
      command: sleep 5s
      priority: normal
      jobSpec:
        image: registry.baidubce.com/aihc-aiak/aiak-megatron:ubuntu20.04-cu11.8-torch1.14.0-py38_v1.2.7.12_release
        replicas: 1
        resources: []
        envs:
          - name: SOURCE_DATA_DIR
            value: bos://my-bucket/datasets/source
          - name: TARGET_DATA_DIR
            value: bos://my-bucket/datasets/target
        datasources:
          - type: pfs
            name: pfs-pxE6jz
            sourcePath: /
            mountPath: /mnt/cluster
          - type: bos
            name: ""
            sourcePath: bos://my-bucket/datasets
            mountPath: /mnt/bos/data
tasks:
  - name: task_a
    taskTemplateName: custom-task-demo
```
Design the template parameters
Expose the task parameters that callers should be able to supply as input parameters. Recommended design principles:
- If different tasks need different amounts of resources, extract the resource parameters as inputs
- Pass all task logic-control parameters in through envs
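As a sketch of the first principle, a resource quantity could be promoted to a template input like this (the `gpu_count` input name and the resource name are illustrative, reusing the schema shown elsewhere in this document):

```yaml
# Hypothetical fragment: promoting a resource quantity to a template input
inputs:
  - name: gpu_count                # illustrative parameter name
    type: string
    hint: number of GPUs per replica
spec:
  jobSpec:
    resources:
      - name: baidu.com/a800_80g_cgpu
        quantity: '{{inputs.parameters.gpu_count}}'
```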
In the example above, we turn the data directories defined as environment variables (SOURCE_DATA_DIR, TARGET_DATA_DIR), the execution command (command), and the BOS mount path into template inputs:
```yaml
version: v1
kind: PipelineTemplate
taskTemplates:
  - name: custom-task-demo
    type: CustomTask
    inputs:
      - name: command                  # parameter name
        type: string                   # parameter type
        hint: execution command        # parameter description
      - name: source_data_dir
        type: string
        hint: source data directory
      - name: target_data_dir
        type: string
        hint: target data directory
      - name: bos_dir
        type: string
        hint: BOS storage path
    spec:
      queue: aihcq-h1plvpzb5gh0
      jobType: PyTorchJob
      command: '{{inputs.parameters.command}}'
      jobSpec:
        replicas: 1
        image: registry.baidubce.com/aihc-aiak/aiak-megatron:ubuntu20.04-cu11.8-torch1.14.0-py38_v1.2.7.12_release
        resources: []
        envs:
          - name: SOURCE_DATA_DIR
            value: '{{inputs.parameters.source_data_dir}}'
          - name: TARGET_DATA_DIR
            value: '{{inputs.parameters.target_data_dir}}'
        labels: []
        datasources:
          - type: bos
            name: ""
            sourcePath: '{{inputs.parameters.bos_dir}}'
            mountPath: /mnt/bos/data
tasks:
  - name: task_a
    taskTemplateName: custom-task-demo
    inputs:
      - name: command
        value: echo "This is a defined command."
      - name: source_data_dir
        value: bos://my-bucket/datasets/source
      - name: target_data_dir
        value: bos://my-bucket/datasets/target
      - name: bos_dir
        value: bos://my-bucket/datasets
```
Use the task template in a workflow
The following example configures task_a to print `This is job_a.` and task_b to print `This is job_b.`
```yaml
version: v1
kind: PipelineTemplate
taskTemplates:
  - name: custom-task-demo
    type: CustomTask
    inputs:
      - name: command                  # parameter name
        type: string                   # parameter type
        hint: execution command        # parameter description
      - name: source_data_dir
        type: string
        hint: source data directory
      - name: target_data_dir
        type: string
        hint: target data directory
      - name: bos_dir
        type: string
        hint: BOS storage path
    spec:
      queue: aihcq-h1plvpzb5gh0
      jobType: PyTorchJob
      command: '{{inputs.parameters.command}}'
      jobSpec:
        replicas: 1
        image: registry.baidubce.com/aihc-aiak/aiak-megatron:ubuntu20.04-cu11.8-torch1.14.0-py38_v1.2.7.12_release
        resources: []
        envs:
          - name: SOURCE_DATA_DIR
            value: '{{inputs.parameters.source_data_dir}}'
          - name: TARGET_DATA_DIR
            value: '{{inputs.parameters.target_data_dir}}'
        labels: []
        datasources:
          - type: bos
            name: ""
            sourcePath: '{{inputs.parameters.bos_dir}}'
            mountPath: /mnt/bos/data
tasks:
  - name: task_a
    taskTemplateName: custom-task-demo
    inputs:
      - name: command
        value: echo "This is job_a."
      - name: source_data_dir
        value: bos://my-bucket/datasets/source/a
      - name: target_data_dir
        value: bos://my-bucket/datasets/target/a
      - name: bos_dir
        value: bos://my-bucket/datasets/a
  - name: task_b
    taskTemplateName: custom-task-demo
    inputs:
      - name: command
        value: echo "This is job_b."
      - name: source_data_dir
        value: bos://my-bucket/datasets/source/b
      - name: target_data_dir
        value: bos://my-bucket/datasets/target/b
      - name: bos_dir
        value: bos://my-bucket/datasets/b
    dependencies:
      - task_a
```
The sections above walk through developing a simple task template. For more advanced task parameters, continue with the example templates below.
Task template examples
The following general-purpose templates are provided for reference; in practice, trim the parameters to match your task's needs.
General
General-purpose task template
Parameter description:
| Parameter | Description |
|---|---|
| command | Execution command of the task |
| source_data_dir | Directory of the source dataset |
| target_data_dir | Directory where the dataset is saved |
| bos_dir | Mounted BOS directory |
```yaml
version: v1
kind: PipelineTemplate
taskTemplates:
  - name: custom-task-demo
    type: CustomTask
    inputs:
      - name: command
        type: string
        hint: execution command
      - name: source_data_dir
        type: string
        hint: source data directory
      - name: target_data_dir
        type: string
        hint: target data directory
      - name: bos_dir
        type: string
        hint: mounted BOS directory
    spec:
      queue: aihcq-h1plvpzb5gh0
      jobType: PyTorchJob
      command: '{{inputs.parameters.command}}'
      jobSpec:
        replicas: 1
        image: >-
          registry.baidubce.com/aihc-aiak/aiak-megatron:ubuntu20.04-cu11.8-torch1.14.0-py38_v1.2.7.12_release
        resources: []
        envs:
          - name: SOURCE_DATA_DIR
            value: '{{inputs.parameters.source_data_dir}}'
          - name: TARGET_DATA_DIR
            value: '{{inputs.parameters.target_data_dir}}'
        labels: []
        datasources:
          - type: bos
            name: ''
            sourcePath: '{{inputs.parameters.bos_dir}}'
            mountPath: /mnt/bos
tasks:
  - name: task_a
    taskTemplateName: custom-task-demo
    inputs:
      - name: command
        value: echo "This is job_a."
      - name: source_data_dir
        value: bos://my-bucket/datasets/source
      - name: target_data_dir
        value: bos://my-bucket/datasets/target
      - name: bos_dir
        value: bos://my-bucket/datasets
```
Distributed training task template
Below is the complete set of distributed training job parameters; for details, see the OpenAPI reference for distributed training jobs.
Developers can build their own training, data-processing, and other business task templates from this full-parameter template.
```yaml
version: v1
kind: PipelineTemplate
taskTemplates:
  - name: custom-task-template
    type: CustomTask
    spec:
      queue: aihcq-xxxxx
      jobType: PyTorchJob
      command: sleep 30s
      priority: normal
      enableBccl: false
      faultTolerance: true
      faultToleranceArgs: >-
        --enable-replace=true --enable-hang-detection=true
        --hang-detection-log-timeout-minutes=7
        --hang-detection-startup-toleration-minutes=15
        --hang-detection-stack-timeout-minutes=3
        --max-num-of-unconditional-retry=2
        --custom-log-patterns=timeout1 --custom-log-patterns=timeout2
      retentionPeriod: 1d
      jobSpec:
        image: registry.baidubce.com/aihc-aiak/aiak-megatron:ubuntu20.04-cu11.8-torch1.14.0-py38_v1.2.7.12_release
        imageConfig:
          username: your-registry-username
          password: your-registry-password
        replicas: 2
        resources:
          - name: baidu.com/a800_80g_cgpu
            quantity: 8
          - name: cpu
            quantity: 96
          - name: memory
            quantity: 512
          - name: sharedMemory
            quantity: 64
        envs:
          - name: NCCL_DEBUG
            value: INFO
          - name: NCCL_IB_DISABLE
            value: "0"
          - name: CUDA_VISIBLE_DEVICES
            value: 0,1,2,3,4,5,6,7
        enableRDMA: true
        hostNetwork: false
        labels:
          - key: project
            value: llm-training
          - key: team
            value: ai-platform
        datasources:
          - type: pfs
            name: pfs-pxE6jz
            sourcePath: /
            mountPath: /mnt/cluster
            options:
              readOnly: false
          - type: hostPath
            name: host-data
            sourcePath: /data/shared
            mountPath: /mnt/host-data
            options:
              readOnly: true
          - type: bos
            name: ""
            sourcePath: bos://my-bucket/datasets/
            mountPath: /mnt/bos-data
            options:
              readOnly: true
          - type: cfs
            name: cfs-instance-id
            sourcePath: /
            mountPath: /mnt/cfs-data
            options:
              readOnly: false
          - type: rapidfs
            name: rapidfs-instance-id
            sourcePath: /
            mountPath: /mnt/rapidfs-data
            options:
              readOnly: false
          - type: dataset
            name: dataset-id
            sourcePath: /
            mountPath: /mnt/dataset
            options:
              readOnly: true
        tensorboardConfig:
          enable: true
          logPath: /mnt/cluster/tensorboard-logs
        alertConfig:
          instanceId: your-cluster-monitor-instance-id
          alertItems:
            - jobRunning
            - jobFT
            - nodeFT
            - jobFailed
            - jobSucceed
            - jobHang
          for: 0m
          notifyRuleId: notify-xxxxxxxx
tasks:
  - name: job-3
    taskTemplateName: custom-task-template
  - name: job-2
    taskTemplateName: custom-task-template
```
In practice, a hybrid of fixed parameters plus custom inputs lets you freeze common data-processing, training, and evaluation jobs into task templates that are reused across different workflows.
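A sketch of that hybrid style, following the schema used above (the `eval-task` template name is illustrative): fields such as the queue and image stay hard-coded, while only the command varies per workflow:

```yaml
# Illustrative hybrid template: fixed queue/image, only `command` is an input
taskTemplates:
  - name: eval-task                    # illustrative template name
    type: CustomTask
    inputs:
      - name: command
        type: string
        hint: evaluation command
    spec:
      queue: aihcq-h1plvpzb5gh0        # fixed for every workflow
      jobType: PyTorchJob              # fixed
      command: '{{inputs.parameters.command}}'   # supplied per workflow
      jobSpec:
        replicas: 1
        image: registry.baidubce.com/aihcp-public/pytorch:2.7.0-cu12.8.61-py3.12-ubuntu24.04  # fixed
```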
Dataset download
Download a dataset from ModelScope
Parameter description:
| Parameter | Description |
|---|---|
| queue_id | Queue ID, e.g. aihcq-h1plvp |
| dataset_name | Dataset name on ModelScope, e.g. liucong/Chinese-DeepSeek-R1-Distill-data-110k |
| target_data_dir | Directory where the dataset is saved, e.g. bos://my-bucket/datasets |
YAML template:
```yaml
version: v1
kind: PipelineTemplate

taskTemplates:
  - name: template-dataset-download-modelscope
    type: CustomTask
    inputs:
      - name: queue_id
        type: string
        hint: queue ID
      - name: dataset_name
        type: string
        hint: dataset name
      - name: target_data_dir
        type: string
        hint: BOS directory for storing the dataset
    spec:
      queue: '{{inputs.parameters.queue_id}}'
      jobType: PyTorchJob
      command: |
        #!/bin/sh

        # Check whether modelscope is installed
        if ! command -v modelscope >/dev/null 2>&1; then
          echo "modelscope is not installed; installing..."
          pip install --user modelscope
          export PATH="$HOME/.local/bin:$PATH"
        fi

        # Download the dataset with the modelscope CLI
        echo "Downloading dataset $DATASET_NAME..."
        modelscope download --dataset "$DATASET_NAME" --revision master --local_dir /mnt/bos/data

        if [ $? -eq 0 ]; then
          echo "Dataset downloaded to /mnt/bos/data"
        else
          echo "Download failed. Check the network, permissions, or whether the dataset exists."
          exit 1
        fi
      jobSpec:
        image: registry.baidubce.com/aihcp-public/pytorch:2.7.0-cu12.8.61-py3.12-ubuntu24.04
        replicas: 1
        envs:
          - name: DATASET_NAME
            value: '{{inputs.parameters.dataset_name}}'
        datasources:
          - type: bos
            name: ""
            sourcePath: '{{inputs.parameters.target_data_dir}}'
            mountPath: /mnt/bos/data

tasks:
  - name: job-1
    taskTemplateName: template-dataset-download-modelscope
    inputs:
      - name: queue_id
        value: aihcq-h1plvp
      - name: dataset_name
        value: liucong/Chinese-DeepSeek-R1-Distill-data-110k
      - name: target_data_dir
        value: bos://my-bucket/datasets
```
Download a dataset from Hugging Face
To download datasets from Hugging Face, you must ensure the cluster can reach Hugging Face.
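If the cluster can only reach Hugging Face through a mirror, the `HF_ENDPOINT` environment variable recognized by huggingface_hub can be added to the template's envs; the mirror URL below is a placeholder, not a recommendation:

```yaml
# Optional fragment: point huggingface_hub at a mirror endpoint
envs:
  - name: DATASET_NAME
    value: '{{inputs.parameters.dataset_name}}'
  - name: HF_ENDPOINT                              # read by huggingface_hub
    value: https://your-hf-mirror.example.com      # placeholder mirror URL
```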
Parameter description:
| Parameter | Description |
|---|---|
| queue_id | Queue ID, e.g. aihcq-h1plvp |
| dataset_name | Dataset name on Hugging Face, e.g. liucong/Chinese-DeepSeek-R1-Distill-data-110k |
| target_data_dir | Directory where the dataset is saved, e.g. bos://my-bucket/datasets |
YAML template:
```yaml
version: v1
kind: PipelineTemplate

taskTemplates:
  - name: template-dataset-download-huggingface
    type: CustomTask
    inputs:
      - name: queue_id
        type: string
        hint: queue ID
      - name: dataset_name
        type: string
        hint: dataset name
      - name: target_data_dir
        type: string
        hint: BOS directory for storing the dataset
    spec:
      queue: '{{inputs.parameters.queue_id}}'
      jobType: PyTorchJob
      command: |
        #!/bin/sh

        # Build the local target path
        LOCAL_DIR="/mnt/bos/data/$DATASET_NAME"
        mkdir -p "$LOCAL_DIR"

        # Skip if already downloaded (simple marker-file check)
        if [ -f "$LOCAL_DIR/.hf_download_complete" ]; then
          echo "Dataset appears to be downloaded already (.hf_download_complete marker found); skipping."
          exit 0
        fi

        # Install huggingface_hub if it is missing
        if ! python3 -c "import huggingface_hub" >/dev/null 2>&1; then
          echo "Installing huggingface_hub..."
          pip install --quiet huggingface_hub
        fi

        # Build a temporary Python script
        PY_SCRIPT=$(cat <<EOF
        from huggingface_hub import snapshot_download
        import os

        local_dir = os.environ['LOCAL_DIR']
        repo_id = os.environ['DATASET_NAME']

        print(f"Downloading dataset from Hugging Face: {repo_id}")
        snapshot_download(
            repo_id=repo_id,
            local_dir=local_dir,
            repo_type="dataset",
            max_workers=8
        )
        print("Download complete.")
        EOF
        )

        # Export environment variables for the Python script
        export LOCAL_DIR="$LOCAL_DIR"
        export DATASET_NAME="$DATASET_NAME"

        # Run the download
        echo "Starting download from the Hugging Face Hub..."
        python3 -c "$PY_SCRIPT"

        # On success, create the marker file
        if [ $? -eq 0 ]; then
          touch "$LOCAL_DIR/.hf_download_complete"
          echo "Dataset saved to: $LOCAL_DIR"
        else
          echo "Download failed. Check the dataset ID, network, or access permissions."
          exit 1
        fi
      jobSpec:
        image: registry.baidubce.com/aihcp-public/pytorch:2.7.0-cu12.8.61-py3.12-ubuntu24.04
        replicas: 1
        envs:
          - name: DATASET_NAME
            value: '{{inputs.parameters.dataset_name}}'
        datasources:
          - type: bos
            name: ""
            sourcePath: '{{inputs.parameters.target_data_dir}}'
            mountPath: /mnt/bos/data

tasks:
  - name: job-1
    taskTemplateName: template-dataset-download-huggingface
    inputs:
      - name: queue_id
        value: aihcq-h1plvp
      - name: dataset_name
        value: liucong/Chinese-DeepSeek-R1-Distill-data-110k
      - name: target_data_dir
        value: bos://my-bucket/datasets
```
Data transfer
Pull data from BOS to PFS
Useful for dynamically loading cold data from BOS onto PFS for training, data processing, and so on.
Parameter description:
| Parameter | Description |
|---|---|
| queue_id | Queue ID, e.g. aihcq-h1plvp |
| bos_path | BOS source path, e.g. bos://my-bucket/datasets/ |
| bos_ak | Access Key (AK) for BOS |
| bos_sk | Secret Key (SK) for BOS |
| pfs_path | PFS source path, e.g. /datasets/my-dataset |
| pfs_id | PFS instance ID |
```yaml
version: v1
kind: PipelineTemplate
taskTemplates:
  - name: bos-to-pfs
    type: CustomTask
    inputs:
      - name: queue_id
        type: string
        hint: queue ID
      - name: bos_path
        type: string
        hint: BOS path of the dataset
      - name: bos_ak
        type: string
        hint: access key (AK) for the BOS bucket
      - name: bos_sk
        type: string
        hint: secret key (SK) for the BOS bucket
      - name: pfs_path
        type: string
        hint: PFS source path
      - name: pfs_id
        type: string
        hint: PFS instance ID
    spec:
      queue: '{{inputs.parameters.queue_id}}'
      jobType: PyTorchJob
      command: |
        WORK_DIR="$(pwd)"
        ZIP_URL="https://doc.bce.baidu.com/bos-optimization/mac-bcecmd-0.5.10.zip"
        ZIP_FILE="mac-bcecmd-0.5.10.zip"
        EXTRACT_DIR="mac-bcecmd-0.5.10"
        echo "Downloading the bcecmd tool..."
        curl -LO "$ZIP_URL"
        if [ $? -ne 0 ]; then
          echo "Download failed. Check the network or the URL."
          exit 1
        fi
        echo "Extracting..."
        unzip -o "$ZIP_FILE"
        if [ $? -ne 0 ]; then
          echo "Extraction failed."
          exit 1
        fi
        cd "$EXTRACT_DIR" || { echo "Cannot enter directory $EXTRACT_DIR"; exit 1; }
        echo "Creating the credentials file..."
        mkdir -p /root/.go-bcecli
        cat > ~/.go-bcecli/credentials <<EOF
        [Defaults]
        Ak = "$BOS_AK"
        Sk = "$BOS_SK"
        EOF
        chmod 600 ~/.go-bcecli/credentials
        echo "bcecmd installed and configured; starting the data download"
        ./bcecmd bos sync {{inputs.parameters.bos_path}} /mnt/cluster/dataset
        echo "Dataset downloaded and saved to the {{inputs.parameters.pfs_path}} path on PFS"
      jobSpec:
        image: registry.baidubce.com/inference/aibox-ubuntu:v2.0-22.04
        replicas: 1
        envs:
          - name: BOS_AK
            value: '{{inputs.parameters.bos_ak}}'
          - name: BOS_SK
            value: '{{inputs.parameters.bos_sk}}'
        datasources:
          - type: pfs
            name: '{{inputs.parameters.pfs_id}}'
            sourcePath: '{{inputs.parameters.pfs_path}}'
            mountPath: /mnt/cluster/dataset
tasks:
  - name: sync-bos-to-pfs
    taskTemplateName: bos-to-pfs
    inputs:
      - name: queue_id
        value: aihcq-h1plvpzb5gh0
      - name: bos_path
        value: bos://my-bucket/datasets/
      - name: bos_ak
        value: <your-ak>
      - name: bos_sk
        value: <your-sk>
      - name: pfs_path
        value: /datasets/my-dataset
      - name: pfs_id
        value: pfs-xxxx
```
Back up PFS data to BOS
Useful for dynamically backing up data from PFS to BOS storage, freeing up PFS storage space.
Parameter description:
| Parameter | Description |
|---|---|
| queue_id | Queue ID, e.g. aihcq-h1plvp |
| bos_path | BOS source path, e.g. bos://my-bucket/datasets/ |
| bos_ak | Access Key (AK) for BOS |
| bos_sk | Secret Key (SK) for BOS |
| pfs_path | PFS source path, e.g. /datasets/my-dataset |
| pfs_id | PFS instance ID |
```yaml
version: v1
kind: PipelineTemplate
taskTemplates:
  - name: pfs-to-bos
    type: CustomTask
    inputs:
      - name: queue_id
        type: string
        hint: queue ID
      - name: bos_path
        type: string
        hint: BOS path of the dataset
      - name: bos_ak
        type: string
        hint: access key (AK) for the BOS bucket
      - name: bos_sk
        type: string
        hint: secret key (SK) for the BOS bucket
      - name: pfs_path
        type: string
        hint: PFS source path
      - name: pfs_id
        type: string
        hint: PFS instance ID
    spec:
      queue: '{{inputs.parameters.queue_id}}'
      jobType: PyTorchJob
      command: |
        WORK_DIR="$(pwd)"
        ZIP_URL="https://doc.bce.baidu.com/bos-optimization/mac-bcecmd-0.5.10.zip"
        ZIP_FILE="mac-bcecmd-0.5.10.zip"
        EXTRACT_DIR="mac-bcecmd-0.5.10"
        echo "Downloading the bcecmd tool..."
        curl -LO "$ZIP_URL"
        if [ $? -ne 0 ]; then
          echo "Download failed. Check the network or the URL."
          exit 1
        fi
        echo "Extracting..."
        unzip -o "$ZIP_FILE"
        if [ $? -ne 0 ]; then
          echo "Extraction failed."
          exit 1
        fi
        cd "$EXTRACT_DIR" || { echo "Cannot enter directory $EXTRACT_DIR"; exit 1; }
        echo "Creating the credentials file..."
        mkdir -p /root/.go-bcecli
        cat > ~/.go-bcecli/credentials <<EOF
        [Defaults]
        Ak = "$BOS_AK"
        Sk = "$BOS_SK"
        EOF
        chmod 600 ~/.go-bcecli/credentials
        echo "bcecmd installed and configured; starting the data upload"
        ./bcecmd bos sync /mnt/cluster/dataset {{inputs.parameters.bos_path}}
        echo "Dataset uploaded and saved to the {{inputs.parameters.bos_path}} path on BOS"
      jobSpec:
        image: registry.baidubce.com/inference/aibox-ubuntu:v2.0-22.04
        replicas: 1
        envs:
          - name: BOS_AK
            value: '{{inputs.parameters.bos_ak}}'
          - name: BOS_SK
            value: '{{inputs.parameters.bos_sk}}'
        datasources:
          - type: pfs
            name: '{{inputs.parameters.pfs_id}}'
            sourcePath: '{{inputs.parameters.pfs_path}}'
            mountPath: /mnt/cluster/dataset
tasks:
  - name: sync-pfs-to-bos
    taskTemplateName: pfs-to-bos
    inputs:
      - name: queue_id
        value: aihcq-h1plvpzb5gh0
      - name: bos_path
        value: bos://my-bucket/datasets/
      - name: bos_ak
        value: <your-ak>
      - name: bos_sk
        value: <your-sk>
      - name: pfs_path
        value: /datasets/my-dataset
      - name: pfs_id
        value: pfs-xxxx
```
Clean up data stored on PFS
Delete data on PFS to free up storage space, for example automatically cleaning up a dataset after training finishes.
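For the after-training use case, the cleanup task can be chained behind a training task with `dependencies`, as in the workflow example earlier in this document (the `train-task` name and its template are illustrative):

```yaml
# Illustrative: run the cleanup only after training finishes
tasks:
  - name: train-task                        # illustrative training task
    taskTemplateName: my-train-template     # illustrative template name
  - name: pfs-file-remove
    taskTemplateName: pfs-remove
    dependencies:
      - train-task                          # cleanup waits for training
```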
Parameter description:
| Parameter | Description |
|---|---|
| queue_id | Queue ID |
| pfs_path | PFS source path, e.g. /datasets/my-dataset |
| pfs_id | PFS instance ID |
```yaml
version: v1
kind: PipelineTemplate
taskTemplates:
  - name: pfs-remove
    type: CustomTask
    inputs:
      - name: queue_id
        type: string
        hint: queue ID
      - name: pfs_id
        type: string
        hint: PFS instance ID
      - name: pfs_path
        type: string
        hint: PFS source path
    spec:
      queue: '{{inputs.parameters.queue_id}}'
      jobType: PyTorchJob
      command: |
        # Delete the contents of the mounted directory (the mount point itself cannot be removed)
        rm -rf /mnt/cluster/dataset/*
        echo "Dataset deleted: data under the {{inputs.parameters.pfs_path}} path on PFS has been removed"
      priority: normal
      jobSpec:
        image: registry.baidubce.com/inference/aibox-ubuntu:v2.0-22.04
        replicas: 1
        datasources:
          - type: pfs
            name: '{{inputs.parameters.pfs_id}}'
            sourcePath: '{{inputs.parameters.pfs_path}}'
            mountPath: /mnt/cluster/dataset
tasks:
  - name: pfs-file-remove
    taskTemplateName: pfs-remove
    inputs:
      - name: queue_id
        value: aihcq-h1plvpzb5gh0
      - name: pfs_path
        value: /datasets/my-dataset
      - name: pfs_id
        value: pfs-xxxx
```
More templates are on the way...