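Using the Baige (百舸) Service via the CLI
Last updated: 2025-06-19
Operating Baige through the command-line tool is subject to the same permission controls as operating in the Baige console. The user who owns the AK/SK must be granted the corresponding permissions in the Baige console (for example, developer or administrator permissions on resource pools, queues, and training jobs) before managing resource pools, queues, or training jobs.
General syntax and parameters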
Bash
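$ aihc [resource] <command> [flags] [options]
- Global parameter --help/-h: every command accepts this flag and prints its help and usage information.
- Common parameters for list commands:
  --order: sort order for list commands; accepted values are asc and desc, default desc.
  --orderBy: sort field, default createTime.
  --page: page number for commands that support pagination, default 1.
  -s, --size: page size, default 100.
  p/n/q: in commands with paging, enter p for the previous page, n for the next page, or q to quit.
  Flag names and defaults can differ between commands (the aihc job list help below shows --order-by with default createdAt), so confirm them with -h.
Viewing parameter descriptions
Append -h to any command to see its parameter descriptions, for example aihc job list -h.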
Bash
# Show the parameter descriptions for the job list command
$ aihc job list -h
List all training jobs.

Usage:
  aihc job list [flags]

Aliases:
  list, ls

Flags:
  -h, --help              help for list
  -n, --name string       Job name keyword
      --order string      Sort order (asc/desc) (default "desc")
  -o, --order-by string   Sort field (default "createdAt")
      --page int          Page number (default 1)
  -p, --pool string       Resource pool ID
  -q, --queue string      Queue ID
      --size int          Items per page (default 100)

Global Flags:
  -C, --config string   Global configuration file. (default "/Users/luyuchao/.aihc/config")
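The paging and sorting flags in the help output above can be combined in one call. The sketch below only prints the command instead of running it, so it works without the CLI installed. The helper name build_list_cmd is hypothetical; the flag names are taken from the aihc job list -h output.

```bash
# Hypothetical helper: print (do not run) an `aihc job list` command for one page.
# Arguments: page number, page size, sort order (asc or desc).
build_list_cmd() {
  page="${1:-1}"
  size="${2:-100}"
  order="${3:-desc}"
  case "$order" in
    asc|desc) ;;
    *) echo "order must be asc or desc" >&2; return 1 ;;
  esac
  printf 'aihc job list --page %s --size %s --order %s\n' "$page" "$size" "$order"
}

# Preview the command for page 2 with 50 items per page, oldest first.
build_list_cmd 2 50 asc
```

Printing the command first makes it easy to check the flags against -h before running anything against a real resource pool.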
Resource pool operations
Get the resource pool list
Bash
$ aihc pool list [flags]
  -h, --help             help for list
  -l, --limit int        Number of resource pools to return (default 100); sets the maximum number returned
      --offset int       Offset of the first resource pool to return
      --order string     Sort order (asc, desc) (default "desc")
  -o, --orderBy string   Order by field (name, createTime) (default "createTime"); sorts by name or creation time
  -p, --pageNo int       Page number (default 1); the page to return
  -s, --pageSize int     Page size (default 100); the number of resource pools per page
- Command example
Bash
# When there are multiple pages of data, enter p/n to move to the previous/next page; enter q to exit the current command
$ aihc pool list

Page 1 of 1 (Total items: 4)

NAME                     ID             STATUS   NODE_COUNT  GPU_COUNT  CREATED_AT
---------------------------------------------------------------------------------------------------------------------------------------
jun-gateway              cce-mj***fzk   running  1/2         4/8        2025-03-06 06:41:33
lingang-hpas-runtime-02  cce-zq***i4dy  running  1/1         0/0        2025-02-10 13:44:45
lingang-hpas-runtime     cce-4k***u9u   running  4/4         0/0        2025-02-10 12:19:42
workflow-lijipeng        cce-5u***vro   running  4/4         0/0        2025-01-21 11:22:09

Navigation:
  q - Quit

Enter command (p/n/q):
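For scripting, the table above can be reduced to a list of pool IDs. The following is a minimal sketch that parses text in the layout shown in the example; the function name pool_ids is hypothetical, and the sample rows are copied from the output above. Because aihc pool list prompts for p/n/q, this sketch assumes the output has already been captured to a file or variable; how the command behaves in a pipe is not covered by this document.

```bash
# Hypothetical helper: read `aihc pool list` output on stdin and print the IDs
# of pools whose STATUS column is "running".
pool_ids() {
  awk '
    $1 == "NAME" && $2 == "ID" { in_table = 1; next }  # header row starts the table
    in_table && /^-+$/         { next }                # skip the dashed separator
    in_table && NF == 0        { exit }                # a blank line ends the table
    in_table && $3 == "running" { print $2 }
  '
}

# Demo on a captured copy of the example output.
pool_ids <<'EOF'
Page 1 of 1 (Total items: 4)

NAME ID STATUS NODE_COUNT GPU_COUNT CREATED_AT
-------------------------------------------------
jun-gateway cce-mj***fzk running 1/2 4/8 2025-03-06 06:41:33
workflow-lijipeng cce-5u***vro running 4/4 0/0 2025-01-21 11:22:09

Navigation:
 q - Quit
EOF
```

The parser relies on whitespace-separated columns, so it assumes pool names contain no spaces.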
Get resource pool details
Bash
$ aihc pool get [flags]
  -h, --help          help for get
  -p, --pool string   Resource pool ID
- Command example (get details of a specified resource pool)
Bash
$ aihc pool get -p cce-mjtzk

resourcePool:
  metadata:
    createdAt: "2025-03-06T06:41:33Z"
    id: cce-mjtzk
    name: jun-gateway
    updatedAt: "2025-03-06T08:34:59Z"
  spec:
    associatedCpromIds: []
    associatedPfsId: ""
    createdBy: unpeng
    description: ""
    forbidDelete: true
    k8sVersion: 1.24.4
    region: bj
  status:
    gpuCount:
      total: 8
      used: 4
    gpuDetail:
    - gpudescriptor: baidu.com/l20_cgpu
      total: 8
      used: 4
    nodeCount:
      total: 2
      used: 1
    phase: running
- Command example (get details of the default resource pool)
Bash
# When a default resource pool is configured, -p/--pool can be omitted and the tool uses the default resource pool
$ aihc pool get

resourcePool:
  metadata:
    createdAt: "2023-09-17T15:48:04Z"
    id: cce-4h***1m
    name: IAT-Regression-pei
    updatedAt: "2025-02-11T08:21:50Z"
  spec:
    associatedCpromIds:
    - cprom-xm1***3k3s7
    associatedPfsId: pfs-p***jz
    createdBy: ""
    description: ""
    forbidDelete: true
    k8sVersion: 1.20.8
    region: bj
  status:
    gpuCount:
      total: 16
      used: 1
    gpuDetail:
    - gpudescriptor: baidu.com/xpu
      total: 8
      used: 0
    - gpudescriptor: baidu.com/a800_80g_cgpu
      total: 8
      used: 1
    nodeCount:
      total: 3
      used: 1
    phase: running
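The gpuCount block in the pool details gives enough information to work out how many GPUs are still free. Below is a small sketch that reads the output format shown above and prints total minus used. The function name gpu_free is hypothetical, and the parsing assumes the key order in the examples (total before used directly under gpuCount).

```bash
# Hypothetical helper: read `aihc pool get` output on stdin and print the
# number of unallocated GPUs (gpuCount.total - gpuCount.used).
gpu_free() {
  awk '
    $1 == "gpuCount:"           { in_gpu = 1; next }
    in_gpu && $1 == "total:"    { total = $2 }
    in_gpu && $1 == "used:"     { print total - $2; exit }
  '
}

# Demo on the status block from the first example above.
gpu_free <<'EOF'
status:
  gpuCount:
    total: 8
    used: 4
  gpuDetail:
  - gpudescriptor: baidu.com/l20_cgpu
    total: 8
    used: 4
EOF
```

With a configured CLI, the same function could be fed from aihc pool get -p followed by a pool ID.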
Get the node list
Bash
aihc node list [flags] [-p cce-xxx]
  -h, --help   help for node
- Command example (get the node list of a specified resource pool)
Bash
$ aihc node list -p cce-mjtzk

Page 1 of 1 (Total items: 3)

nodeName       statusPhase  instanceName           instanceId             gpuTotal  gpuAllocated  region  zone
192.168.12.62  running      cce-47vnv7ms-qervq3nr  cce-4hw7gn1m-9n9h4nnm  8         1             bj      zoneF
192.168.15.82  running      cce-8v14fyl5-5y5hnny6  cce-4hw7gn1m-vmmmfzua  0         0             bj      zoneF
192.168.12.74  running      cce-6zwnp4zf-ofksnm6m  cce-4hw7gn1m-z9ixjnjj  0         0             bj      zoneF

Navigation:
  q - Quit

Enter command (p/n/q):
- Command example (get the node list of the default resource pool)
Bash
# When a default resource pool is configured, -p/--pool can be omitted and the tool uses the default resource pool
$ aihc node list

Page 1 of 1 (Total items: 3)

nodeName       statusPhase  instanceName           instanceId             gpuTotal  gpuAllocated  region  zone
192.168.12.62  running      cce-47vnv7ms-qervq3nr  cce-4hw7gn1m-9n9h4nnm  8         1             bj      zoneF
192.168.15.82  running      cce-8v14fyl5-5y5hnny6  cce-4hw7gn1m-vmmmfzua  0         0             bj      zoneF
192.168.12.74  running      cce-6zwnp4zf-ofksnm6m  cce-4hw7gn1m-z9ixjnjj  0         0             bj      zoneF

Navigation:
  q - Quit

Enter command (p/n/q):
Queue operations
Get the queue list
Bash
aihc queue list/ls [flags]
  -h, --help          help for list
  -p, --pool string   Resource pool ID
- Command example (get the queue list of a specified resource pool)
Bash
$ aihc queue list -p cce-mjtzk

name                              state  queueType  reclaimable  disableOversell  createdTime
openapi-regular--dtea             Open   Elastic    True         False            2025-03-16T19:47:18Z
63a9f0ea7bb98050796b649e85481845  Open   Elastic    True         False            2025-03-14T03:47:13Z
default                           Open   Elastic    True         False            2025-03-14T03:47:12Z
- Command example (get the queue list of the default resource pool)
Bash
# When a default resource pool is configured, -p/--pool can be omitted and the tool uses the default resource pool
$ aihc queue list

name                              state  queueType  reclaimable  disableOversell  createdTime
openapi-regular--dtea             Open   Elastic    True         False            2025-03-16T19:47:18Z
63a9f0ea7bb98050796b649e85481845  Open   Elastic    True         False            2025-03-14T03:47:13Z
default                           Open   Elastic    True         False            2025-03-14T03:47:12Z
Get queue details
Bash
aihc queue get queueID [flags]
  -h, --help          help for get
  -p, --pool string   Resource pool ID
- Command example
Bash
# Get queue details in a specified resource pool
$ aihc queue get 63a9f0ea7bb98050796b649e85481845 -p cce-2k9sw0cjaihc

# Get queue details in the default resource pool
$ aihc queue get 63a9f0ea7bb98050796b649e85481845
Job operations
Get the job list
Bash
aihc job list/ls [flags]

Flags:
  -h, --help           help for list
  -n, --name string    Job name keyword
      --order string   Sort order (asc/desc) (default "desc")
      --page int       Page number (default 1)
  -p, --pool string    Resource pool ID
      --size int       Items per page (default 100)
- Command example (specified resource pool)
Bash
$ aihc job list -p cce-2k0cj

Page 1 of 1 (Total items: 6)


NAME                     JOB_ID                                           STATUS             CREATED_AT
aihc-createjob-test      pytorchjob-c0d35053-229d-4cfd-b44b-6dd7c39dde59  Created            2025-02-17T07:59:35Z
qwen2-vl-test-4-copy1    pytorchjob-a53c413c-8e96-416b-8d6a-bc3e1b16a406  Succeeded          2025-01-01T13:22:19Z
qwen2-vl-test-4          pytorchjob-0389039d-aae5-4601-8e9e-fb230867baa4  ManualTermination  2025-01-01T12:00:52Z
qwen2-vl-test-one-copy1  pytorchjob-c9ac1567-4306-4085-8432-5305f4dc600f  ManualTermination  2024-12-31T15:22:26Z
qwen2-vl-test-2          pytorchjob-5c6c22c4-2c8d-49e7-8c2e-46e5fc2caa24  ManualTermination  2024-12-31T14:26:05Z
qwen2-vl-test-one        pytorchjob-ba8b2830-06c6-462f-aaf7-8c2513733482  Succeeded          2024-12-31T14:18:47Z

Navigation:
  q - Quit

Enter command (p/n/q):
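The help output for aihc job list shows a name filter but no status filter, so selecting jobs by status has to happen on the client side. The sketch below filters a captured copy of the table by its STATUS column and prints the matching job IDs. The function name job_ids_with_status is hypothetical, and the parsing assumes the four-column layout shown above with no spaces inside job names.

```bash
# Hypothetical helper: read `aihc job list` output on stdin and print the
# JOB_ID of every row whose STATUS equals the first argument.
job_ids_with_status() {
  awk -v want="$1" '
    $1 == "NAME" && $2 == "JOB_ID" { in_table = 1; next }  # header row
    in_table && NF == 0            { exit }                # blank line ends the table
    in_table && $3 == want         { print $2 }
  '
}

# Demo on rows copied from the example output.
job_ids_with_status Succeeded <<'EOF'
Page 1 of 1 (Total items: 6)

NAME JOB_ID STATUS CREATED_AT
aihc-createjob-test pytorchjob-c0d35053-229d-4cfd-b44b-6dd7c39dde59 Created 2025-02-17T07:59:35Z
qwen2-vl-test-4-copy1 pytorchjob-a53c413c-8e96-416b-8d6a-bc3e1b16a406 Succeeded 2025-01-01T13:22:19Z
qwen2-vl-test-one pytorchjob-ba8b2830-06c6-462f-aaf7-8c2513733482 Succeeded 2024-12-31T14:18:47Z

Navigation:
 q - Quit
EOF
```

The resulting IDs can then be passed one at a time to commands such as aihc job get.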
- Command example (default resource pool)
Bash
$ aihc job list

Page 1 of 1 (Total items: 6)


NAME                     JOB_ID                                           STATUS             CREATED_AT
aihc-createjob-test      pytorchjob-c0d35053-229d-4cfd-b44b-6dd7c39dde59  Created            2025-02-17T07:59:35Z
qwen2-vl-test-4-copy1    pytorchjob-a53c413c-8e96-416b-8d6a-bc3e1b16a406  Succeeded          2025-01-01T13:22:19Z
qwen2-vl-test-4          pytorchjob-0389039d-aae5-4601-8e9e-fb230867baa4  ManualTermination  2025-01-01T12:00:52Z
qwen2-vl-test-one-copy1  pytorchjob-c9ac1567-4306-4085-8432-5305f4dc600f  ManualTermination  2024-12-31T15:22:26Z
qwen2-vl-test-2          pytorchjob-5c6c22c4-2c8d-49e7-8c2e-46e5fc2caa24  ManualTermination  2024-12-31T14:26:05Z
qwen2-vl-test-one        pytorchjob-ba8b2830-06c6-462f-aaf7-8c2513733482  Succeeded          2024-12-31T14:18:47Z

Navigation:
  q - Quit

Enter command (p/n/q):
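Create a single job
The AIHC CLI creates jobs through the Baige OpenAPI, so the job parameters it supports are limited to those the API supports. Reading the create-job API documentation (https://cloud.baidu.com/doc/AIHC/s/jm56inxn7) first will help you get started with the job creation command quickly.
Create a job by passing parameters directly
Bash
# Create a job by passing parameters directly
aihc job create [flags]

flags:
      --code-url string               Code URL; use the URL returned by the code upload command
      --code-dir string               Mount path of the code inside the container; defaults to /workspace
      --command string                Entry command of the training job (mutually exclusive with --script-file)
      --ds-mountpath string           Data source mount path
      --ds-name string                Data source name
      --ds-type string                Data source type
      --enable-bccl                   Enable BCCL; default false
      --enable-fault-tolerance        Enable fault tolerance; default false
      --enable-rdma                   Enable RDMA; default false
      --env strings                   Environment variables (key=value)
      --fault-tolerance-args string   Detailed fault-tolerance arguments
      --framework string              Job framework type (default "pytorch")
      --gpu strings                   GPU resources (type=count)
  -h, --help                          help for create
      --host-network                  Enable host networking; default false
      --image string                  Container image
  -f, --job-file strings              Path to a job configuration file (json/yaml)
      --label strings                 Labels (key=value)
      --local-code string             Local code path; the code is uploaded before the job is created
      --name string                   Job name
  -p, --pool string                   Resource pool ID (the default from the config file is used if not specified)
      --priority string               Job priority (low/normal/high) (default "normal")
      --privileged                    Enable privileged mode; default false
      --replicas int32                Number of job replicas (default 1)
      --script-file string            Path to a command script that supplies the training job's entry command (mutually exclusive with --command)
- Command example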
Bash
# --local-code is the local code path; --code-dir is the directory where the uploaded code is mounted
$ aihc job create \
  --name cli-codeupload-test \
  --image registry.baidubce.com/inf-qa/nginx:latest \
  --framework PyTorchJob \
  --command "sleep 1d" \
  --replicas 4 \
  --privileged=true \
  --fault-tolerance-args="--enable-replace=true --enable-hang-detection=true --hang-detection-log-timeout-minutes=7 --hang-detection-startup-toleration-minutes=15 --hang-detection-stack-timeout-minutes=3 --max-num-of-unconditional-retry=2 --unconditional-retry-observe-seconds=3600 --custom-log-patterns=timeout1 --custom-log-patterns=timeout2 --enable-use-nodes-of-last-job=true --enable-checkpoint-migration=true --internal-fault-tolerance-alarm-phone=10086,10010" \
  --priority high \
  --enable-bccl=false \
  --enable-fault-tolerance=true \
  --local-code /codeDir/file \
  --code-dir /workspace \
  -p cce-cm1jjxrq
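Create a job from a job template
- Command example (JSON template)
Bash
# Parameters can be passed in a JSON file; see the parameter template. For the parameters supported when creating a job, see the create-job API parameters: https://cloud.baidu.com/doc/AIHC/s/jm56inxn7
# The JSON file holds the API request body; create it on the host where the command runs before submitting
$ aihc job create -f job_info.json

# Keep the start command in a command file
$ aihc job create -f job_info.json --script-file command.txt

# Upload code when creating the job; --local-code takes the local code path
$ aihc job create -f job-info.yaml --local-code /file/path
- Upload code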
Bash
# Upload code with a command
aihc code upload --folder /code/filepath

Usage:
  aihc code upload [flags]

Flags:
  -f, --folder string   Path of the code folder to upload
  -h, --help            help for upload
  -p, --pool string     Resource pool ID (the default from the config file is used if not specified)
  -q, --queue string    Queue ID
Reference template for job parameters in JSON format:
JSON
{
  "name": "qwen2-vl-test-4-copy2",
  "queue": "",
  "jobFramework": "pytorch",
  "jobSpec": {
    "command": "cd /data \u0026\u0026 mkdir qwen2-vl\n\ncp /mnt/ca-p800-poc/llava-en-zh-300k.tar.gz /data/qwen2-vl/\n#cp /mnt/ca-p800-poc/llava-en-zh-2k.tar.gz /data/qwen2-vl/\ncd /data/qwen2-vl\n\ntar -zxvf llava-en-zh-300k.tar.gz \u0026\u0026 rm llava-en-zh-300k.tar.gz\n#tar -zxvf llava-en-zh-2k.tar.gz \u0026\u0026 rm llava-en-zh-2k.tar.gz\n\n# code\ncp /mnt/ca-p800-poc/models/qwen2-vl.tar.gz /home/\ncd /home/\ntar -zxvf qwen2-vl.tar.gz \u0026\u0026 rm qwen2-vl.tar.gz\n\n# model weights\ncp /mnt/ca-p800-poc/Qwen2-VL-7B-Instruct.tar.gz .\ntar -zxvf Qwen2-VL-7B-Instruct.tar.gz \u0026\u0026 rm Qwen2-VL-7B-Instruct.tar.gz\n\n#train lora\ncd /home/qwen2-vl/\nconda activate llamafactory_env\nexport PATH=/root/miniconda/envs/llamafactory_env/bin:$PATH\nexport PYTHONPATH=$PYTHONPATH:/home/qwen2-vl/\n\napt-get install dnsutils -y\npip install deepspeed==0.14.5\n\nsource env.sh\nnohup bash dist_train.sh lora \u0026\n\nsleep 100000",
    "image": "ccr-2ccrtest-vpc.cnc.bj.baidubce.com/yetao04-ca-test/qwen2vl_p800_image:v1.0",
    "imageConfig": {
      "username": "",
      "password": ""
    },
    "replicas": 4,
    "resources": [
      {
        "name": "kunlunxin.com/xpu",
        "quantity": 8
      },
      {
        "name": "sharedMemory",
        "quantity": 1024
      }
    ],
    "envs": [
      {
        "name": "AIHC_JOB_NAME",
        "value": "qwen2-vl-test-4-copy1"
      },
      {
        "name": "NCCL_IB_DISABLE",
        "value": "0"
      }
    ],
    "enableRDMA": true,
    "hostNetwork": true
  },
  "faultTolerance": false,
  "labels": [
    {
      "key": "aijob.cce.baidubce.com/ai-user-id",
      "value": "ac3553acbb8d4c5e9b212fc0a04c8f7d"
    },
    {
      "key": "aijob.cce.baidubce.com/ai-user-name",
      "value": "daichaonan"
    },
    {
      "key": "aijob.cce.baidubce.com/create-from-aihcp",
      "value": "true"
    },
    {
      "key": "aijob.cce.baidubce.com/openapi-jobid",
      "value": "pytorchjob-a53c413c-8e96-416b-8d6a-bc3e1b16a406"
    }
  ],
  "priority": "high",
  "datasources": [
    {
      "type": "pfsl1",
      "sourcePath": "/yetao04",
      "mountPath": "/mnt",
      "name": "pfs-zesqWP",
      "options": {
        "sizeLimit": 1000,
        "medium": "",
        "readOnly": false,
        "pfsL1ClusterIp": "172.16.0.221",
        "pfsL1ClusterPort": "8888"
      }
    },
    {
      "type": "hostpath",
      "sourcePath": "/ssd1",
      "mountPath": "/data",
      "name": "hostpath-1",
      "options": {
        "sizeLimit": 0,
        "medium": "",
        "readOnly": false
      }
    }
  ],
  "faultToleranceConfig": {
    "enabledHangDetection": false,
    "hangDetectionTimeoutMinutes": 0,
    "faultToleranceLimit": 0,
    "customFaultTolerancePattern": null
  },
  "alertConfig": null,
  "enableBccl": false
}
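A job file does not have to be written by hand. The following sketch prints a much smaller job file using a handful of field names from the template above (name, jobFramework, jobSpec.command, jobSpec.image, jobSpec.replicas, priority). The function name make_job_json is hypothetical, and whether this subset is sufficient is an assumption: the create-job API documentation linked earlier defines which fields are actually required.

```bash
# Hypothetical helper: print a small job file to stdout.
# Arguments: job name, image, replica count, entry command.
# No JSON escaping is done, so the values must not contain double quotes or backslashes.
make_job_json() {
  name="$1"
  image="$2"
  replicas="$3"
  cmd="$4"
  cat <<EOF
{
  "name": "$name",
  "jobFramework": "pytorch",
  "jobSpec": {
    "command": "$cmd",
    "image": "$image",
    "replicas": $replicas
  },
  "priority": "normal"
}
EOF
}

# Print a job file for a one-replica test job.
make_job_json cli-template-test registry.baidubce.com/inf-qa/nginx:latest 1 "sleep 1d"
```

Redirecting the output to job_info.json gives a file that can be passed to aihc job create -f job_info.json.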
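- Command example (YAML template)
Bash
# Parameters can be passed in a YAML file; see the parameter template. For the parameters supported when creating a job, see the create-job API parameters: https://cloud.baidu.com/doc/AIHC/s/jm56inxn7
# The YAML file holds the API request body
$ aihc job create -f ./job_info.yaml

# Upload code when creating the job and keep the start command in a command file
$ aihc job create -f job_info.yaml --local-code folder --script-file command.txt
Reference template for job parameters in YAML format: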
YAML
name: qwen2-vl-test-4-cli-test
queue: ""
jobFramework: pytorch
jobSpec:
  command: |
    sleep 100000
  image: ccr-2ccrtest-vpc.cnc.bj.baidubce.com/yetao04-ca-test/qwen2vl_p800_image:v1.0
  imageConfig:
    username: ""
    password: ""
  replicas: 4
  resources:
    - name: kunlunxin.com/xpu
      quantity: 8
    - name: rdma/hca
      quantity: 1
    - name: sharedMemory
      quantity: 1024
  envs:
    - name: AIHC_JOB_NAME
      value: qwen2-vl-test-4-copy1
    - name: NCCL_IB_DISABLE
      value: "0"
  enableRDMA: true
  hostNetwork: true
faultTolerance: true
labels:
  - key: aijob.cce.baidubce.com/ai-user-id
    value: ac3553acbb8d4c5e9b212fc0a04c8f7d
  - key: aijob.cce.baidubce.com/ai-user-name
    value: daichaonan
  - key: aijob.cce.baidubce.com/create-from-aihcp
    value: "true"
  - key: aijob.cce.baidubce.com/openapi-jobid
    value: pytorchjob-a53c413c-8e96-416b-8d6a-bc3e1b16a406
priority: high
codeSource:
  mountPath: /workspace
datasources:
  - type: hostpath
    sourcePath: /
    mountPath: /mnt/rapidfs
    name: rapidfs-p800
enableBccl: false
faultToleranceArgs: "--enable-replace=true --enable-hang-detection=true --hang-detection-log-timeout-minutes=7 --hang-detection-startup-toleration-minutes=15 --hang-detection-stack-timeout-minutes=3 --max-num-of-unconditional-retry=2 --unconditional-retry-observe-seconds=3600 --custom-log-patterns=timeout1 --custom-log-patterns=timeout2 --enable-use-nodes-of-last-job=true --enable-checkpoint-migration=true --internal-fault-tolerance-alarm-phone=10086,10010"
Batch job submission
Bash
aihc job create -f job-1.yaml -f job-2.yaml ... -f ...
- Command example
Bash
$ aihc job create -f ./job-1.yaml -f ./job-2.yaml
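When the list of job files is long, the repeated -f flags can be generated rather than typed. This is a minimal sketch that prints the batch command for a set of files without running it; the function name build_batch_cmd is hypothetical, and the sketch assumes the file paths contain no spaces.

```bash
# Hypothetical helper: print (do not run) one `aihc job create` command
# with a -f flag for every file passed as an argument.
build_batch_cmd() {
  cmd="aihc job create"
  for f in "$@"; do
    cmd="$cmd -f $f"
  done
  printf '%s\n' "$cmd"
}

# Preview the command for two job files.
build_batch_cmd ./job-1.yaml ./job-2.yaml
```

Calling it with a glob such as ./jobs/*.yaml would preview a submission for every file in a directory.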
Delete a training job
Bash
aihc job delete jobID [flags]
  -h, --help          help for delete
  -p, --pool string   Resource pool ID
- Example command
Bash
# Delete a job in a specified resource pool
$ aihc job delete pytorchjob-c0d3504cfd-b44b-6dd7c39dde59 -p cce-2k0cj

# Delete a job in the default resource pool
$ aihc job delete pytorchjob-c0d3504cfd-b44b-6dd7c39dde59
Stop a training job
Bash
aihc job stop jobID [flags]

Flags:
  -h, --help          help for stop
  -p, --pool string   Resource pool ID
- Example command
Bash
# Stop a job in a specified resource pool
$ aihc job stop pytorchjob-c0d3504cfd-b44b-6dd7c39dde59 -p cce-2k0cj

# Stop a job in the default resource pool
$ aihc job stop pytorchjob-c0d3504cfd-b44b-6dd7c39dde59
Get job details
Bash
aihc job get jobID [flags]
Flags:
  -h, --help          help for get
      --pods          Show the job's instance (pod) information
  -p, --pool string   Resource pool ID
  -s, --status        Show a brief summary of the job status
- Example command
Bash
# Get details of a job in the default resource pool
# (the lookup fails here because this job does not belong to the default pool)
$ aihc job get pytorchjob-c0d35053-229d-4cfd-b44b-6dd7c39dde59
Using default pool ID from config: cce-4hw7gn1m
failed to get job details: [Code: cce.warning.GetAIJobByJobIdFailed; Message: get job by jobid failed, err: pytorchjob-c0d35053-229d-4cfd-b44b-6dd7c39dde59 not found job; RequestId: a17a6ab1-9045-4346-9f1a-53733c621321]

# Get details of a job in a specified resource pool
$ aihc job get pytorchjob-c0d35053-229d-4cfd-b44b-6dd7c39dde59 -p cce-2k9sw0cj
openapigetjobresponseresult:
  jobid: pytorchjob-c0d35053-229d-4cfd-b44b-6dd7c39dde59
  name: aihc-createjob-test
  resourcepoolid: cce-2k9sw0cj
  command: sleep 100
  createdat: "2025-02-17T07:59:35Z"
  finishedat: ""
  runningat: ""
  scheduledat: ""
  datasources: []
  enablefaulttolerance: false
  customfaulttolerancepattern: []
  labels:
  - key: aijob.cce.baidubce.com/ai-user-id
    value: eca97e148cb74e9683d7b7240829d1ff
  - key: aijob.cce.baidubce.com/ai-user-name
    value: root
  - key: aijob.cce.baidubce.com/create-from-aihcp-api
    value: "true"
  - key: aijob.cce.baidubce.com/openapi-jobid
    value: pytorchjob-c0d35053-229d-4cfd-b44b-6dd7c39dde59
  - key: key
    value: value
  priority: high
  queue: default
  status: Created
  image: registry.baidubce.com/aihc-aiak/aiak-training-llm
  resources:
  - name: baidu.com/a100_80g_cgpu
    quantity: 8
  enablerdma: false
  hostnetwork: false
  replicas: 2
  envs:
  - name: AIHC_JOB_NAME
    value: aihc-createjob-test
  - name: AIHC_TENSORBOARD_LOG_PATH
    value: ""
  - name: CUDA_DEVICE_MAX_CONNECTIONS
    value: "1"
  - name: NCCL_DEBUG
    value: INFO
  jobframework: pytorch
  queueingsequence: null
  enablebccl: false
  enablebcclstatus: unknown
  enablebcclerrorreason: ""
  k8suid: 3168a331-b584-46c6-b6cc-b01214e217a7
  k8snamespace: default
podlist:
  listmeta:
    totalitems: 0
  pods: []
Query job status
Bash
aihc job get jobID --status/-s
- Example command
Bash
$ aihc job get pytorchjob-c0d35053-229d-4cfd-b44b-6dd7c39dde59 -p cce-2k9sw0cj -s
name                 priority  pool/queue            replicas  status   runningAt  scheduledAt  createdAt
aihc-createjob-test  high      cce-2k9sw0cj/default  2         Created                          2025-02-17T07:59:35Z
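A script that waits for a job usually needs only the status word from this summary. The sketch below locates the status column from the header row and prints its value from the first data row. The function name job_status is hypothetical. It relies on the columns to the left of status being non-empty, which holds in the example above, and it skips any notice lines printed before the header.

```bash
# Hypothetical helper: read `aihc job get <jobID> -s` output on stdin and
# print the value in the status column of the first data row.
job_status() {
  awk '
    col == 0 && $1 == "name" {
      # Find the position of the "status" header.
      for (i = 1; i <= NF; i++) if ($i == "status") col = i
      next
    }
    col > 0 { print $col; exit }
  '
}

# Demo on the example output above, preceded by a notice line.
job_status <<'EOF'
Using default pool ID from config: cce-4hw7gn1m
name priority pool/queue replicas status runningAt scheduledAt createdAt
aihc-createjob-test high cce-2k9sw0cj/default 2 Created 2025-02-17T07:59:35Z
EOF
```

A polling loop could call the CLI with -s, pipe the result through job_status, and sleep between attempts until the value matches a status such as Succeeded.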
Query the job instance list
Bash
aihc pod list jobID
- Example command
Bash
# List the instances of a job in a specified resource pool
$ aihc pod list pytorchjob-c0d35053-229d-4cfd-b44b-6dd7c39dde59 -p cce-2k9sw0cj

# List the instances of a job in the default resource pool
$ aihc pod list pytorchjob-a3a7f6cf-43c9-4792-b0e7-fdb454a7555b

replicaType  name                          namespace  podPhase  status   creationTimestamp
master       pxy-moe-48hours-cpu-master-0  default    Running   Running  2024-12-17T13:22:41Z
Connect to a job instance
Bash
aihc job exec jobID [flags]

Flags:
  -c, --container string   Container name
  -h, --help               help for exec
  -i, --interactive        Keep standard input open
  -n, --namespace string   Namespace (defaults to default)
  -p, --pool string        Resource pool ID
  -t, --tty                Allocate a pseudo-terminal
- Example command
Bash
$ aihc job exec pytorchjob-c0d35053-229d-4cfd-b44b-6dd7c39dde59 -it pxy-moe-48hours-cpu-master-0 bin/bash
Copy a job configuration
Bash
aihc job export jobID [flags]

Flags:
  -h, --help          help for export
  -p, --pool string   Resource pool ID
- Example command
Bash
$ aihc job export pytorchjob-c0d35053-229d-4cfd-b44b-6dd7c39dde59
Query job logs
Bash
aihc job logs jobID [flags]

Flags:
  -c, --chunk string        Group log output into chunks of this many lines, for example 10 log lines per record; default 0
      --filepath string     Log file path
  -h, --help                help for logs
      --log-source string   Log source, for example node
      --marker string       Start position of the logs, used for paginated queries
  -m, --max-lines string    Maximum number of log lines
  -n, --namespace string    Namespace (defaults to default)
      --podname string      Pod name
  -p, --pool string         Resource pool ID
  -q, --queue string        Queue ID
  -s, --start-time string   Start time of the logs, as a Unix timestamp
- Example command
Bash
$ aihc job logs pytorchjob-a3a7f6cf-43c9-4792-b0e7-fdb454a7555b --podname ui-test-test-running-master-0 -p cce-hcuw9ybk
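Because --start-time takes a Unix timestamp, a relative window such as "the last 30 minutes" has to be converted first. The sketch below computes the timestamp and prints the resulting command without running it. The function name minutes_ago_ts is hypothetical, the job, pod, and pool values are copied from the example above, and the sketch assumes the timestamp is expected in seconds, which the flag description does not state.

```bash
# Hypothetical helper: print the Unix timestamp (in seconds) for N minutes ago.
minutes_ago_ts() {
  now="$(date +%s)"
  echo $(( now - $1 * 60 ))
}

# Placeholder values taken from the example command above.
JOB_ID="pytorchjob-a3a7f6cf-43c9-4792-b0e7-fdb454a7555b"
POD="ui-test-test-running-master-0"
POOL="cce-hcuw9ybk"

# Preview a logs query limited to the last 30 minutes and at most 200 lines.
echo "aihc job logs $JOB_ID --podname $POD -p $POOL --start-time $(minutes_ago_ts 30) --max-lines 200"
```

If the service turns out to expect milliseconds, the computed value would need to be multiplied by 1000.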