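Get Training Job Scheduling Diagnosis Results
Last updated: 2025-05-23
Description
Retrieves the result of a scheduling diagnosis for a training job. For details on the scheduling diagnosis capability, see: Scheduling Diagnosis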
Request Structure
Bash
GET /api/v1/aifd/schedulediagnosis/job/query
Host: aihc.bj.baidubce.com
Authorization: authorization string
Content-Type: application/json
Request Headers
No special headers are required beyond the common headers.
Request Parameters
Parameter | Type | Required | Location | Description |
---|---|---|---|---|
diagnosisId | String | Yes | Query parameter | The diagnosisId returned when the job scheduling diagnosis was started |
Response Headers
No special headers are returned beyond the common headers.
Response Parameters
Parameter | Type | Required | Description |
---|---|---|---|
requestId | String | Yes | Request ID |
result | Array of ScheduleDiagnosisResult | Yes | Job diagnosis information returned when the request succeeds |
Request Example
JSON
GET /api/v1/aifd/schedulediagnosis/job/query?diagnosisId=9de03782-1b7f-4183-951b-xxxxxxxx
Host: aihc.bj.baidubce.com
Authorization: string
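The request can also be assembled programmatically. The following is a minimal Python sketch using only the standard library. The helper name `build_query_request` is hypothetical, the `https` scheme is an assumption (this page does not state the protocol), and the `Authorization` value is passed in as an opaque string because generating it is not covered here.

```python
from urllib.parse import urlencode
from urllib.request import Request

HOST = "aihc.bj.baidubce.com"
PATH = "/api/v1/aifd/schedulediagnosis/job/query"


def build_query_request(diagnosis_id: str, authorization: str,
                        scheme: str = "https") -> Request:
    """Build (but do not send) the GET request for a diagnosis result.

    `scheme` defaults to https, which is an assumption of this sketch.
    """
    if not diagnosis_id:
        raise ValueError("diagnosisId is required")
    # diagnosisId is the only documented query parameter.
    query = urlencode({"diagnosisId": diagnosis_id})
    url = f"{scheme}://{HOST}{PATH}?{query}"
    headers = {
        "Authorization": authorization,  # placeholder; not computed here
        "Content-Type": "application/json",
    }
    # The Host header is derived from the URL by urllib when the request is sent.
    return Request(url, headers=headers, method="GET")
```

Passing the returned object to `urllib.request.urlopen` would send the request; the sketch stops short of that so it can be inspected without network access.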
Response Example
JSON
{
  "requestId": "c8427910-3451-4436-b8e7-56eb382b00fb",
  "result": [
    {
      "detail": [
        {
          "allocatable": "",
          "detail": "node(s) were unschedulable",
          "job": "sulingang-diagnose-node-taint",
          "namespace": "default",
          "node": "10.3.0.118",
          "pod": "sulingang-diagnose-node-taint-master-0",
          "queue": "default",
          "request": "",
          "resourceName": ""
        }
      ],
      "ruleName": "NodeAvailabilityDiagnostics",
      "ruleType": "NodeAvailabilityDiagnostics"
    },
    {
      "detail": [
        {
          "allocatable": "",
          "detail": "node(s) had untolerated taint: {kwok.x-k8s.io/node: fake}",
          "job": "sulingang-diagnose-node-taint",
          "namespace": "default",
          "node": "fake-node-a800-01",
          "pod": "sulingang-diagnose-node-taint-master-0",
          "queue": "default",
          "request": "",
          "resourceName": ""
        },
        {
          "allocatable": "",
          "detail": "node(s) had untolerated taint: {kwok.x-k8s.io/node: fake}",
          "job": "sulingang-diagnose-node-taint",
          "namespace": "default",
          "node": "fake-node-a800-04",
          "pod": "sulingang-diagnose-node-taint-master-0",
          "queue": "default",
          "request": "",
          "resourceName": ""
        },
        {
          "allocatable": "",
          "detail": "node(s) had untolerated taint: {kwok.x-k8s.io/node: fake}",
          "job": "sulingang-diagnose-node-taint",
          "namespace": "default",
          "node": "fake-node-a800-03",
          "pod": "sulingang-diagnose-node-taint-master-0",
          "queue": "default",
          "request": "",
          "resourceName": ""
        },
        {
          "allocatable": "",
          "detail": "node(s) had untolerated taint: {kwok.x-k8s.io/node: fake}",
          "job": "sulingang-diagnose-node-taint",
          "namespace": "default",
          "node": "fake-node-a800-02",
          "pod": "sulingang-diagnose-node-taint-master-0",
          "queue": "default",
          "request": "",
          "resourceName": ""
        }
      ],
      "ruleName": "NodeTaintToleranceDiagnostics",
      "ruleType": "NodeAvailabilityDiagnostics"
    },
    {
      "detail": [
        {
          "allocatable": "0",
          "detail": "",
          "job": "sulingang-diagnose-node-taint",
          "namespace": "default",
          "node": "10.3.0.123",
          "pod": "sulingang-diagnose-node-taint-master-0",
          "queue": "default",
          "request": "8",
          "resourceName": "nvidia.com/gpu"
        },
        {
          "allocatable": "0",
          "detail": "",
          "job": "sulingang-diagnose-node-taint",
          "namespace": "default",
          "node": "10.3.0.122",
          "pod": "sulingang-diagnose-node-taint-master-0",
          "queue": "default",
          "request": "8",
          "resourceName": "nvidia.com/gpu"
        }
      ],
      "ruleName": "GPUResourceDiagnostics",
      "ruleType": "ResourceCapacityDiagnostics"
    }
  ]
}
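A response like this one is easier to act on once the per-node entries are grouped by rule. The sketch below shows one possible way to do that in Python. The function name `summarize_diagnosis` is hypothetical, and the reading of the fields is inferred from the example alone: entries with a non-empty `detail` string are treated as carrying their own reason, and entries with an empty `detail` are described from `resourceName`, `request`, and `allocatable`. The sample data is a trimmed copy of the response example.

```python
from collections import defaultdict


def summarize_diagnosis(response: dict) -> dict:
    """Group (node, reason) pairs by ruleName.

    Field semantics are inferred from the example response, not from a
    formal schema, so treat this as an illustration only.
    """
    summary = defaultdict(list)
    for rule in response.get("result", []):
        for item in rule.get("detail", []):
            if item.get("detail"):
                reason = item["detail"]
            else:
                # No textual reason: fall back to the resource figures.
                reason = (
                    f'{item.get("resourceName")}: '
                    f'requested {item.get("request")}, '
                    f'allocatable {item.get("allocatable")}'
                )
            summary[rule.get("ruleName", "")].append((item.get("node", ""), reason))
    return dict(summary)


# Trimmed copy of the response example (one entry per rule).
sample = {
    "requestId": "c8427910-3451-4436-b8e7-56eb382b00fb",
    "result": [
        {
            "detail": [{
                "allocatable": "", "request": "", "resourceName": "",
                "detail": "node(s) had untolerated taint: {kwok.x-k8s.io/node: fake}",
                "node": "fake-node-a800-01",
            }],
            "ruleName": "NodeTaintToleranceDiagnostics",
            "ruleType": "NodeAvailabilityDiagnostics",
        },
        {
            "detail": [{
                "allocatable": "0", "request": "8",
                "resourceName": "nvidia.com/gpu",
                "detail": "", "node": "10.3.0.123",
            }],
            "ruleName": "GPUResourceDiagnostics",
            "ruleType": "ResourceCapacityDiagnostics",
        },
    ],
}

for rule_name, entries in summarize_diagnosis(sample).items():
    for node, reason in entries:
        print(f"{rule_name} | {node} | {reason}")
```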
Error Codes
Error Code | Error Message | HTTP Status Code | Description |
---|---|---|---|
InvalidParameter | Invalid parameter: XXX | 400 Bad Request | Parameter validation failed |
InternalError | Internal error: XXX | 500 Internal Server Error | Internal service error |
DiagnosisReportNotFound | The diagnosis report is not found, try later please. | 200 OK | The diagnosis report is not ready yet; retry later |
PreCheckError | Check before diagnosis error: XXX | 400 Bad Request | A pre-diagnosis check failed. The checks are: 1. scheduler component version and health 2. job validity 3. job status 4. queue validity |
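Because DiagnosisReportNotFound is returned with HTTP status 200 OK, a client cannot rely on the status code alone to know whether the report is ready; it has to inspect the response body and retry. The following Python sketch illustrates one way to do so. It assumes that an error body carries top-level `code` and `message` fields, which this page does not show, so verify the actual error format before relying on it. The function name `poll_diagnosis` and its parameters are hypothetical, and `fetch` stands for any caller-supplied function that performs the query and returns the decoded JSON body.

```python
import time
from typing import Callable

# Error code that means "report not ready yet" according to the table above.
RETRYABLE_CODE = "DiagnosisReportNotFound"


def poll_diagnosis(fetch: Callable[[], dict], attempts: int = 5,
                   delay_seconds: float = 2.0,
                   sleep: Callable[[float], None] = time.sleep) -> dict:
    """Call `fetch` until a body without an error code is returned.

    Assumption: error bodies look like {"code": ..., "message": ...}.
    """
    for attempt in range(attempts):
        body = fetch()
        code = body.get("code")
        if code is None:
            return body  # no error code: treat as the finished report
        if code != RETRYABLE_CODE:
            # Other errors (e.g. InvalidParameter) will not resolve by waiting.
            raise RuntimeError(f"{code}: {body.get('message', '')}")
        if attempt < attempts - 1:
            sleep(delay_seconds)
    raise TimeoutError(f"diagnosis report not ready after {attempts} attempts")
```

Injecting `sleep` keeps the retry loop easy to exercise in tests without real delays; the attempt count and delay are arbitrary defaults, not documented limits.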