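Get Training Job Scheduling Diagnosis Result
Last updated: 2025-01-08
Description
Retrieves the result of a scheduling diagnosis for a training job. For details on the scheduling diagnosis capability, see: Scheduling Diagnosis
Request structure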
GET /api/v1/aifd/schedulediagnosis/job/query
Host:aihc.bj.baidubce.com
Authorization:authorization string
Content-Type: application/json
Request headers
No special headers other than the common headers.
Request parameters
Parameter | Type | Required | Location | Description |
---|---|---|---|---|
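diagnosisId | String | Yes | Query parameter | The diagnosisId returned by the job scheduling diagnosis request |
Response headers
No special headers other than the common headers.
Response parameters
Parameter | Type | Required | Description |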
---|---|---|---|
requestId | String | Yes | Request ID |
result | Array of ScheduleDiagnosisResult | Yes | Job information returned when the request succeeds |
Request example
GET /api/v1/aifd/schedulediagnosis/job/query?diagnosisId=9de03782-1b7f-4183-951b-xxxxxxxx
Host: aihc.bj.baidubce.com
Authorization: string
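As a rough illustration of how the request line above is assembled, the following Python sketch builds the query URL from a diagnosisId. The `https` scheme and the helper name `build_query_url` are assumptions made for illustration and are not part of the API; computing the Authorization header is not covered here.

```python
from urllib.parse import urlencode, urlunsplit

HOST = "aihc.bj.baidubce.com"
PATH = "/api/v1/aifd/schedulediagnosis/job/query"


def build_query_url(diagnosis_id: str, scheme: str = "https") -> str:
    """Return the full URL for querying a diagnosis result.

    The scheme default is an assumption; the documentation only gives
    the host and path.
    """
    if not diagnosis_id:
        # diagnosisId is marked as required in the parameter table.
        raise ValueError("diagnosisId is required")
    query = urlencode({"diagnosisId": diagnosis_id})
    return urlunsplit((scheme, HOST, PATH, query, ""))
```

The returned URL can be passed to any HTTP client together with the common headers and a valid Authorization value.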
Response example
{
  "requestId": "c8427910-3451-4436-b8e7-56eb382b00fb",
  "result": [
    {
      "detail": [
        {
          "allocatable": "",
          "detail": "node(s) were unschedulable",
          "job": "sulingang-diagnose-node-taint",
          "namespace": "default",
          "node": "10.3.0.118",
          "pod": "sulingang-diagnose-node-taint-master-0",
          "queue": "default",
          "request": "",
          "resourceName": ""
        }
      ],
      "ruleName": "NodeAvailabilityDiagnostics",
      "ruleType": "NodeAvailabilityDiagnostics"
    },
    {
      "detail": [
        {
          "allocatable": "",
          "detail": "node(s) had untolerated taint: {kwok.x-k8s.io/node: fake}",
          "job": "sulingang-diagnose-node-taint",
          "namespace": "default",
          "node": "fake-node-a800-01",
          "pod": "sulingang-diagnose-node-taint-master-0",
          "queue": "default",
          "request": "",
          "resourceName": ""
        },
        {
          "allocatable": "",
          "detail": "node(s) had untolerated taint: {kwok.x-k8s.io/node: fake}",
          "job": "sulingang-diagnose-node-taint",
          "namespace": "default",
          "node": "fake-node-a800-04",
          "pod": "sulingang-diagnose-node-taint-master-0",
          "queue": "default",
          "request": "",
          "resourceName": ""
        },
        {
          "allocatable": "",
          "detail": "node(s) had untolerated taint: {kwok.x-k8s.io/node: fake}",
          "job": "sulingang-diagnose-node-taint",
          "namespace": "default",
          "node": "fake-node-a800-03",
          "pod": "sulingang-diagnose-node-taint-master-0",
          "queue": "default",
          "request": "",
          "resourceName": ""
        },
        {
          "allocatable": "",
          "detail": "node(s) had untolerated taint: {kwok.x-k8s.io/node: fake}",
          "job": "sulingang-diagnose-node-taint",
          "namespace": "default",
          "node": "fake-node-a800-02",
          "pod": "sulingang-diagnose-node-taint-master-0",
          "queue": "default",
          "request": "",
          "resourceName": ""
        }
      ],
      "ruleName": "NodeTaintToleranceDiagnostics",
      "ruleType": "NodeAvailabilityDiagnostics"
    },
    {
      "detail": [
        {
          "allocatable": "0",
          "detail": "",
          "job": "sulingang-diagnose-node-taint",
          "namespace": "default",
          "node": "10.3.0.123",
          "pod": "sulingang-diagnose-node-taint-master-0",
          "queue": "default",
          "request": "8",
          "resourceName": "nvidia.com/gpu"
        },
        {
          "allocatable": "0",
          "detail": "",
          "job": "sulingang-diagnose-node-taint",
          "namespace": "default",
          "node": "10.3.0.122",
          "pod": "sulingang-diagnose-node-taint-master-0",
          "queue": "default",
          "request": "8",
          "resourceName": "nvidia.com/gpu"
        }
      ],
      "ruleName": "GPUResourceDiagnostics",
      "ruleType": "ResourceCapacityDiagnostics"
    }
  ]
}
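To make the shape of the response easier to work with, here is a minimal Python sketch that groups the per-node findings by ruleName. The function name `summarize` and the fallback message format for entries with an empty `detail` string are illustrative choices, not something the API defines; the field names are taken from the example above.

```python
import json
from collections import defaultdict


def summarize(response_text: str) -> dict:
    """Group (node, reason) pairs by ruleName from a response body."""
    body = json.loads(response_text)
    summary = defaultdict(list)
    for rule in body.get("result", []):
        for item in rule.get("detail", []):
            # Resource rules in the example leave "detail" empty and fill
            # resourceName/request/allocatable instead, so build a reason
            # string from those fields in that case.
            reason = item.get("detail") or (
                f'{item.get("resourceName")}: requested {item.get("request")}, '
                f'allocatable {item.get("allocatable")}'
            )
            summary[rule.get("ruleName")].append((item.get("node"), reason))
    return dict(summary)
```

Applied to the example response, this would yield one key per rule (for instance `GPUResourceDiagnostics`) with a list of the affected nodes and the reason reported for each.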
Error codes
Error code | Error message | HTTP status code | Description |
---|---|---|---|
InvalidParameter | Invalid parameter: XXX | 400 Bad Request | Parameter validation failed |
InternalError | Internal error: XXX | 500 Internal Server Error | Internal service error |
DiagnosisReportNotFound | The diagnosis report is not found, try later please. | 200 OK | The diagnosis report is not ready yet; retry later |
PreCheckError | Check before diagnosis error: XXX | 400 Bad Request | A pre-diagnosis check failed. The checks are: 1. scheduler component version and health; 2. job validity; 3. job status; 4. queue validity |
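Because DiagnosisReportNotFound is listed with HTTP status 200, a client cannot rely on the status code alone to tell whether the report is ready. The sketch below shows one possible polling approach in Python. It assumes the error code arrives in a `code` field of the JSON body, which this page does not specify; `classify`, `poll`, and the injected `fetch` callable are hypothetical names used only for illustration.

```python
import time
from typing import Callable, Tuple


def classify(status: int, body: dict) -> str:
    """Return "retry", "done", or "fail" for one response."""
    code = body.get("code")  # assumed field name for the error code
    if code == "DiagnosisReportNotFound":
        return "retry"  # report not finished yet
    if status == 200 and "result" in body:
        return "done"
    return "fail"  # InvalidParameter, PreCheckError, InternalError, ...


def poll(fetch: Callable[[], Tuple[int, dict]],
         attempts: int = 5, delay: float = 2.0) -> dict:
    """Call fetch() until the report is ready or attempts run out."""
    for _ in range(attempts):
        status, body = fetch()
        outcome = classify(status, body)
        if outcome == "done":
            return body
        if outcome == "fail":
            raise RuntimeError(f"diagnosis query failed: {status} {body}")
        time.sleep(delay)
    raise TimeoutError("diagnosis report was not ready in time")
```

Passing the HTTP call in as `fetch` keeps the retry logic independent of any particular HTTP client or signing method.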