Query Training Job Events

Baige Heterogeneous Computing Platform (AIHC)

Description

Retrieves the system events of a training job.

Request structure

GET /api/v1/aijobs/{jobId}/events
Host:aihc.bj.baidubce.com
Authorization:authorization string
Content-Type: application/json

Request headers

Apart from the common headers, there are no special headers.

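Request parameters

| Parameter | Type | Required | Location | Description |
| --- | --- | --- | --- | --- |
| resourcePoolId | String | | Query | Unique identifier of the resource pool |
| jobId | String | | Path | Training job ID |
| jobFramework | String | | Query | Training job framework type; currently "PyTorchJob" is supported |
| startTime | String | | Query | Start of the time range for job events; defaults to the job creation time (Unix timestamp) |
| endTime | String | | Query | End of the time range for job events; defaults to now (Unix timestamp) |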
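
As an illustration of how the path and query parameters fit together, here is a minimal Python sketch that only builds the request URL. The helper name `build_events_url`, the `https` scheme, and the choice to always send `resourcePoolId` and `jobFramework` are assumptions made for this example; the Authorization header must still be generated according to the platform's signing rules, which this page does not cover.

```python
from urllib.parse import quote, urlencode

# Host taken from the request structure; the https scheme is an assumption.
BASE_URL = "https://aihc.bj.baidubce.com"


def build_events_url(job_id, resource_pool_id, job_framework="PyTorchJob",
                     start_time=None, end_time=None):
    """Build the GET URL for a job's events (hypothetical helper)."""
    params = {
        "resourcePoolId": resource_pool_id,
        "jobFramework": job_framework,
    }
    # Optional bounds: when omitted, the documented server-side defaults apply.
    if start_time is not None:
        params["startTime"] = str(start_time)
    if end_time is not None:
        params["endTime"] = str(end_time)
    # jobId is a path parameter, so it is percent-encoded into the path.
    path = "/api/v1/aijobs/{}/events".format(quote(job_id, safe=""))
    return "{}{}?{}".format(BASE_URL, path, urlencode(params))


if __name__ == "__main__":
    # Placeholder IDs; substitute real values from your own account.
    print(build_events_url("job-example", "pool-example"))
```

The resulting URL can be passed to any HTTP client together with the Host, Authorization, and Content-Type headers shown in the request structure.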

Response headers

Apart from the common headers, there are no special headers.

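Response parameters

| Parameter | Type | Description |
| --- | --- | --- |
| events | Array of Event | List of events |
| total | Number | Total number of events |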
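
This page does not define the Event type. The sketch below models the response in Python using field names inferred from the response example on this page (`reason`, `message`, `firstTimestamp`, `lastTimestamp`, `count`, `type`); treat them as an assumption rather than a published schema. The class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Event:
    # Fields inferred from the response example, not from an official schema.
    reason: str
    message: str
    first_timestamp: str
    last_timestamp: str
    count: int
    type: str


@dataclass
class EventsResponse:
    events: List[Event]
    total: int


def parse_events_response(payload: dict) -> EventsResponse:
    """Convert the decoded JSON body into typed objects."""
    events = [
        Event(
            reason=item.get("reason", ""),
            message=item.get("message", ""),
            first_timestamp=item.get("firstTimestamp", ""),
            last_timestamp=item.get("lastTimestamp", ""),
            count=int(item.get("count", 0)),
            type=item.get("type", ""),
        )
        for item in payload.get("events", [])
    ]
    # Fall back to the number of parsed events if "total" is absent.
    return EventsResponse(events=events, total=int(payload.get("total", len(events))))
```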

Response example

{
  "events": [
    {
      "reason": "JobTerminated",
      "message": "Job has been terminated. Deleting PodGroup",
      "firstTimestamp": "2024-07-15 16:52:50 +0000 UTC",
      "lastTimestamp": "2024-07-15 16:52:50 +0000 UTC",
      "count": 4,
      "type": "Normal"
    },
    {
      "reason": "SuccessfulDeletePodGroup",
      "message": "Deleted PodGroup: test-api-llama2-7b-4",
      "firstTimestamp": "2024-07-15 16:52:50 +0000 UTC",
      "lastTimestamp": "2024-07-15 16:52:50 +0000 UTC",
      "count": 4,
      "type": "Normal"
    },
    {
      "reason": "ExitedWithCode",
      "message": "Pod: default.test-api-llama2-7b-4-master-0 exited with code 1",
      "firstTimestamp": "2024-07-15 16:52:41 +0000 UTC",
      "lastTimestamp": "2024-07-15 16:52:50 +0000 UTC",
      "count": 2,
      "type": "Normal"
    },
    {
      "reason": "FailedToStartFaultTolerance",
      "message": "Pytorchjob: test-api-llama2-7b-4,  failed to start fault tolerance。Reason:check all nodes are healthy。",
      "firstTimestamp": "2024-07-15 16:52:50 +0000 UTC",
      "lastTimestamp": "2024-07-15 16:52:50 +0000 UTC",
      "count": 1,
      "type": "Warning"
    },
    {
      "reason": "JobFailed",
      "message": "PyTorchJob test-api-llama2-7b-4 is failed because 1 Master replica(s) failed.",
      "firstTimestamp": "2024-07-15 16:52:50 +0000 UTC",
      "lastTimestamp": "2024-07-15 16:52:50 +0000 UTC",
      "count": 1,
      "type": "Normal"
    },
    {
      "reason": "Error",
      "message": "Error pod test-api-llama2-7b-4-master-0 container pytorch exitCode: 1 terminated message: ",
      "firstTimestamp": "2024-07-15 16:52:41 +0000 UTC",
      "lastTimestamp": "2024-07-15 16:52:41 +0000 UTC",
      "count": 1,
      "type": "Warning"
    },
    {
      "reason": "SuccessfulCreateService",
      "message": "Created service: test-api-llama2-7b-4-master-0",
      "firstTimestamp": "2024-07-15 12:47:04 +0000 UTC",
      "lastTimestamp": "2024-07-15 12:47:04 +0000 UTC",
      "count": 1,
      "type": "Normal"
    },
    {
      "reason": "SuccessfulCreatePod",
      "message": "Created pod: test-api-llama2-7b-4-master-0",
      "firstTimestamp": "2024-07-15 12:47:04 +0000 UTC",
      "lastTimestamp": "2024-07-15 12:47:04 +0000 UTC",
      "count": 1,
      "type": "Normal"
    }
  ],
  "total": 8
}
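
When diagnosing a failed job, it can help to pull out only the `Warning` events and read them in chronological order. The Python sketch below does this for a payload shaped like the example above. It assumes timestamps always follow the `YYYY-MM-DD HH:MM:SS +0000 UTC` layout seen in the example; the function names are hypothetical, and `SAMPLE` is a trimmed copy of three events from the example.

```python
from datetime import datetime


def parse_event_time(value):
    """Parse timestamps shaped like '2024-07-15 16:52:50 +0000 UTC'."""
    # Keep date, time and numeric offset; drop the trailing zone name.
    date_part, time_part, offset = value.split()[:3]
    text = " ".join((date_part, time_part, offset))
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S %z")


def warnings_oldest_first(payload):
    """Return Warning events sorted by firstTimestamp, oldest first."""
    events = payload.get("events", [])
    warnings = [e for e in events if e.get("type") == "Warning"]
    return sorted(warnings, key=lambda e: parse_event_time(e["firstTimestamp"]))


# Trimmed copy of the response example (messages omitted for brevity).
SAMPLE = {
    "events": [
        {"reason": "FailedToStartFaultTolerance", "type": "Warning",
         "firstTimestamp": "2024-07-15 16:52:50 +0000 UTC"},
        {"reason": "JobFailed", "type": "Normal",
         "firstTimestamp": "2024-07-15 16:52:50 +0000 UTC"},
        {"reason": "Error", "type": "Warning",
         "firstTimestamp": "2024-07-15 16:52:41 +0000 UTC"},
    ],
    "total": 3,
}

if __name__ == "__main__":
    print([e["reason"] for e in warnings_oldest_first(SAMPLE)])
    # prints ['Error', 'FailedToStartFaultTolerance']
```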