通过VK获取实例监控指标
除了通过百度云监控BCM查看 BCI 实例监控指标外,BCI 还支持通过 VK 查看 BCI 实例监控指标。
通过 VK 查看监控指标的方式与 Kubelet 类似,本文将分三部分介绍:
- 监控指标来源
- 支持查看的指标
- 查看指标的方式
监控指标来源
BCI实例监控指标的流向如下图所示
支持查看的指标
通过VK提供的监控指标是prometheus指标格式,与BCM上看到的指标有较大的差异。vk 提供的监控指标格式如下
# 指标的帮助说明
# 指标名称 指标类型
指标名称{标签,标签,...} 指标取值 指标时间戳
指标名称{标签,标签,...} 指标取值 指标时间戳
# 指标的帮助说明
# 指标名称 指标类型
指标名称{标签,标签,...} 指标取值 指标时间戳
指标名称{标签,标签,...} 指标取值 指标时间戳
指标分为多类,每一类指标的指标名称和指标类型是都一样的,每一类指标又可分为多条(指标名称一样,标签不一样)。
每条指标都包含指标名称、标签集合、指标取值、指标时间戳四个部分。
如下图示例说明:
支持的指标:
CPU
- container_cpu_usage_seconds_total:容器创建以来,累计使用cpu的总时间(单位:秒)。
内存
- container_memory_working_set_bytes:容器正在使用的内存字节数(单位:字节)。使用的内存字节数不包含Page Cache占用的内存空间
磁盘
- container_fs_reads_bytes_total:容器创建以来,累计从文件系统读取的字节数(单位:字节)。
- container_fs_writes_bytes_total:容器创建以来,累计写入到文件系统的字节数(单位:字节)。
网络
- container_network_receive_bytes_total:容器创建以来,累计接收的网络数据字节数(单位:字节)。
- container_network_receive_packets_total:容器创建以来,累计接收的网络数据IP包数(单位:个)。
- container_network_transmit_bytes_total:容器创建以来,累计发送的网络数据字节数(单位:字节)。
- container_network_transmit_packets_total:容器创建以来,累计发送的网络数据IP包数(单位:个)。
GPU指标
目前BCI支持的GPU指标如下:
指标名称 | 类型 | 描述 | 示例值 |
DCGM_FI_DEV_SM_CLOCK | gauge | # HELP DCGM_FI_DEV_SM_CLOCK SM clock frequency (in MHz). | 1695 |
DCGM_FI_DEV_MEM_CLOCK | gauge | # HELP DCGM_FI_DEV_MEM_CLOCK Memory clock frequency (in MHz). | 6250 |
DCGM_FI_DEV_GPU_TEMP | gauge | # HELP DCGM_FI_DEV_GPU_TEMP GPU temperature (in C). | 56 |
DCGM_FI_DEV_POWER_USAGE | gauge | # HELP DCGM_FI_DEV_POWER_USAGE Power draw (in W). | 75.738000 |
DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION | counter | # HELP DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION Total energy consumption since boot (in mJ). | 164108202242 |
DCGM_FI_DEV_PCIE_REPLAY_COUNTER | counter | # HELP DCGM_FI_DEV_PCIE_REPLAY_COUNTER Total number of PCIe retries. | 0 |
DCGM_FI_DEV_GPU_UTIL | gauge | # HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %). | 0 |
DCGM_FI_DEV_MEM_COPY_UTIL | gauge | # HELP DCGM_FI_DEV_MEM_COPY_UTIL Memory utilization (in %). | 0 |
DCGM_FI_DEV_ENC_UTIL | gauge | # HELP DCGM_FI_DEV_ENC_UTIL Encoder utilization (in %). | 0 |
DCGM_FI_DEV_DEC_UTIL | gauge | # HELP DCGM_FI_DEV_DEC_UTIL Decoder utilization (in %). | 0 |
DCGM_FI_DEV_XID_ERRORS | gauge | # HELP DCGM_FI_DEV_XID_ERRORS Value of the last XID error encountered. | 0 |
DCGM_FI_DEV_FB_FREE | gauge | # HELP DCGM_FI_DEV_FB_FREE Framebuffer memory free (in MiB). | 5774 |
DCGM_FI_DEV_FB_USED | gauge | # HELP DCGM_FI_DEV_FB_USED Framebuffer memory used (in MiB). | 16957 |
DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL | counter | # HELP DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL Total number of NVLink bandwidth counters for all lanes | 0 |
DCGM_FI_DEV_VGPU_LICENSE_STATUS | gauge | # HELP DCGM_FI_DEV_VGPU_LICENSE_STATUS vGPU License status | 0 |
DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS | counter | # HELP DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS Number of remapped rows for uncorrectable errors | 0 |
DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS | counter | # HELP DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS Number of remapped rows for correctable errors | 0 |
DCGM_FI_DEV_ROW_REMAP_FAILURE | gauge | # HELP DCGM_FI_DEV_ROW_REMAP_FAILURE Whether remapping of rows has failed | 0 |
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE | gauge | # HELP DCGM_FI_PROF_PIPE_TENSOR_ACTIVE Ratio of cycles the tensor (HMMA) pipe is active (in %). | 0.037385 |
DCGM_FI_PROF_GR_ENGINE_ACTIVE | gauge | # HELP DCGM_FI_PROF_GR_ENGINE_ACTIVE Ratio of time the graphics engine is active (in %). | 0.130759 |
DCGM_FI_PROF_SM_ACTIVE | gauge | # HELP DCGM_FI_PROF_SM_ACTIVE The ratio of cycles an SM has at least 1 warp assigned (in %). | 0.110320 |
DCGM_FI_PROF_SM_OCCUPANCY | gauge | # HELP DCGM_FI_PROF_SM_OCCUPANCY The ratio of number of warps resident on an SM (in %). | 0.045643 |
DCGM_FI_PROF_DRAM_ACTIVE | gauge | # HELP DCGM_FI_PROF_DRAM_ACTIVE Ratio of cycles the device memory interface is active sending or receiving data (in %). | 0.065867 |
DCGM_FI_PROF_PCIE_TX_BYTES | counter | # HELP DCGM_FI_PROF_PCIE_TX_BYTES The number of bytes of active pcie tx data including both header and payload. | 64522980 |
DCGM_FI_PROF_PCIE_RX_BYTES | counter | # HELP DCGM_FI_PROF_PCIE_RX_BYTES The number of bytes of active pcie rx data including both header and payload. | 8436715 |
如您使用CCE的prometheus 监控,在VK的cadvisor接口访问获取指标,需要将CCE原始的cadvisor指标采集任务关闭,因为原始的cadvisor指标只采集的免费的指标;
将cadvisor指标拷贝一份,添加想要采集的DCGM指标,配置如下:
job_name: 'cadvisor-gpu'
scheme: https
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
insecure_skip_verify: true
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
metrics_path: /metrics/cadvisor
kubernetes_sd_configs:
- role: node
relabel_configs:
- action: labelmap
regex: __meta_kubernetes_node_label_(.+)
- source_labels: [ instance ]
action: replace
target_label: node
metric_relabel_configs:
- source_labels: [__name__]
action: keep
regex: (machine_(cpu_cores|memory_bytes)|container_(cpu_usage_seconds_total|fs_limit_bytes|fs_reads_bytes_total|
fs_usage_bytes|fs_writes_bytes_total|memory_working_set_bytes|network_receive_bytes_total|
network_receive_packets_dropped_total|network_receive_packets_total|network_transmit_bytes_total|
network_transmit_packets_dropped_total|network_transmit_packets_total)|DCGM_(FI_DEV_GPU_UTIL|FI_DEV_GPU_TEMP|FI_DEV_FB_USED|FI_DEV_FB_FREE|FI_DEV_POWER_USAGE|FI_PROF_SM_ACTIVE|FI_PROF_PCIE_RX_BYTES|FI_PROF_PCIE_TX_BYTES))
支持的标签:
- container:容器名称
- id:容器复杂ID
- image:容器镜像地址
- name:容器简单ID
- namespace:Pod所在的名字空间
- pod:容器所在的Pod
- device:指标所属的块设备,此标签只在 container_fs_xxx 等几个指标中存在
- interface:指标所属的网卡设备,此标签只在 container_network_xxx 等几个指标中存在
注意:使用了镜像加速功能的容器,其监控指标中的 image 并非用户原始镜像,而是 {ACCELERATE_PREFIX}/transfer/{ACCOUNT_ID}/{USER_ORIGINAL_IMAGE}_accelerate {ACCELERATE_PREFIX} 镜像加速添加的前缀 {ACCOUNT_ID} 用户的账户ID {USER_ORIGINAL_IMAGE} 用户原始镜像地址
查看指标的方式
通过 VK 查看BCI监控指标的方式,与通过 Kubelet 查看 cAdvisor 指标的方式一样,通过如下接口查看
Host: {KubeletServer}
Port: 10250
URI: GET /metrics/cadvisor
Param:无
通过 vk 查看BCI实例监控指标,示例如下
./kubectl --kubeconfig={kubeconfig_path} get --raw "/api/v1/nodes/{vk_bci_kubelet_name}/proxy/metrics/cadvisor"
// {vk_bci_kubelet_name} vk 的 nodeName
通过 Prometheus 采集存储 BCI实例监控指标,简单配置(prometheus.yml)如下
global:
scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
# scrape_timeout is set to the global default (10s).
# Alertmanager configuration
alerting:
alertmanagers:
- static_configs:
- targets:
# - alertmanager:9093
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
# - "first_rules.yml"
# - "second_rules.yml"
# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
# The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
- job_name: "prometheus"
static_configs:
- targets: ["localhost:9090"]
- job_name: "cadvisor"
static_configs:
- targets: ['47.92.228.38:10250'] // 此处 47.92.228.38 是 vk 因特网可访问的 ip,10250 是 vk 的 https server 端口(注意不是 http)
scheme: https
metrics_path: /metrics/cadvisor
tls_config:
insecure_skip_verify: true // 作为一个demo示例,关闭 tls 协议
authorization:
credentials_file: /home/work/prometheus/token // 可以正常访问 node server 的 bearer token
relabel_configs:
- action: labelmap
regex: __meta_kubernetes_node_label_(.+)
- targets:vk 的 ip:port,注意该 ip 是需要能够被 prometheus 访问的,port 是 kubelet 对外暴露 /metrics/cadvisor 接口的端口(一般是10250)
- credentials_file:有访问 node 权限的 service account 的 token;
启动命令
./prometheus --config.file="prometheus.yml"