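Integrating with CProm for Monitoring and Alerting
Last updated: 2024-09-19
Overview
A service mesh lets microservices obtain monitoring metrics for service-to-service requests without any code changes. This document describes how to connect the service mesh product CSM to the Prometheus monitoring service (CProm), so that you can configure monitoring alerts and view dashboards for the metrics in your mesh.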
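Prerequisites
- A CProm instance has been created in the same region as the Kubernetes cluster. For details, see: Create a CProm Instance.
- For Kubernetes clusters that are already running workloads, the CProm collection Agent must be installed to collect metrics. For details, see: Agent Management.
- Note: Managed service meshes do not currently support data plane monitoring and alerting.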
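Procedure
Managed mesh
Enable control plane metric monitoring
Note: The CProm collection Agent must be installed on every CCE cluster managed by the service mesh instance. Otherwise, the corresponding CProm instance cannot be selected.
1. Log in to the Baidu AI Cloud console and choose "Products & Services > Cloud Native > Service Mesh CSM > Mesh List".
2. Click the name of the target mesh instance, then choose Observability Management > Prometheus Monitoring in the left navigation pane. On the Prometheus Monitoring page, select Enable Now and choose the corresponding CProm instance.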
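Configure control plane metric monitoring
After control plane metric monitoring is enabled, you can do the following on the Observability Management > Prometheus Monitoring page:
- Select Grafana Service to go to the Grafana information page. You can access the Grafana dashboards through the Grafana public domain name with the corresponding username and password.
- Select View Details to view information about the current CProm instance.
- Select Configure Alerts to go to the corresponding CProm page for configuration. For the metrics to choose and the alert rules to configure, refer to the sections below: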
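Control plane monitoring metrics
Metric name | Type | Description |
---|---|---|
endpoint_no_pod | LastValue | Endpoints without an associated Pod |
pilot_endpoint_not_ready | LastValue | Endpoints found in an unready state |
pilot_destrule_subsets | LastValue | Duplicate subsets across destination rules for the same host |
pilot_duplicate_envoy_clusters | LastValue | Duplicate Envoy clusters caused by service entries with the same hostname |
pilot_no_ip | LastValue | Pods not found in the endpoint table, possibly invalid |
pilot_eds_no_instances | LastValue | Number of clusters without instances |
pilot_vservice_dup_domain | LastValue | Number of virtual services with duplicate domains |
citadel_server_root_cert_expiry_timestamp | LastValue | Unix timestamp (in seconds) at which the Citadel root certificate expires. A negative value indicates that the certificate has expired |
galley_validation_failed | Sum | Total number of resource validation failures |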
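Because citadel_server_root_cert_expiry_timestamp is a Unix timestamp in seconds, it can be compared with the PromQL time() function to warn before the root certificate expires, rather than only after. The following is a minimal sketch under that assumption. The alert name and the 30-day window are hypothetical and do not come from the CSM or CProm documentation.

```yaml
# Hypothetical rule: warn when the Citadel root certificate expires within 30 days.
- alert: IstioCitadelRootCertExpiringSoon   # illustrative name, not a built-in rule
  annotations:
    summary: "Citadel root certificate expires soon"
    description: "Citadel root certificate expires in less than 30 days"
  # time() is the evaluation time as a Unix timestamp in seconds, so the
  # subtraction yields the remaining certificate lifetime in seconds.
  # The ">= 0" guard excludes the negative values that mark an expired certificate.
  expr: >
    citadel_server_root_cert_expiry_timestamp >= 0
    and
    citadel_server_root_cert_expiry_timestamp - time() < 30 * 24 * 3600
```

The window is a judgment call: it should be long enough to leave time for rotating the certificate.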
Control plane alert configuration reference
# Pilot
- alert: IstioPilotPodNotInEndpointTable
  annotations:
    summary: "Pilot pods not found in the endpoint table"
    description: "Pods not found in the endpoint table, possibly invalid"
  expr: >
    pilot_no_ip > 0
- alert: IstioPilotEndpointNotReady
  annotations:
    summary: "Pilot endpoint found in unready state"
    description: "Pilot endpoint found in unready state for 30 seconds"
  expr: >
    pilot_endpoint_not_ready > 0
  for: 30s
- alert: IstioPilotDestruleSubsetsException
  annotations:
    summary: "Pilot pilot_destrule_subsets is greater than 0"
    description: "pilot_destrule_subsets Duplicate subsets across destination rules for same host"
  expr: >
    pilot_destrule_subsets > 0
- alert: IstioPilotDuplicateEnvoyClustersException
  annotations:
    summary: "Pilot pilot_duplicate_envoy_clusters is greater than 0"
    description: "pilot_duplicate_envoy_clusters Duplicate envoy clusters caused by service entries with same hostname"
  expr: >
    pilot_duplicate_envoy_clusters > 0
- alert: IstioPilotDuplicateEntry
  expr: sum(rate(pilot_duplicate_envoy_clusters{}[1m])) > 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "Istio Pilot Duplicate Entry on clusterID {{ $labels.clusterID }} and region {{ $labels.region }}"
    description: "Istio pilot duplicate entry error.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: IstioPilotEndpointNoPodException
  annotations:
    summary: "Pilot endpoint_no_pod is greater than 0"
    description: "endpoint_no_pod Endpoints without an associated pod"
  expr: >
    endpoint_no_pod > 0
- alert: IstioPilotEdsNoInstancesException
  annotations:
    summary: "Pilot pilot_eds_no_instances is greater than 0"
    description: "pilot_eds_no_instances Number of clusters without instances"
  expr: >
    pilot_eds_no_instances > 0
- alert: IstioPilotVserviceDupDomainException
  annotations:
    summary: "Pilot pilot_vservice_dup_domain is greater than 0"
    description: "pilot_vservice_dup_domain Virtual services with dup domains"
  expr: >
    pilot_vservice_dup_domain > 0
# CITADEL
- alert: IstioCitadelRootCertError
  annotations:
    summary: "Citadel root certificate internal error occurred"
    description: "Citadel root certificate internal error occurred on clusterID {{ $labels.clusterID }} and region {{ $labels.region }}"
  expr: >
    citadel_server_root_cert_expiry_timestamp < 0
- alert: IstioCitadelCertIssuanceFailure
  annotations:
    summary: "Citadel certificate issuance failed"
    description: "Citadel certificate issuance failed in the last 1 minute"
  expr: >
    (citadel_server_csr_count - citadel_server_success_cert_issuance_count) > (citadel_server_csr_count offset 1m - citadel_server_success_cert_issuance_count offset 1m)
- alert: IstioCitadelCsrSignError
  annotations:
    summary: "Citadel CSR signing error"
    description: "Citadel CSR signing error occurred in the last 1 minute on clusterID {{ $labels.clusterID }} and region {{ $labels.region }}"
  expr: >
    (absent(citadel_server_csr_sign_err_count offset 1m) == 1 and citadel_server_csr_sign_err_count > 0) or (citadel_server_csr_sign_err_count - citadel_server_csr_sign_err_count offset 1m > 0)
# GALLEY
- alert: IstioGalleyValidationFailed
  annotations:
    summary: "Galley validation failed"
    description: "Galley validation failed in the last 1 minute"
  expr: >
    (absent(galley_validation_failed offset 1m) == 1 and galley_validation_failed > 0) or (galley_validation_failed - galley_validation_failed offset 1m > 0)
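The rules above are written as list items. In a standard Prometheus rule file, such items sit under the rules key of a named group. The following is a minimal sketch, assuming CProm accepts the standard Prometheus rule-file layout (the console may instead ask for each field separately). The group name, the severity value, and the increase()-based variant of the Galley rule are illustrative assumptions.

```yaml
groups:
  - name: istio-control-plane        # hypothetical group name
    rules:
      # Same rule as in the reference above, with a severity label added
      # so that alert routing can distinguish it.
      - alert: IstioPilotPodNotInEndpointTable
        expr: pilot_no_ip > 0
        labels:
          severity: warning          # assumed value
        annotations:
          summary: "Pilot pods not found in the endpoint table"
          description: "Pods not found in the endpoint table, possibly invalid"
      # Possible alternative to the offset-based Galley rule: increase()
      # estimates how much a counter grew over the window and tolerates
      # counter resets. Unlike the absent(...) branch in the reference rule,
      # it does not fire for a series that has only just appeared.
      - alert: IstioGalleyValidationFailed
        expr: increase(galley_validation_failed[5m]) > 0
        annotations:
          summary: "Galley validation failed"
          description: "Galley validation failed in the last 5 minutes"
```

If the rules are kept in a file, running promtool check rules on that file (promtool ships with Prometheus) can catch syntax errors before the rules are uploaded.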
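Standalone mesh
Enable data plane metric monitoring
Note: The CProm collection Agent must be installed on every CCE cluster managed by the service mesh instance. Otherwise, the corresponding CProm instance cannot be selected.
- Option 1: When creating the service mesh instance, enable "Monitoring Metric Collection" and choose the corresponding CProm instance.
- Option 2: For an existing service mesh, choose Service Mesh > Mesh Management. On the Mesh Management page, click the name of the target instance, then choose Observability Management > Prometheus Monitoring in the left navigation pane. On the monitoring page, select Enable and choose the corresponding CProm instance.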
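Configure data plane metric monitoring
After data plane metric monitoring is enabled, you can do the following on the Observability Management > Prometheus Monitoring page:
- Select Grafana Service to go to the Grafana information page. You can access the Grafana dashboards through the Grafana public domain name.
- Select View Details to view information about the current CProm instance.
- Select Configure Alerts to go to the corresponding CProm page for configuration. For the metrics to choose and the alert rules to configure, refer to the sections below: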
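Data plane monitoring metrics:
Scope | Name | Description |
---|---|---|
Envoy | IstioEnvoyInternalUpstreamReq503TooHigh | The share of HTTP 503 internal upstream responses is above 1%, which is too high. |
Envoy | IstioEnvoyInternalUpstreamReq200TooLow | The share of HTTP 200 internal upstream responses is below 99.9%, which is too low. |
Envoy | IstioEnvoyUpstreamReq503TooHigh | The percentage of Envoy HTTP 503 upstream responses is too high |
Envoy | IstioEnvoyUpstreamReq200TooLow | The percentage of Envoy HTTP 200 upstream responses is too low |
Envoy | IstioEnvoyClusterBindErrors | Envoy cluster binding errors |
Envoy | IstioEnvoyClusterDstHostInvalid | Envoy cluster destination host is invalid |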
Data plane alert configuration reference:
- alert: IstioEnvoyInternalUpstreamReq503TooHigh
  annotations:
    summary: 'Envoy Percentage of HTTP 503 internal upstream responses is too high'
    description: "The amount of 503 internal upstream responses is higher than 1%. It is too high"
  expr: >
    rate(envoy_cluster_internal_upstream_rq_503[1m])/rate(envoy_cluster_internal_upstream_rq_completed[1m]) > 0.01
- alert: IstioEnvoyInternalUpstreamReq200TooLow
  annotations:
    summary: 'Envoy Percentage of HTTP 200 internal upstream responses is too low'
    description: "The amount of 200 internal upstream responses is lower than 99.9%. It is too low"
  expr: >
    rate(envoy_cluster_internal_upstream_rq_200[1m])/rate(envoy_cluster_internal_upstream_rq_completed[1m]) < 0.999
- alert: IstioEnvoyUpstreamReq503TooHigh
  annotations:
    summary: 'Envoy Percentage of HTTP 503 upstream responses is too high'
    description: "The amount of 503 upstream responses is higher than 1%. It is too high"
  expr: >
    rate(envoy_cluster_upstream_rq_503[1m])/rate(envoy_cluster_upstream_rq_completed[1m]) > 0.01
- alert: IstioEnvoyUpstreamReq200TooLow
  annotations:
    summary: 'Envoy Percentage of HTTP 200 upstream responses is too low'
    description: "The amount of 200 upstream responses is lower than 99.9%. It is too low"
  expr: >
    rate(envoy_cluster_upstream_rq_200[1m])/rate(envoy_cluster_upstream_rq_completed[1m]) < 0.999
- alert: IstioEnvoyClusterBindErrors
  annotations:
    summary: "Envoy cluster binding errors"
    description: "Error in binding cluster with {{ $labels.pod_name }} pod in {{ $labels.namespace }} namespace."
  expr: >
    envoy_cluster_bind_errors > 0
- alert: IstioEnvoyClusterDstHostInvalid
  annotations:
    summary: "Envoy cluster destination host invalid"
    description: "Envoy cluster destination host {{ $labels.pod_name }} in {{ $labels.namespace }} namespace invalid for 1 minute"
  expr: >
    envoy_cluster_original_dst_host_invalid > 0
  for: 1m
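The ratio rules above divide rates series by series, so a single low-traffic Envoy cluster can cross the 1% threshold after only a handful of failed requests. One possible variation sums both rates per namespace before dividing and requires the condition to hold for several minutes. This is a sketch only: the alert name, the 5m window, and the for duration are assumptions, and the namespace label is taken from the annotation templates in the reference rules.

```yaml
# Hypothetical variation on IstioEnvoyUpstreamReq503TooHigh, aggregated per namespace.
- alert: IstioEnvoyUpstreamReq503TooHighByNamespace   # illustrative name
  annotations:
    summary: "Envoy percentage of HTTP 503 upstream responses is too high"
    description: "More than 1% of upstream responses in namespace {{ $labels.namespace }} were 503"
  # Both sides are summed by the same label, so the division matches
  # one numerator series to one denominator series per namespace.
  expr: >
    sum by (namespace) (rate(envoy_cluster_upstream_rq_503[5m]))
    /
    sum by (namespace) (rate(envoy_cluster_upstream_rq_completed[5m]))
    > 0.01
  # Require the condition to persist, to avoid firing on short spikes.
  for: 5m
```

Aggregating trades detail for stability: the alert no longer identifies the individual pod, so the per-series rules remain useful for pinpointing the source.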