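Overview: This article explains the core architecture, configuration, and tuning strategies for monitoring a K8s cluster with Prometheus. It covers service discovery, metric scraping, alerting rules, and related topics, and offers end-to-end guidance from deployment to tuning.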
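Kubernetes (K8s) is the de facto standard for container orchestration, and its dynamic, distributed nature places higher demands on a monitoring system. Traditional tools such as Zabbix and Nagios struggle in a K8s environment, where Pods are created and destroyed frequently and services scale dynamically. Prometheus, by contrast, has become the preferred choice for K8s monitoring, thanks to strengths such as the following: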
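Through kubernetes_sd_configs, Prometheus discovers K8s resources automatically. Several discovery roles are supported, including node, service, pod, endpoints, and ingress; the configuration below uses the pod role: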
scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only Pods that carry a specific annotation (e.g. prometheus.io/scrape=true)
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
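To make the keep action concrete, here is a minimal Go sketch of how such a rule filters targets. It is a simplified model, not Prometheus's own implementation: the function name keepTarget is hypothetical, and the sketch assumes the default ";" separator and a fully anchored regular expression.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// keepTarget is a simplified model of a relabel rule with action "keep":
// the values of the source labels are joined with ";" and the target
// survives only if the fully anchored regex matches the joined value.
func keepTarget(labels map[string]string, sourceLabels []string, pattern string) bool {
	values := make([]string, 0, len(sourceLabels))
	for _, name := range sourceLabels {
		values = append(values, labels[name]) // a missing label reads as ""
	}
	re := regexp.MustCompile("^(?:" + pattern + ")$")
	return re.MatchString(strings.Join(values, ";"))
}

func main() {
	const scrapeLabel = "__meta_kubernetes_pod_annotation_prometheus_io_scrape"
	annotated := map[string]string{scrapeLabel: "true"}
	plain := map[string]string{} // Pod without the annotation

	fmt.Println(keepTarget(annotated, []string{scrapeLabel}, "true")) // true
	fmt.Println(keepTarget(plain, []string{scrapeLabel}, "true"))     // false
}
```

Because the pattern is anchored, an annotation value such as "trueish" would not match, so only Pods annotated exactly with prometheus.io/scrape=true remain as scrape targets.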
# CPU utilization (%) of every node
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Pod memory usage as a percentage of its memory limit
container_memory_working_set_bytes{pod=~"nginx-.*"} / container_spec_memory_limit_bytes{pod=~"nginx-.*"} * 100
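The arithmetic behind the CPU query can be worked through by hand. The Go sketch below mirrors it for a single node under simplifying assumptions: the function name cpuUsagePercent and the sample values are hypothetical, at least one core is assumed, and the real rate() function additionally handles counter resets and extrapolation, which are ignored here.

```go
package main

import "fmt"

// cpuUsagePercent mirrors 100 - avg(rate(idle[window])) * 100 for one node.
// idleStart and idleEnd hold the idle-seconds counter of each CPU core at the
// start and end of the window; windowSeconds is the window length.
func cpuUsagePercent(idleStart, idleEnd []float64, windowSeconds float64) float64 {
	var sum float64
	for i := range idleStart {
		// Per-core idle rate: idle seconds gained per second of wall time.
		sum += (idleEnd[i] - idleStart[i]) / windowSeconds
	}
	avgIdle := sum / float64(len(idleStart))
	return 100 - avgIdle*100
}

func main() {
	// Hypothetical two-core node observed over a 300 s (5m) window.
	start := []float64{1000, 2000}
	end := []float64{1240, 2180} // the cores were idle for 240 s and 180 s
	fmt.Printf("%.1f%%\n", cpuUsagePercent(start, end, 300)) // prints 30.0%
}
```

The two cores were idle 80% and 60% of the time, so the average idle fraction is 70% and the node's utilization is 30%.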
In Alertmanager, route and receivers together define the alert routing policy:
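route:
  group_by: ['alertname']
  receiver: 'email-team'        # default receiver
  routes:
    # Critical alerts go to PagerDuty instead of the default receiver
    - match:
        severity: 'critical'
      receiver: 'pagerduty'
receivers:
  - name: 'email-team'
    email_configs:
      - to: 'ops@example.com'
  # Every receiver referenced by a route must be defined.
  # Placeholder: add pagerduty_configs with your own integration key.
  - name: 'pagerduty'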
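The way a routing tree selects a receiver can be sketched in a few lines of Go. This is a pared-down model for illustration only: the route type and pickReceiver function are hypothetical, only equality matchers are modeled, and Alertmanager features such as continue, regex matchers, and receiver inheritance are left out.

```go
package main

import "fmt"

// route is a pared-down model of an Alertmanager route: label matchers,
// a receiver, and child routes.
type route struct {
	match    map[string]string
	receiver string
	routes   []route
}

// matches reports whether every matcher equals the alert's label value.
func (r route) matches(labels map[string]string) bool {
	for k, v := range r.match {
		if labels[k] != v {
			return false
		}
	}
	return true
}

// pickReceiver walks the tree depth-first; the first matching child wins,
// and the parent's receiver is the fallback when no child matches.
func pickReceiver(r route, labels map[string]string) string {
	for _, child := range r.routes {
		if child.matches(labels) {
			return pickReceiver(child, labels)
		}
	}
	return r.receiver
}

func main() {
	root := route{
		receiver: "email-team",
		routes: []route{
			{match: map[string]string{"severity": "critical"}, receiver: "pagerduty"},
		},
	}
	fmt.Println(pickReceiver(root, map[string]string{"severity": "critical"})) // pagerduty
	fmt.Println(pickReceiver(root, map[string]string{"severity": "warning"}))  // email-team
}
```

With the configuration shown earlier, a critical alert reaches pagerduty, while everything else falls back to email-team.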
groups:
  - name: k8s-cluster.rules
    rules:
      - alert: HighCPUUsage
        expr: (100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)) > 85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Node {{ $labels.instance }} CPU usage is high"
# Add the Prometheus community Helm repository
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Deploy kube-prometheus-stack, which includes Prometheus Operator (recommended for production)
helm install prometheus prometheus-community/kube-prometheus-stack \
  --set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false
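Set --storage.tsdb.retention.time=30d to retain data for 30 days.
Tune scrape_interval per job (for example, 15s for core services and 60s for secondary services).
If a Pod is not being scraped, check its prometheus.io/scrape annotation and its exposed port (the endpoint must be served over HTTP at the /metrics path).
Avoid querying on high-cardinality labels such as __name__; prefer aggregating metrics with by.
Use federation to scrape metrics from edge clusters.
To expose custom application metrics, the promhttp library provides a quick implementation: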
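package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// requestsTotal counts the HTTP requests served by the application.
var requestsTotal = prometheus.NewCounter(prometheus.CounterOpts{
	Name: "app_requests_total",
	Help: "Total HTTP requests",
})

func init() {
	prometheus.MustRegister(requestsTotal)
}

func main() {
	// Increment the counter on every application request.
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		requestsTotal.Inc()
		w.Write([]byte("ok\n"))
	})
	// Expose all registered metrics for Prometheus to scrape.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8080", nil))
}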
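For orientation, a /metrics endpoint like the one above answers with plain text in the Prometheus exposition format. The standard-library-only sketch below renders a single unlabeled counter in that format; renderCounter is a hypothetical helper for illustration, it skips the escaping rules for help text, and in practice promhttp produces this output for you.

```go
package main

import (
	"fmt"
	"strings"
)

// renderCounter formats one unlabeled counter in the Prometheus text
// exposition format: a HELP line, a TYPE line, then the sample itself.
func renderCounter(name, help string, value float64) string {
	var b strings.Builder
	fmt.Fprintf(&b, "# HELP %s %s\n", name, help)
	fmt.Fprintf(&b, "# TYPE %s counter\n", name)
	fmt.Fprintf(&b, "%s %g\n", name, value)
	return b.String()
}

func main() {
	// After three requests, the scrape body for the counter would read:
	//   # HELP app_requests_total Total HTTP requests
	//   # TYPE app_requests_total counter
	//   app_requests_total 3
	fmt.Print(renderCounter("app_requests_total", "Total HTTP requests", 3))
}
```

Seeing the raw format makes it easier to debug a target with curl before wiring it into Prometheus.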
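Prometheus compacts TSDB blocks automatically in the background, so storage usage is usually reduced by shortening retention or lowering series cardinality; promtool tsdb analyze helps identify the metrics and labels that contribute the most series.
The core of monitoring a K8s cluster with Prometheus lies in automated discovery, efficient collection, and well-tuned alerting. For small and medium-sized clusters, deploying the Operator with Helm is sufficient; large-scale environments need Thanos for horizontal scaling. Review alerting rules regularly to avoid alert fatigue, and build business dashboards in Grafana so that monitoring data can inform decisions.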
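In practice, a well-configured Prometheus gives developers a complete view of K8s cluster health and a solid foundation for running containerized applications reliably.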