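Summary: This article explains how to set up, configure, and use the Prometheus monitoring system. It covers single-node deployment, clustering options, data collection, alerting rules, visualization, and fixes for common problems, so that readers can quickly pick up Prometheus's core features and practical techniques.
Prometheus is a monitoring and alerting system open-sourced by SoundCloud. Since its creation in 2012, its multi-dimensional data model, flexible query language (PromQL), and efficient storage have made it the de facto standard monitoring tool of the cloud-native era. Its design centers on multi-dimensional data collection and real-time alerting, which suits dynamic environments such as Kubernetes particularly well.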
    # Download and unpack
    wget https://github.com/prometheus/prometheus/releases/download/v2.47.2/prometheus-2.47.2.linux-amd64.tar.gz
    tar -xzf prometheus-*.tar.gz
    cd prometheus-*/   # trailing slash: match the directory, not the tarball

    # Write the configuration file
    cat > prometheus.yml <<EOF
    global:
      scrape_interval: 15s
    scrape_configs:
      - job_name: 'prometheus'
        static_configs:
          - targets: ['localhost:9090']
    EOF

    # Start the server
    ./prometheus --config.file=prometheus.yml
Alternatively, run it with Docker:

    docker run -d --name prometheus \
      -p 9090:9090 \
      -v /path/to/prometheus.yml:/etc/prometheus/prometheus.yml \
      prom/prometheus:v2.47.2
For production environments, the following architecture is recommended:
Example: Thanos integration

    # prometheus.yml: configure remote write
    remote_write:
      - url: "http://thanos-receiver:19291/api/v1/receive"
Static targets, for example Node Exporter hosts:

    scrape_configs:
      - job_name: 'node-exporter'
        static_configs:
          - targets: ['192.168.1.100:9100', '192.168.1.101:9100']
            labels:
              cluster: 'prod'
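Kubernetes pod discovery, keeping only pods whose prometheus.io/scrape annotation is "true":

    scrape_configs:
      - job_name: 'kubernetes-pods'
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
            action: keep
            regex: true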
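The keep rule above is easier to reason about once its matching semantics are spelled out. The following Go sketch is not Prometheus's own code; it assumes the documented behavior that source label values are joined with the default ";" separator and that the regex is anchored at both ends. The function name keepTarget and the sample pods are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// keepTarget reports whether a target survives a relabel rule with
// action "keep". The values of sourceLabels are joined with ";" (the
// default separator) and must match the regex in full.
func keepTarget(labels map[string]string, sourceLabels []string, regex string) bool {
	values := make([]string, 0, len(sourceLabels))
	for _, name := range sourceLabels {
		values = append(values, labels[name]) // a missing label yields ""
	}
	re := regexp.MustCompile("^(?:" + regex + ")$")
	return re.MatchString(strings.Join(values, ";"))
}

func main() {
	const scrape = "__meta_kubernetes_pod_annotation_prometheus_io_scrape"
	pods := []map[string]string{
		{scrape: "true"},  // annotated for scraping
		{scrape: "false"}, // explicitly opted out
		{},                // no annotation at all
	}
	for _, pod := range pods {
		fmt.Println(keepTarget(pod, []string{scrape}, "true"))
	}
}
```

Only the first pod is kept. Because the match is anchored, a value such as "truely" would also be dropped, and a pod without the annotation is treated as an empty string.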
Reference the alerting rule file in prometheus.yml:

    rule_files:
      - 'alert.rules.yml'

Example alerting rule:

    groups:
      - name: node-alerts
        rules:
          - alert: HighCPUUsage
            expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
            for: 10m
            labels:
              severity: critical
            annotations:
              summary: "High CPU usage on {{ $labels.instance }}"
              description: "CPU usage is above 80% for more than 10 minutes."
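The arithmetic behind the HighCPUUsage expression can be worked through by hand. The Go sketch below uses made-up counter values for two cores over a 300-second window and a simplified rate (last minus first, divided by the window); the real rate() function also handles counter resets and extrapolation, so this only illustrates the formula. The names idleRate and cpuUsagePercent are hypothetical.

```go
package main

import "fmt"

// idleRate approximates rate() for one idle-mode counter: the per-second
// increase between two samples taken windowSeconds apart.
func idleRate(first, last, windowSeconds float64) float64 {
	return (last - first) / windowSeconds
}

// cpuUsagePercent mirrors 100 - avg(rate(idle)) * 100 for one instance,
// given the idle rate of each core. With no cores it returns 0, whereas
// PromQL would simply return no series.
func cpuUsagePercent(idleRates []float64) float64 {
	if len(idleRates) == 0 {
		return 0
	}
	sum := 0.0
	for _, r := range idleRates {
		sum += r
	}
	return 100 - (sum/float64(len(idleRates)))*100
}

func main() {
	// Hypothetical node_cpu_seconds_total{mode="idle"} samples, 300 s apart.
	rates := []float64{
		idleRate(1000, 1030, 300), // core 0 was idle 10% of the time
		idleRate(2000, 2060, 300), // core 1 was idle 20% of the time
	}
	usage := cpuUsagePercent(rates)
	fmt.Printf("usage=%.1f%% above_threshold=%v\n", usage, usage > 80)
}
```

With these numbers the average idle fraction is 0.15, so usage is about 85% and the condition holds. The alert would still only fire after the condition has stayed true for the full 10 minutes set by for.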
Alertmanager configuration with an email receiver:

    # alertmanager.yml
    route:
      group_by: ['alertname']
      receiver: 'email'
    receivers:
      - name: 'email'
        email_configs:
          - to: 'admin@example.com'
            from: 'alert@example.com'
            smarthost: 'smtp.example.com:587'
            auth_username: 'user'
            auth_password: 'pass'
Run Alertmanager with Docker:

    docker run -d --name alertmanager \
      -p 9093:9093 \
      -v /path/to/alertmanager.yml:/etc/alertmanager/alertmanager.yml \
      prom/alertmanager:v0.26.0
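One way to check that the email route delivers, without waiting for a real incident, is to push a synthetic alert into Alertmanager's v2 API (POST /api/v2/alerts). The Go sketch below only builds and prints the JSON body. The label values, the annotation text, and the localhost:9093 address are assumptions for illustration, and the struct covers only a subset of the alert object.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// alert is the subset of the Alertmanager v2 alert object used here.
type alert struct {
	Labels      map[string]string `json:"labels"`
	Annotations map[string]string `json:"annotations,omitempty"`
	StartsAt    time.Time         `json:"startsAt"`
}

// testAlertPayload builds the JSON body for one synthetic HighCPUUsage
// alert that starts at the given Unix time (seconds).
func testAlertPayload(startUnix int64) ([]byte, error) {
	alerts := []alert{{
		Labels: map[string]string{
			"alertname": "HighCPUUsage",
			"severity":  "critical",
			"instance":  "test-instance", // hypothetical instance name
		},
		Annotations: map[string]string{"summary": "Test alert for the email route"},
		StartsAt:    time.Unix(startUnix, 0).UTC(),
	}}
	return json.Marshal(alerts)
}

func main() {
	body, err := testAlertPayload(time.Now().Unix())
	if err != nil {
		panic(err)
	}
	// Print the payload. To deliver it, POST it with Content-Type
	// application/json to http://localhost:9093/api/v2/alerts.
	fmt.Println(string(body))
}
```

The API expects a JSON array, so even a single alert is wrapped in a slice. Printing instead of sending keeps the sketch free of side effects.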
Add a Prometheus data source:
- Open Configuration > Data Sources.
- Set the URL (for example, http://prometheus:9090).
Import the official dashboard:
- Dashboard ID 1860 (Node Exporter Full)
Example PromQL query:

    sum(rate(node_cpu_seconds_total{mode="user"}[5m])) by (instance)
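The same kind of query can be run outside Grafana through the Prometheus HTTP API (GET /api/v1/query?query=...), which returns JSON. The Go sketch below parses a hand-written sample in that response shape rather than calling a live server; the sample value and the helper name parseInstantVector are made up for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
)

// queryResponse models the parts of an instant-query response used here.
type queryResponse struct {
	Status string `json:"status"`
	Data   struct {
		ResultType string `json:"resultType"`
		Result     []struct {
			Metric map[string]string `json:"metric"`
			Value  [2]interface{}    `json:"value"` // [unix time, "value as string"]
		} `json:"result"`
	} `json:"data"`
}

// parseInstantVector extracts instance -> value from an instant-query body.
func parseInstantVector(body []byte) (map[string]float64, error) {
	var qr queryResponse
	if err := json.Unmarshal(body, &qr); err != nil {
		return nil, err
	}
	if qr.Status != "success" || qr.Data.ResultType != "vector" {
		return nil, fmt.Errorf("unexpected response: status=%q type=%q", qr.Status, qr.Data.ResultType)
	}
	out := make(map[string]float64, len(qr.Data.Result))
	for _, sample := range qr.Data.Result {
		raw, ok := sample.Value[1].(string)
		if !ok {
			return nil, fmt.Errorf("sample value is not a string")
		}
		v, err := strconv.ParseFloat(raw, 64)
		if err != nil {
			return nil, err
		}
		out[sample.Metric["instance"]] = v
	}
	return out, nil
}

func main() {
	// Hand-written sample in the shape of an instant-query response.
	sample := []byte(`{"status":"success","data":{"resultType":"vector","result":[
		{"metric":{"instance":"192.168.1.100:9100"},"value":[1700000000,"0.42"]}]}}`)
	values, err := parseInstantVector(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println(values)
}
```

Sample values arrive as strings so that special values such as NaN and +Inf survive JSON encoding, which is why the sketch converts them with strconv.ParseFloat.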
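- Use label_values(job) to build a dynamic drop-down menu.
- Value mappings.
- Use the instance and job labels for interactive analysis.
- Compression (--storage.tsdb.wal-compression, enabled by default since Prometheus 2.20)
- Scrape interval (scrape_interval: 30s)
- Monitor Prometheus's own metrics through its /metrics endpoint; --web.enable-admin-api is only needed for the administrative API (snapshots, series deletion).
- up{job="<job-name>"} == 0
- process_resident_memory_bytes
- Configure certificates with --web.config.file.
Expose Node Exporter-style metrics from a Go program: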
    package main

    import (
        "log"
        "net/http"

        "github.com/prometheus/client_golang/prometheus"
        "github.com/prometheus/client_golang/prometheus/promhttp"
    )

    // customMetric is a gauge holding an example value.
    var customMetric = prometheus.NewGauge(prometheus.GaugeOpts{
        Name: "custom_metric_value",
        Help: "Example of a custom metric",
    })

    func init() {
        prometheus.MustRegister(customMetric)
        customMetric.Set(42) // set the initial value
    }

    func main() {
        // Serve all registered metrics at /metrics and exit if the listener fails.
        http.Handle("/metrics", promhttp.Handler())
        log.Fatal(http.ListenAndServe(":8080", nil))
    }
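To see what Prometheus receives when it scrapes such an endpoint, it helps to look at the text exposition format itself. The Go sketch below renders the same gauge by hand using only the standard library. It is a simplified illustration, not a substitute for client_golang: it ignores labels and HELP-text escaping, and the real promhttp handler also emits Go runtime and process metrics. The function name renderGauge is hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// renderGauge formats one unlabelled gauge in the Prometheus text
// exposition format: a HELP line, a TYPE line, then the sample.
func renderGauge(name, help string, value float64) string {
	var b strings.Builder
	fmt.Fprintf(&b, "# HELP %s %s\n", name, help)
	fmt.Fprintf(&b, "# TYPE %s gauge\n", name)
	fmt.Fprintf(&b, "%s %g\n", name, value)
	return b.String()
}

func main() {
	fmt.Print(renderGauge("custom_metric_value", "Example of a custom metric", 42))
}
```

The last line of the output, custom_metric_value 42, is the sample that ends up in the time series database. The two comment lines carry metadata only.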
Collect logs with Promtail and correlate them with metrics:

    # promtail-config.yml
    scrape_configs:
      - job_name: system
        static_configs:
          - targets: [localhost]
            labels:
              job: varlogs
              __path__: /var/log/*log
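Layered monitoring design:
Alerting strategy principles:
- for duration
Capacity planning:
- retention days × number of metrics × number of samples
With the steps above, you have covered the complete Prometheus workflow, from deployment to advanced use. Keep refining your monitoring strategy against real workloads, and follow community updates such as the new features in Prometheus 3.0.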
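As a supplement to the capacity-planning product above (retention days × number of metrics × number of samples), the following Go sketch turns it into a rough disk estimate. The workload figures are hypothetical, and the 2 bytes per sample is an assumption at the upper end of the 1 to 2 bytes that the Prometheus storage documentation cites for compressed samples. Treat the result as an order-of-magnitude guide only.

```go
package main

import "fmt"

// estimateDiskBytes multiplies retention days, active series, samples per
// series per day (derived from the scrape interval), and bytes per sample.
// The scrape interval must be greater than zero.
func estimateDiskBytes(retentionDays, activeSeries, scrapeIntervalSeconds, bytesPerSample float64) float64 {
	samplesPerSeriesPerDay := 86400 / scrapeIntervalSeconds
	return retentionDays * activeSeries * samplesPerSeriesPerDay * bytesPerSample
}

func main() {
	// Hypothetical workload: 15 days, 100,000 series, 15 s interval, 2 bytes/sample.
	total := estimateDiskBytes(15, 100000, 15, 2)
	fmt.Printf("%.1f GiB\n", total/(1<<30))
}
```

For this workload the estimate comes to roughly 16 GiB. Doubling the scrape interval to 30 seconds halves it, which is why the scrape interval is a common tuning lever.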