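Summary: This article explains how to install, configure, and use Prometheus and Grafana so that developers can quickly build a complete monitoring stack. It covers single-node deployment, multi-instance setups, and common practical scenarios.

With distributed systems and microservice architectures now widespread, traditional monitoring tools such as Zabbix and Nagios struggle with dynamically scaling targets and high-dimensional data aggregation. Prometheus, a graduated project of the CNCF (Cloud Native Computing Foundation), has become the de facto monitoring standard in the Kubernetes ecosystem thanks to its pull-based collection model, multi-dimensional data model, and the PromQL query language. Grafana complements it as the visualization layer: it supports more than 70 data sources and offers a rich set of dashboard templates and alert rule configuration. Together they form a closed loop from data collection to visualization.

Environment preparation: if SELinux blocks the services, switch it to permissive mode (setenforce 0), and configure the firewall to allow ports 9090 (Prometheus) and 3000 (Grafana). Required tools: wget, tar, systemd, and docker (for containerized deployments).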
# Download a stable release (2.47.2 is used as an example)
wget https://github.com/prometheus/prometheus/releases/download/v2.47.2/prometheus-2.47.2.linux-amd64.tar.gz
tar xvf prometheus-*.tar.gz
cd prometheus-*/   # trailing slash: match only the extracted directory, not the tarball

# Configure the systemd service (quoted 'EOF' keeps the backslashes literal)
cat <<'EOF' | sudo tee /etc/systemd/system/prometheus.service
[Unit]
Description=Prometheus Monitoring System
After=network.target

[Service]
Type=simple
User=prometheus
ExecStart=/usr/local/bin/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus/ \
  --web.console.templates=/etc/prometheus/consoles \
  --web.console.libraries=/etc/prometheus/console_libraries
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

# Create the service user and directories
sudo useradd --no-create-home --shell /bin/false prometheus
sudo mkdir -p /etc/prometheus /var/lib/prometheus
sudo cp prometheus promtool /usr/local/bin/
sudo cp prometheus.yml /etc/prometheus/
sudo cp -r consoles console_libraries /etc/prometheus/   # paths referenced by the unit file
sudo chown -R prometheus:prometheus /etc/prometheus /var/lib/prometheus

# Start the service
sudo systemctl daemon-reload
sudo systemctl enable --now prometheus
Core configuration example for /etc/prometheus/prometheus.yml:

global:
  scrape_interval: 15s      # global scrape interval
  evaluation_interval: 15s  # rule evaluation interval

scrape_configs:
  - job_name: 'node_exporter'
    static_configs:
      - targets: ['localhost:9100']  # Node Exporter target
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']  # self-monitoring
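Before restarting Prometheus with an edited configuration, it can help to validate the file first. The following is a minimal sketch, assuming the promtool binary from the release tarball is on the PATH. The helper name validate_prometheus_config and its exit codes are hypothetical choices for this example, not part of Prometheus.

```shell
# Hypothetical helper: validate a Prometheus config file with promtool.
# Exit codes (chosen for this sketch): 0 = valid, 1 = invalid, 2 = promtool not found.
validate_prometheus_config() {
    config_file="${1:-/etc/prometheus/prometheus.yml}"

    if ! command -v promtool >/dev/null 2>&1; then
        echo "promtool not found on PATH; skipping validation" >&2
        return 2
    fi

    # 'promtool check config' parses the file and the rule files it references.
    if promtool check config "$config_file"; then
        return 0
    fi
    return 1
}

# Example usage (commented out so the snippet has no side effects):
# validate_prometheus_config /etc/prometheus/prometheus.yml && sudo systemctl restart prometheus
```

Chaining the restart behind the check means a syntax error in the file leaves the running instance untouched.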
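# Download and install Grafana (Ubuntu example)
wget https://dl.grafana.com/oss/release/grafana_9.5.6_amd64.deb
sudo apt install ./grafana_*.deb

# Start the service
sudo systemctl enable --now grafana-server

Open http://<IP>:3000 in a browser. The default username and password are admin/admin, and Grafana asks for a new password at first login.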
Node Exporter collects host-level metrics (CPU, memory, disk, and so on):

wget https://github.com/prometheus/node_exporter/releases/download/v1.6.1/node_exporter-1.6.1.linux-amd64.tar.gz
tar xvf node_exporter-*.tar.gz
sudo cp node_exporter-*/node_exporter /usr/local/bin/

# Create the systemd service
cat <<'EOF' | sudo tee /etc/systemd/system/node_exporter.service
[Unit]
Description=Node Exporter
After=network.target

[Service]
User=node_exporter
ExecStart=/usr/local/bin/node_exporter
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

# Create the service user, then start the exporter
sudo useradd --no-create-home --shell /bin/false node_exporter
sudo systemctl enable --now node_exporter
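Add a node_exporter job (target localhost:9100) under scrape_configs in the Prometheus configuration, then restart the Prometheus service.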
# Download and install Alertmanager
wget https://github.com/prometheus/alertmanager/releases/download/v0.26.0/alertmanager-0.26.0.linux-amd64.tar.gz
tar xvf alertmanager-*.tar.gz
sudo cp alertmanager-*/alertmanager /usr/local/bin/

# Configure a systemd service (analogous to the Prometheus unit)
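Add rule_files to the Prometheus configuration. Prometheus also needs an alerting.alertmanagers entry that points at the Alertmanager address (port 9093 by default); without it, firing alerts are never delivered.

rule_files:
  - '/etc/prometheus/alert.rules.yml'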
Example rule file:

groups:
  - name: node.rules
    rules:
      - alert: NodeCPUUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is above 80% (current value: {{ $value }}%)"
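To make the alert expression concrete, the sketch below reproduces its arithmetic for a single instance using shell and awk. The counter values are invented for illustration and the helper name cpu_usage_pct is hypothetical. The real rate() function also handles counter resets and extrapolation, which this simplified version ignores.

```shell
# cpu_usage_pct IDLE_START IDLE_END WINDOW_SECONDS
# Approximates 100 - rate(idle_counter[window]) * 100 for one averaged series.
cpu_usage_pct() {
    awk -v idle_start="$1" -v idle_end="$2" -v window="$3" 'BEGIN {
        idle_rate = (idle_end - idle_start) / window   # idle seconds per second
        printf "%.2f\n", 100 - idle_rate * 100         # the rest is busy time
    }'
}

# Illustrative numbers: the idle counter advanced 270 s within a 300 s (5m) window.
cpu_usage_pct 1000 1270 300    # prints 10.00

# Compare a busier sample against the rule's 80% threshold.
usage=$(cpu_usage_pct 1000 1030 300)   # idle advanced only 30 s, so usage is 90.00
if awk -v u="$usage" 'BEGIN { exit !(u > 80) }'; then
    echo "condition true: usage ${usage}% > 80%"
fi
```

In the rule, the condition must hold continuously for the 10m given by `for` before the alert moves from pending to firing.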
Example /etc/alertmanager/alertmanager.yml:

global:
  resolve_timeout: 5m

route:
  receiver: email
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h

receivers:
  - name: email
    email_configs:
      - to: admin@example.com
        from: alert@example.com
        smarthost: smtp.example.com:587
        auth_username: "user"
        auth_password: "password"
After logging in to Grafana, go to Configuration → Data Sources → Add data source, select Prometheus, enter the URL (for example http://localhost:9090), and click Save & Test.
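Import a ready-made dashboard by its ID (for example 8919, a Node Exporter dashboard published on grafana.com).

A custom panel can use a PromQL query such as the following, which shows per-core user-mode CPU time as a percentage:

rate(node_cpu_seconds_total{mode="user"}[5m]) * 100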
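Create an alert from the panel's Alert tab by defining a threshold condition (for example, Query A > 0.8).

Federation lets an upper-level Prometheus aggregate data from lower-level instances: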
# Configuration on the upper-level Prometheus
scrape_configs:
  - job_name: 'federate'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="node_exporter"}'
    static_configs:
      - targets: ['<lower-level-Prometheus-IP>:9090']
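Before wiring up the federation job, it can be useful to confirm that the lower-level instance actually returns series for the selector. The sketch below assumes curl is installed. The helper name federate_probe is hypothetical, and the host in the usage comment is a placeholder.

```shell
# federate_probe HOST:PORT SELECTOR
# Requests /federate on a lower-level Prometheus and prints the matching series.
federate_probe() {
    # -G sends the data as a query string; --data-urlencode escapes braces and quotes.
    curl --silent --show-error --fail --max-time 5 -G \
        --data-urlencode "match[]=$2" \
        "http://$1/federate"
}

# Example usage (placeholder host, commented out so nothing runs on load):
# federate_probe 10.0.0.5:9090 '{job="node_exporter"}' | head -n 20
```

An empty response with a zero exit status usually means the selector matched nothing on that instance, while a non-zero status points to a connectivity or HTTP error.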
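Integrate object storage (such as S3 or MinIO) through a Thanos sidecar to keep historical data queryable. The sidecar runs as a separate process next to Prometheus and is not a section of prometheus.yml. Local retention is limited with the Prometheus flag --storage.tsdb.retention.time=72h, and the bucket settings live in a separate file passed to the sidecar:

# bucket.yml (example file name), used as:
#   thanos sidecar --tsdb.path=/var/lib/prometheus \
#     --prometheus.url=http://localhost:9090 --objstore.config-file=bucket.yml
type: S3
config:
  bucket: "prometheus-data"
  endpoint: "minio.example.com"
  access_key: "minio"
  secret_key: "minio123"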
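Common tuning and troubleshooting points:

Limit disk usage by adjusting --storage.tsdb.retention.time (default: 15 days).
If data must be removed manually, start Prometheus with --web.enable-admin-api to enable the TSDB admin endpoints.
When scrapes fail or time out, compare scrape_interval with the target's response time.
Precompute frequently used expressions with recording rules.
When Prometheus is served under a sub-path (for example behind a reverse proxy), set --web.external-url and --web.route-prefix to avoid path conflicts.

This article covered installing, configuring, and extending Prometheus and Grafana, from single-node deployment to federated setups with long-term storage. The examples walked through the key steps of metric collection, alert rule definition, and dashboard construction. Depending on the scenario, developers can deploy from binary packages or containers and combine components such as Thanos and Alertmanager into an enterprise-grade monitoring stack. It is worth reviewing PromQL query efficiency regularly and tuning the storage strategy as the monitored environment grows.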