Chapter 13 · Monitoring and Observability
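13.1 The Three Pillars of Observability
| Pillar | Tools | Description |
|---|---|---|
| Metrics | Prometheus + Grafana | Numeric time-series data |
| Logs | Loki + Grafana (see Chapter 12) | Discrete event records |
| Traces | Jaeger / Tempo | Request path tracing |
This chapter focuses on metrics: Prometheus collects them, and Grafana displays them on dashboards.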
13.2 Monitoring Architecture
┌──────────────────────────────────────────────────────────────┐
│ Docker host                                                  │
│                                                              │
│    ┌────────┐   ┌────────┐   ┌────────┐                      │
│    │ app-1  │   │ app-2  │   │ app-3  │                      │
│    │/metrics│   │/metrics│   │/metrics│  ← exposed by apps   │
│    └───┬────┘   └───┬────┘   └───┬────┘                      │
│        │            │            │                           │
│        └────────────┼────────────┘                           │
│                     │                                        │
│             ┌───────▼────────┐                               │
│             │   Prometheus   │  ← pulls metrics periodically │
│             │     (TSDB)     │                               │
│             └───────┬────────┘                               │
│                     │                                        │
│       ┌─────────────┼─────────────┐                          │
│       │             │             │                          │
│  ┌────▼─────┐  ┌────▼─────┐  ┌────▼─────┐                    │
│  │ cAdvisor │  │   Node   │  │  Redis   │                    │
│  │containers│  │ Exporter │  │ Exporter │                    │
│  └──────────┘  └────┬─────┘  └──────────┘                    │
│                     │                                        │
│             ┌───────▼────────┐                               │
│             │    Grafana     │  ← visualization dashboards   │
│             │  (Dashboard)   │                               │
│             └────────────────┘                               │
└──────────────────────────────────────────────────────────────┘
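13.3 cAdvisor
cAdvisor (Container Advisor) is Google's open-source container resource monitor. It automatically collects CPU, memory, network, and disk metrics for each container.
Basic configuration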
services:
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.49.1
    ports:
      - "8080:8080"
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
      - /dev/disk/:/dev/disk:ro
    privileged: true
    devices:
      - /dev/kmsg
    restart: unless-stopped
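⚠️ Security note: this configuration runs cAdvisor with privileged: true so that it can read host information. Evaluate the security risk before using it in production.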
Metrics exposed by cAdvisor
| Category | Example metrics |
|---|---|
| CPU | container_cpu_usage_seconds_total |
| Memory | container_memory_usage_bytes, container_memory_working_set_bytes |
| Network | container_network_receive_bytes_total, container_network_transmit_bytes_total |
| Disk | container_fs_usage_bytes, container_fs_reads_total |
| Tasks | container_tasks_state |
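To make the table more concrete, here is a minimal sketch of reading one line of the text format that cAdvisor serves on /metrics. The sample line and its value are invented for illustration, and `parse_sample` is a hypothetical helper rather than part of any library; it only handles simple `name{labels} value` lines without escaped quotes or timestamps.

```python
import re

# One simple sample line: metric_name{label="value",...} value
# Simplification: no escaped quotes inside label values, no timestamps.
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_sample(line):
    """Parse one exposition-format line into (name, labels, value)."""
    match = SAMPLE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a sample line: {line!r}")
    labels = dict(LABEL_RE.findall(match.group("labels") or ""))
    return match.group("name"), labels, float(match.group("value"))


# Invented sample, shaped like a cAdvisor memory metric.
line = 'container_memory_working_set_bytes{name="app-1",image="myapp:latest"} 52428800'
name, labels, value = parse_sample(line)
print(name, labels["name"], value / 1024 / 1024, "MiB")
```

In practice Prometheus does this parsing itself; the sketch is only meant to show that each metric is a name, a set of labels, and a number.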
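13.4 Prometheus
Prometheus is a CNCF graduated project that collects metrics using a pull model: it scrapes each target's HTTP metrics endpoint at a fixed interval.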
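The pull model can be sketched with the Python standard library alone. In this toy setup the "application" only keeps a counter and serves it on /metrics, while a `scrape` function plays the role of Prometheus. The metric name `demo_requests_total`, the handler class, and the helper are all made up for this illustration.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

STATE = {"requests": 0}  # the application's own in-memory counter


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/metrics":
            # The application only exposes its current state; it never pushes.
            body = f"demo_requests_total {STATE['requests']}\n".encode()
            content_type = "text/plain; version=0.0.4"
        else:
            STATE["requests"] += 1
            body = b"hello\n"
            content_type = "text/plain"
        self.send_response(200)
        self.send_header("Content-Type", content_type)
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


def scrape(url):
    """What the collector does on every scrape interval: a plain HTTP GET."""
    with urllib.request.urlopen(url, timeout=5) as response:
        return response.read().decode()


server = HTTPServer(("127.0.0.1", 0), MetricsHandler)  # port 0 picks a free port
threading.Thread(target=server.serve_forever, daemon=True).start()
base = f"http://127.0.0.1:{server.server_port}"

scrape(base + "/")                    # simulate one user request
scraped = scrape(base + "/metrics")   # the collector pulls the metrics
print(scraped.strip())

server.shutdown()
server.server_close()
```

Because the collector initiates every request, a target that stops answering is itself a signal, which is one reason Prometheus records an `up` series per target.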
Basic configuration
services:
  prometheus:
    image: prom/prometheus:v2.52.0
    ports:
      - "9090:9090"
    volumes:
      - prometheus-data:/prometheus
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=30d'
      - '--web.enable-lifecycle'
    restart: unless-stopped
The prometheus.yml configuration file
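# prometheus.yml
global:
  scrape_interval: 15s      # global scrape interval
  evaluation_interval: 15s  # rule evaluation interval

# Alerting rule files
rule_files:
  - "rules/*.yml"

# Where firing alerts are sent (assumes an Alertmanager service named alertmanager)
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

# Scrape configuration
scrape_configs:
  # Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # cAdvisor: container metrics
  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']
    # Drop metrics that are not needed. container_spec_*, container_start_time_seconds,
    # and container_last_seen are kept because alert rules commonly depend on them.
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: 'container_(cpu|memory|network|fs|spec)_.*|container_(start_time_seconds|last_seen)'
        action: keep

  # Node Exporter: host metrics
  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']

  # Custom application metrics
  - job_name: 'app'
    static_configs:
      - targets: ['app:3000']
    metrics_path: '/metrics'
    scrape_interval: 10s  # per-job scrape interval

  # Docker service discovery
  # (Prometheus needs read access to the Docker socket for this job)
  - job_name: 'docker'
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
        refresh_interval: 10s
    relabel_configs:
      # Only scrape containers labeled prometheus.scrape=true
      - source_labels: [__meta_docker_container_label_prometheus_scrape]
        regex: 'true'
        action: keep
      # Rewrite the target address to <container IP>:<prometheus.port label>
      - source_labels: [__address__, __meta_docker_container_label_prometheus_port]
        regex: '([^:]+)(?::\d+)?;(\d+)'
        target_label: __address__
        replacement: '$1:$2'
      # Use the prometheus.path label as the metrics path when it is set
      - source_labels: [__meta_docker_container_label_prometheus_path]
        regex: '(.+)'
        target_label: __metrics_path__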
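Relabeling is easier to follow when simulated. The sketch below is a simplified Python model of the `keep` and `replace` actions: source label values are joined with `;`, the regex must match the whole joined string, and `replace` writes the expanded replacement into the target label. The `relabel` function and the rule dictionaries are hypothetical, and the address rule shown here (host from the discovered address, port from the `prometheus.port` label) is one assumed way to build a scrapeable `host:port` address.

```python
import re


def relabel(labels, rules):
    """Apply simplified keep/replace rules; return None if the target is dropped."""
    labels = dict(labels)
    for rule in rules:
        # Source label values are joined with ';' before matching.
        joined = ";".join(labels.get(name, "") for name in rule["source_labels"])
        # Prometheus anchors the regex at both ends, hence fullmatch.
        match = re.fullmatch(rule["regex"], joined)
        if rule["action"] == "keep":
            if match is None:
                return None
        elif rule["action"] == "replace" and match is not None:
            labels[rule["target_label"]] = match.expand(rule["replacement"])
    return labels


RULES = [
    {"action": "keep",
     "source_labels": ["__meta_docker_container_label_prometheus_scrape"],
     "regex": r"true"},
    {"action": "replace",
     "source_labels": ["__address__", "__meta_docker_container_label_prometheus_port"],
     "regex": r"([^:]+)(?::\d+)?;(\d+)",
     "target_label": "__address__",
     "replacement": r"\1:\2"},  # Prometheus itself would write this as '$1:$2'
]

discovered = {
    "__address__": "172.18.0.5:8080",
    "__meta_docker_container_label_prometheus_scrape": "true",
    "__meta_docker_container_label_prometheus_port": "3000",
}
print(relabel(discovered, RULES)["__address__"])
print(relabel({"__address__": "172.18.0.6:80"}, RULES))
```

The first target ends up at `172.18.0.5:3000`; the second has no `prometheus.scrape` label, so the `keep` rule drops it and the function returns `None`.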
13.5 Node Exporter
Node Exporter collects host-level metrics (CPU, memory, disk, network).
services:
  node-exporter:
    image: prom/node-exporter:v1.8.0
    ports:
      - "9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--path.rootfs=/rootfs'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    restart: unless-stopped
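The `mount-points-exclude` pattern is worth a quick check, because the `$$` is easy to misread: Compose treats `$$` as an escaped `$`, so Node Exporter receives a single end-of-string anchor. The sketch below applies the resulting regex to a few mount points; the list of mount points and the `is_excluded` helper are made up for this illustration.

```python
import re

# Value as written in the Compose file; Compose turns '$$' into a literal '$'.
compose_value = "^/(sys|proc|dev|host|etc)($$|/)"
pattern = re.compile(compose_value.replace("$$", "$"))


def is_excluded(mount_point):
    """True if the filesystem collector would skip this mount point."""
    return pattern.search(mount_point) is not None


for mount_point in ["/", "/sys", "/proc/sys/fs", "/etc/hosts", "/home", "/device"]:
    print(f"{mount_point:14} excluded={is_excluded(mount_point)}")
```

The `($|/)` group is what keeps a path such as `/device` from being excluded just because it starts with `/dev`.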
13.6 Grafana
Grafana is a widely used open-source visualization tool that supports many data sources.
Basic configuration
services:
  grafana:
    image: grafana/grafana:11.0.0
    ports:
      - "3000:3000"
    environment:
      GF_SECURITY_ADMIN_USER: admin
      GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_PASSWORD:-admin}
      GF_USERS_ALLOW_SIGN_UP: "false"
    volumes:
      - grafana-data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning:ro
    restart: unless-stopped
Provisioning data sources automatically
# grafana/provisioning/datasources/datasources.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: false
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
    editable: false
Provisioning dashboards automatically
# grafana/provisioning/dashboards/dashboards.yml
apiVersion: 1
providers:
  - name: 'default'
    orgId: 1
    folder: ''
    type: file
    disableDeletion: false
    editable: true
    options:
      path: /etc/grafana/provisioning/dashboards
      foldersFromFilesStructure: true
Common community dashboard IDs
| Dashboard | ID | Description |
|---|---|---|
| Docker container monitoring | 893 | cAdvisor overview |
| Node Exporter Full | 1860 | Host overview |
| Prometheus 2.0 Overview | 3662 | Prometheus itself |
| Redis dashboard | 763 | Redis monitoring |
| PostgreSQL dashboard | 9628 | Database monitoring |
| Nginx dashboard | 12708 | Web server monitoring |
# To import a dashboard in Grafana:
# 1. Open http://localhost:3000
# 2. In the left-hand menu, go to Dashboards → Import
# 3. Enter the dashboard ID → Load
# 4. Select the Prometheus data source → Import
13.7 The Complete Monitoring Stack
# compose.monitoring.yaml
services:
  # ===== Monitoring components =====
  prometheus:
    image: prom/prometheus:v2.52.0
    ports:
      - "9090:9090"
    volumes:
      - prometheus-data:/prometheus
      - ./monitoring/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./monitoring/rules:/etc/prometheus/rules:ro
      # Needed only for docker_sd_configs; the Prometheus user must be able to read the socket
      - /var/run/docker.sock:/var/run/docker.sock:ro
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=30d'
    restart: unless-stopped
    networks:
      - monitoring

  grafana:
    image: grafana/grafana:11.0.0
    ports:
      - "3000:3000"
    environment:
      GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_PASSWORD:-admin}
    volumes:
      - grafana-data:/var/lib/grafana
      - ./monitoring/grafana/provisioning:/etc/grafana/provisioning:ro
    depends_on:
      - prometheus
    restart: unless-stopped
    networks:
      - monitoring

  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.49.1
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
    privileged: true
    devices:
      - /dev/kmsg
    restart: unless-stopped
    networks:
      - monitoring

  node-exporter:
    image: prom/node-exporter:v1.8.0
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--path.rootfs=/rootfs'
    restart: unless-stopped
    networks:
      - monitoring

  alertmanager:
    image: prom/alertmanager:v0.27.0
    ports:
      - "9093:9093"
    volumes:
      - ./monitoring/alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
    restart: unless-stopped
    networks:
      - monitoring

  # ===== Application services (with monitoring labels) =====
  app:
    image: myapp:latest
    labels:
      prometheus.scrape: "true"
      prometheus.port: "3000"
      prometheus.path: "/metrics"
    networks:
      - monitoring
      - app-net

networks:
  monitoring:
  app-net:

volumes:
  prometheus-data:
  grafana-data:
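Once the stack is up, a quick way to confirm that every scrape target is reachable is Prometheus's `/api/v1/targets` HTTP endpoint. The following is a minimal sketch: `unhealthy_targets` and `fetch_targets` are hypothetical helper names, the sample payload is hand-written in the shape of that endpoint's response, and `http://localhost:9090` assumes the port mapping shown above.

```python
import json
import urllib.request


def unhealthy_targets(payload):
    """Return (job, scrapeUrl, lastError) for every active target that is not 'up'."""
    targets = payload.get("data", {}).get("activeTargets", [])
    return [
        (t.get("labels", {}).get("job", "?"), t.get("scrapeUrl", "?"), t.get("lastError", ""))
        for t in targets
        if t.get("health") != "up"
    ]


def fetch_targets(base_url="http://localhost:9090"):
    """Query a running Prometheus; only works while the stack is up."""
    with urllib.request.urlopen(f"{base_url}/api/v1/targets", timeout=5) as response:
        return json.load(response)


# Hand-written sample in the shape of an /api/v1/targets response.
sample = {
    "status": "success",
    "data": {"activeTargets": [
        {"labels": {"job": "cadvisor"}, "scrapeUrl": "http://cadvisor:8080/metrics",
         "health": "up", "lastError": ""},
        {"labels": {"job": "node"}, "scrapeUrl": "http://node-exporter:9100/metrics",
         "health": "down", "lastError": "connection refused"},
    ]},
}

for job, url, error in unhealthy_targets(sample):
    print(f"{job}: {url} is down ({error})")
```

Against a live stack, `unhealthy_targets(fetch_targets())` would return an empty list when all jobs are being scraped successfully.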
13.8 Exposing Application Metrics
Prometheus client libraries by language
| Language | Library |
|---|---|
| Go | prometheus/client_golang |
| Python | prometheus_client |
| Node.js | prom-client |
| Java | micrometer-registry-prometheus |
Python example
from prometheus_client import CONTENT_TYPE_LATEST, Counter, Histogram, generate_latest
from flask import Flask, Response, request
import time

app = Flask(__name__)

# Define the metrics
REQUEST_COUNT = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)
REQUEST_DURATION = Histogram(
    'http_request_duration_seconds',
    'HTTP request duration',
    ['method', 'endpoint']
)


@app.before_request
def before_request():
    # Remember when the request started
    request.start_time = time.time()


@app.after_request
def after_request(response):
    duration = time.time() - request.start_time
    # Label by route pattern, not raw path, to keep label cardinality bounded
    endpoint = request.url_rule.rule if request.url_rule else 'unmatched'
    REQUEST_COUNT.labels(
        method=request.method,
        endpoint=endpoint,
        status=response.status_code
    ).inc()
    REQUEST_DURATION.labels(
        method=request.method,
        endpoint=endpoint
    ).observe(duration)
    return response


@app.route('/metrics')
def metrics():
    # Serve all registered metrics in the Prometheus text format
    return Response(generate_latest(), content_type=CONTENT_TYPE_LATEST)


@app.route('/health')
def health():
    return {'status': 'healthy'}
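A histogram such as `http_request_duration_seconds` is exported as cumulative buckets: each `_bucket` series with label `le` counts all observations less than or equal to that bound, and a final `+Inf` bucket counts everything. The toy class below models only that bookkeeping. `SimpleHistogram` and the bucket bounds are invented for illustration and are not the `prometheus_client` implementation.

```python
import bisect
import math


class SimpleHistogram:
    """Toy model of a Prometheus histogram: cumulative buckets plus a running sum."""

    def __init__(self, upper_bounds):
        self.upper_bounds = sorted(upper_bounds) + [math.inf]  # '+Inf' catches everything
        self.bucket_counts = [0] * len(self.upper_bounds)      # stored per bucket
        self.total = 0.0

    def observe(self, value):
        # The value lands in the first bucket whose upper bound is >= value.
        index = bisect.bisect_left(self.upper_bounds, value)
        self.bucket_counts[index] += 1
        self.total += value

    def exposition(self):
        """Return (le, cumulative_count) pairs, as the *_bucket series would report."""
        cumulative, rows = 0, []
        for bound, count in zip(self.upper_bounds, self.bucket_counts):
            cumulative += count
            rows.append((bound, cumulative))
        return rows


hist = SimpleHistogram([0.1, 0.5, 1.0])
for duration in [0.05, 0.1, 0.3, 0.7, 2.0]:
    hist.observe(duration)

for bound, count in hist.exposition():
    print(f'le="{bound}" {count}')
```

Because the buckets are cumulative, the last row always equals the total number of observations, which is what the `_count` series reports.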
13.9 Alerting Configuration
Prometheus alerting rules
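# monitoring/rules/alerts.yml
# Note: these rules need container_spec_memory_limit_bytes, container_start_time_seconds,
# and container_last_seen, so the scrape config must not filter those metrics out.
groups:
  - name: container_alerts
    rules:
      # High container CPU usage
      # rate() of CPU seconds is measured in cores, so 0.8 means 80% of one core
      - alert: ContainerHighCPU
        expr: sum by (name) (rate(container_cpu_usage_seconds_total{name=~".+"}[5m])) > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Container {{ $labels.name }} is using more than 80% of one CPU core"

      # High container memory usage (only containers that have a memory limit;
      # without the > 0 filter, an unset limit of 0 would make the ratio infinite)
      - alert: ContainerHighMemory
        expr: container_memory_working_set_bytes{name=~".+"} / (container_spec_memory_limit_bytes{name=~".+"} > 0) > 0.85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Container {{ $labels.name }} is using more than 85% of its memory limit"

      # Frequent container restarts
      # cAdvisor's documented metrics include no restart counter, so changes of the
      # container start time are used as a proxy
      - alert: ContainerRestarting
        expr: changes(container_start_time_seconds{name=~".+"}[15m]) > 3
        labels:
          severity: critical
        annotations:
          summary: "Container {{ $labels.name }} restarted more than 3 times in 15 minutes"

      # Container stopped
      # absent() would return no name label and fires only when every series is missing,
      # so the last-seen timestamp is compared instead; the 60 s threshold acts as the delay
      - alert: ContainerDown
        expr: time() - container_last_seen{name=~".+"} > 60
        labels:
          severity: critical
        annotations:
          summary: "Container {{ $labels.name }} has stopped"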
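Several of these rules rest on how a counter is turned into a rate. As a rough sketch, the function below computes the per-second increase of a counter from `(timestamp, value)` samples and treats any drop as a counter reset. It is deliberately simpler than Prometheus's `rate()`, which also extrapolates to the edges of the window; `simple_rate` and the sample values are made up for illustration.

```python
def simple_rate(samples):
    """Per-second increase of a counter over (timestamp, value) samples.

    Simplified: handles counter resets but skips the window extrapolation
    that Prometheus's rate() performs.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    for (_, previous), (_, current) in zip(samples, samples[1:]):
        # A drop means the counter was reset, for example after a container restart.
        increase += current if current < previous else current - previous
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


# Invented CPU-seconds counter sampled once a minute over five minutes.
cpu_seconds = [(0, 100.0), (60, 151.0), (120, 202.0),
               (180, 253.0), (240, 304.0), (300, 355.0)]
cores_used = simple_rate(cpu_seconds)
print(f"{cores_used:.2f} cores, alert condition met: {cores_used > 0.8}")

# The same idea with a reset in the middle of the window.
with_reset = [(0, 50.0), (60, 80.0), (120, 10.0), (180, 40.0)]
print(f"{simple_rate(with_reset):.3f} per second")
```

The first series grows by 255 CPU-seconds in 300 seconds, so the container averaged 0.85 cores, which is why a threshold of 0.8 on a CPU-seconds rate reads as "80% of one core".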
Alertmanager configuration
# monitoring/alertmanager.yml
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default'
  routes:
    # Note: newer Alertmanager releases deprecate `match` in favor of `matchers`
    - match:
        severity: critical
      receiver: 'pager'

receivers:
  - name: 'default'
    webhook_configs:
      - url: 'http://notification-service:8080/webhook'
  - name: 'pager'
    webhook_configs:
      - url: 'http://notification-service:8080/webhook?priority=high'
    # Or email, Slack, DingTalk, etc.
    # slack_configs:
    #   - api_url: 'https://hooks.slack.com/...'
    #     channel: '#alerts'
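The routing tree above can be read as "use the first child route whose labels match, otherwise fall back to the parent's receiver". Here is a small sketch of that lookup, assuming equality matching only and ignoring options such as `continue`; `pick_receiver` and the `ROUTE` dictionary are hypothetical stand-ins for Alertmanager's real routing code.

```python
def pick_receiver(alert_labels, route):
    """Return the receiver for an alert: first matching child route, else the parent's."""
    for child in route.get("routes", []):
        wanted = child.get("match", {})
        if all(alert_labels.get(key) == value for key, value in wanted.items()):
            return pick_receiver(alert_labels, child)
    return route["receiver"]


# Mirrors the route section of the configuration above.
ROUTE = {
    "receiver": "default",
    "routes": [
        {"match": {"severity": "critical"}, "receiver": "pager"},
    ],
}

print(pick_receiver({"alertname": "ContainerDown", "severity": "critical"}, ROUTE))
print(pick_receiver({"alertname": "ContainerHighCPU", "severity": "warning"}, ROUTE))
```

Critical alerts go to `pager`, and everything else falls through to `default`, which is why the top-level route must always name a receiver.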
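13.10 Monitoring Best Practices
| Practice | Description |
|---|---|
| Infrastructure first, then applications | cAdvisor + Node Exporter → application metrics |
| Set sensible scrape intervals | 15-30s for infrastructure, 5-10s for critical applications |
| Control metric labels | Avoid high-cardinality labels such as user_id |
| Data retention policy | 7 days in development, 30-90 days in production |
| Layered dashboards | Overview → service → container → instance |
| Alert severity levels | critical (act immediately), warning (within 1 hour), info (for awareness) |
| Capacity planning | Monitor Prometheus's own storage and performance |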
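Two of these practices lend themselves to back-of-the-envelope arithmetic. The sketch below estimates the worst-case number of time series for one metric as the product of its label cardinalities, and the disk needed as retention time × ingested samples per second × bytes per sample. All numbers are hypothetical, and the 2 bytes per sample is an assumed upper estimate rather than a measured value.

```python
from math import prod


def series_count(label_value_counts):
    """Worst-case number of time series for one metric: product of label cardinalities."""
    return prod(label_value_counts.values())


def disk_bytes(retention_days, samples_per_second, bytes_per_sample=2.0):
    """Rough disk estimate: retention seconds x ingestion rate x bytes per sample."""
    return retention_days * 86400 * samples_per_second * bytes_per_sample


bounded = series_count({"method": 4, "endpoint": 20, "status": 5})
with_user_id = series_count({"method": 4, "endpoint": 20, "status": 5, "user_id": 10_000})
print(bounded, with_user_id)

# Hypothetical load: 5,000 series scraped every 15 seconds, kept for 30 days.
estimate = disk_bytes(retention_days=30, samples_per_second=5000 / 15)
print(f"{estimate / 1e9:.1f} GB")
```

Adding a single `user_id` label with 10,000 values multiplies 400 series into 4,000,000, which illustrates why high-cardinality labels are the first thing to avoid.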
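13.11 Summary
| Concept | Description |
|---|---|
| cAdvisor | Collects container metrics (CPU, memory, network, disk) |
| Prometheus | Stores and queries metrics; pull model |
| Grafana | Visualization dashboards with support for multiple data sources |
| Node Exporter | Collects host metrics |
| Alertmanager | Alert management and notification routing |
| Docker service discovery | Automatically discovers labeled containers |
| Application metrics | Exposed on /metrics through the Prometheus client library for each language |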