
09 - Monitoring and Alerting

9.1 Cluster Health Checks

Monitoring with the ceph command line

# Cluster status overview (the single most-used command)
ceph -s

How to read the output:

  cluster:
    id:     xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
    health: HEALTH_OK                    ← health status (OK/WARN/ERR)

  services:
    mon: 3 daemons, quorum node1,node2,node3   ← Monitor quorum
    mgr: node1(active)                          ← active Manager
    mds: myfs:1 {0=node1=up:active}             ← MDS status
    osd: 12 osds: 12 up, 12 in                  ← OSD status

  task status:
    scrub status:
        mds.node1: idle

  data:
    pools:   5 pools, 500 pgs
    objects: 150.23k objects, 500 GiB
    usage:   1.5 TiB used, 8.5 TiB / 10 TiB avail   ← capacity usage
    pgs:     500 active+clean                        ← PG states

  io:
    client:   100 MiB/s rd, 50 MiB/s wr, 500 op/s rd, 200 op/s wr
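
For scripting, the same information is available as JSON. Two illustrative jq queries follow; the field names match what recent releases emit, but verify against your version:

# Health status and total PG count from the JSON status (jq required)
ceph -s -f json | jq -r '.health.status'
ceph -s -f json | jq '.pgmap.num_pgs'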

Health status reference

Status      | Meaning                              | Action
HEALTH_OK   | Everything is normal                 | None required
HEALTH_WARN | Warning; a minor issue may exist     | Check the details, watch the trend
HEALTH_ERR  | Serious error; data may be affected  | Investigate immediately

# Show health details
ceph health detail

# Live monitoring (streams the cluster log)
ceph -w

# Capacity usage
ceph df
ceph df detail
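
When a warning is understood and accepted, for example during planned maintenance, Octopus and later can silence it for a limited time. The health code below is only illustrative; use the code shown by ceph health detail:

# Mute a specific health warning for one hour (Octopus and later)
ceph health mute OSD_NEARFULL 1h
# Lift the mute early if needed
ceph health unmute OSD_NEARFULL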

9.2 Core Monitoring Commands

OSD monitoring

# OSD status overview
ceph osd stat
ceph osd tree

# OSD utilization distribution
ceph osd df
ceph osd df tree
# (for a worst-first ranking, pipe the output through sort on the %USE column)

# OSD performance statistics (commit/apply latency, in ms)
ceph osd perf

# In-flight operations on a single OSD (run on the host where osd.0 lives)
ceph daemon osd.0 dump_ops_in_flight

# Trigger a repair of an OSD's PGs (this starts a repair; it is not a status query)
ceph osd repair 0
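
For requests that have already completed slowly, the admin socket keeps a short history. A sketch, run on the host where the OSD lives, with osd.0 illustrative and jq assumed to be installed:

# Recently completed ops on one OSD, with each op's description and duration
ceph daemon osd.0 dump_historic_ops | jq '.ops[] | {description, duration}' | head -20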

PG monitoring

# PG state summary
ceph pg stat

# List all PGs that are not active+clean
ceph pg ls | grep -v 'active+clean'

# Stuck-PG diagnostics
ceph pg dump_stuck unclean
ceph pg dump_stuck inactive
ceph pg dump_stuck stale
ceph pg dump_stuck undersized

# PG-to-OSD mapping for one PG
ceph pg map <pgid>
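
To understand why a particular PG is stuck, query it directly; the PG ID below is illustrative:

# Full state-machine and peering information for one PG
ceph pg 2.1f query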

Pool monitoring

# Per-pool I/O statistics
ceph osd pool stats

# Pool settings in detail
ceph osd pool ls detail

# Per-pool usage
ceph df | grep -A 20 "POOLS"
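
On Nautilus and later, the PG autoscaler's per-pool view (stored size, target ratio, recommended PG count) is also worth checking during capacity reviews:

# Per-pool size, target ratio, and suggested PG count
ceph osd pool autoscale-status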

9.3 Prometheus + Grafana Monitoring

Enabling the Prometheus module

# Enable the Prometheus exporter module
ceph mgr module enable prometheus

# Verify the endpoint (the module listens on port 9283 by default)
curl -s http://<mgr-node>:9283/metrics | head -20

# To change the listen port or bind address:
ceph config set mgr mgr/prometheus/server_port 9283
ceph config set mgr mgr/prometheus/server_addr 0.0.0.0
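
A quick spot check that the exporter is really serving metrics, using the same host placeholder as above:

# Should print one sample, e.g. "ceph_health_status 0.0"
curl -s http://<mgr-node>:9283/metrics | grep '^ceph_health_status'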

Prometheus configuration

# prometheus.yml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'ceph'
    static_configs:
      - targets:
          - 'node1:9283'
          - 'node2:9283'
    metrics_path: /metrics

  # Node Exporter (host-level metrics)
  - job_name: 'node'
    static_configs:
      - targets:
          - 'node1:9100'
          - 'node2:9100'
          - 'node3:9100'
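
Before reloading Prometheus, the file can be syntax-checked with promtool, which ships with Prometheus:

# Validate the scrape configuration
promtool check config prometheus.yml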

Key Ceph metrics

Metric                        | Description                           | Suggested alert threshold
ceph_health_status            | Cluster health (0=OK, 1=WARN, 2=ERR)  | alert when > 0
ceph_osd_in                   | Whether an OSD is in                  | alert when = 0
ceph_osd_up                   | Whether an OSD is up                  | alert when = 0
ceph_pg_active                | Number of active PGs                  | -
ceph_pg_degraded              | Number of degraded PGs                | alert when > 0
ceph_pg_stale                 | Number of stale PGs                   | alert when > 0
ceph_pool_used_bytes          | Bytes used per pool                   | -
ceph_cluster_total_used_bytes | Total bytes used cluster-wide         | warn above 80% of total
ceph_osd_op_r_latency         | OSD read latency                      | alert above 50 ms
ceph_osd_op_w_latency         | OSD write latency                     | alert above 100 ms
ceph_osd_recovery_ops         | Recovery operations per OSD           | alert above 1000

Note: recent releases export the latency counters as _sum/_count pairs (ceph_osd_op_r_latency_sum and ceph_osd_op_r_latency_count), so the average latency is derived with rate() rather than read from a single series.
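
As a sketch of putting the derived latency to use, the expression can be tested against Prometheus's HTTP query API; the prometheus:9090 address is an assumption taken from the integration examples later in this chapter:

# Average read latency per OSD over the last 5 minutes
curl -s 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=rate(ceph_osd_op_r_latency_sum[5m]) / rate(ceph_osd_op_r_latency_count[5m])'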

Grafana dashboards

# Recommended Grafana dashboard IDs (import from grafana.com)
# 5336  - Ceph - Cluster Overview
# 5342  - Ceph - OSD Details
# 7056  - Ceph - Pools Overview
# 1471  - Ceph - Prometheus Overview

# To import:
# Grafana → + → Import → enter the dashboard ID → Load → select the Prometheus data source

9.4 Alert Configuration

Prometheus alerting rules (evaluated by Prometheus; Alertmanager handles routing and notification)

# ceph_alerts.yml
groups:
  - name: ceph_alerts
    rules:
      # Cluster health is not OK (note: fires on WARN as well as ERR)
      - alert: CephHealthError
        expr: ceph_health_status > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Ceph cluster health is not OK"
          description: "Ceph health status is {{ $value }} (0=OK, 1=WARN, 2=ERR)"

      # OSD down
      - alert: CephOSDDown
        expr: ceph_osd_up == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "OSD {{ $labels.ceph_daemon }} is down"

      # Degraded PGs
      - alert: CephPGDegraded
        expr: ceph_pg_degraded > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "PGs are in a degraded state"
          description: "Number of degraded PGs: {{ $value }}"

      # Capacity alerts
      - alert: CephCapacityWarning
        expr: (ceph_cluster_total_used_bytes / ceph_cluster_total_bytes) > 0.75
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Ceph cluster capacity usage above 75%"

      - alert: CephCapacityCritical
        expr: (ceph_cluster_total_used_bytes / ceph_cluster_total_bytes) > 0.85
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Ceph cluster capacity usage above 85%"

      # OSD latency alert (the latency counters are exported as _sum/_count
      # pairs, so the average is derived with rate())
      - alert: CephOSDHighLatency
        expr: rate(ceph_osd_op_r_latency_sum[5m]) / rate(ceph_osd_op_r_latency_count[5m]) > 0.05
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "OSD {{ $labels.ceph_daemon }} read latency is high: {{ $value }}s"

      # Recovery in progress
      - alert: CephRecoveryInProgress
        expr: ceph_pg_backfilling > 0 or ceph_pg_recovering > 0
        for: 0m
        labels:
          severity: info
        annotations:
          summary: "Ceph data recovery in progress, PGs: {{ $value }}"

9.5 Ceph Dashboard

# Make sure the dashboard module is enabled, then look up its URL
ceph mgr module enable dashboard
ceph mgr services

# Set login credentials (recent releases read the password from a file via -i;
# older ones accepted it inline on the command line)
ceph dashboard set-login-credentials admin -i /path/to/password.txt

# Enable SSL (self-signed certificate)
ceph dashboard create-self-signed-cert

# Prometheus integration
ceph dashboard set-prometheus-api-host 'http://prometheus:9090'

# Grafana integration
ceph dashboard set-grafana-api-url 'http://grafana:3000'
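
For day-to-day viewing it is safer to hand out a read-only account than the admin login. A sketch using the dashboard's access-control commands; the user name and password file are placeholders:

# Create a read-only dashboard user
ceph dashboard ac-user-create viewer -i /path/to/viewer-password.txt read-only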

9.6 Custom Monitoring Scripts

#!/bin/bash
# ceph_health_check.sh - basic Ceph health check

CEPH_CMD="ceph"
LOG_FILE="/var/log/ceph/health_check.log"

# Collect cluster state as JSON (field names match recent releases;
# verify with `ceph pg stat -f json` on your version)
HEALTH=$($CEPH_CMD health --format json | jq -r '.status')
OSD_STAT=$($CEPH_CMD osd stat --format json)
OSD_UP=$(echo "$OSD_STAT" | jq '.num_up_osds')
OSD_IN=$(echo "$OSD_STAT" | jq '.num_in_osds')
PG_DEGRADED=$($CEPH_CMD pg stat --format json | jq '.num_pg_degraded // 0')
USAGE_PCT=$($CEPH_CMD df --format json | jq '.stats.total_used_bytes / .stats.total_bytes * 100 | floor')

echo "=== Ceph Health Check - $(date) ===" | tee -a "$LOG_FILE"
echo "Health: $HEALTH" | tee -a "$LOG_FILE"
echo "OSD Up/In: $OSD_UP/$OSD_IN" | tee -a "$LOG_FILE"
echo "PG Degraded: $PG_DEGRADED" | tee -a "$LOG_FILE"
echo "Usage: ${USAGE_PCT}%" | tee -a "$LOG_FILE"

# Alerting decisions
if [ "$HEALTH" != "HEALTH_OK" ]; then
    echo "ALERT: Cluster health is $HEALTH" | tee -a "$LOG_FILE"
    # Send an alert here (mail / webhook / DingTalk)
fi

if [ "$PG_DEGRADED" -gt 0 ]; then
    echo "ALERT: $PG_DEGRADED degraded PGs" | tee -a "$LOG_FILE"
fi

if [ "$USAGE_PCT" -gt 80 ]; then
    echo "WARNING: Disk usage at ${USAGE_PCT}%" | tee -a "$LOG_FILE"
fi
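
To run the check periodically, a cron entry along these lines works; the install path is an assumption:

# /etc/cron.d/ceph-health-check - run every 5 minutes as root
*/5 * * * * root /usr/local/bin/ceph_health_check.sh >/dev/null 2>&1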

9.7 Log Management

# Ceph log locations
# /var/log/ceph/ceph-mon.*.log
# /var/log/ceph/ceph-osd.*.log
# /var/log/ceph/ceph-mgr.*.log

# Follow logs live
tail -f /var/log/ceph/ceph-mon.node1.log
journalctl -u ceph-osd@0 -f

# Cluster-wide crash reports
ceph crash ls
ceph crash info <crash-id>
ceph crash archive <crash-id>    # archive (stop warning about it)
ceph crash archive-all           # archive all crash reports

# Adjust log levels (temporary, per daemon)
ceph tell osd.0 config set debug_osd 0/5
ceph tell mon.node1 config set debug_mon 0/5

# Adjust log levels (persistent)
ceph config set osd debug_osd 0/5
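
The monitors also keep a cluster-wide log that records health transitions and other notable events, readable without logging in to individual hosts:

# Show the most recent cluster log entries
ceph log last 20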

9.8 Operational Scenario: Daily Inspection

#!/bin/bash
# ceph_daily_check.sh

echo "==================== Ceph Daily Report ===================="
echo "Date: $(date)"
echo ""

echo "--- Cluster Status ---"
ceph -s

echo ""
echo "--- OSD Utilization ---"
ceph osd df tree | head -20

echo ""
echo "--- Pool Usage ---"
ceph df | grep -A 30 "POOLS"

echo ""
echo "--- Stuck PGs ---"
ceph pg dump_stuck unclean 2>/dev/null | head -5
ceph pg dump_stuck inactive 2>/dev/null | head -5
ceph pg dump_stuck stale 2>/dev/null | head -5

echo ""
echo "--- Slow OSDs ---"
ceph osd perf | tail -n +2 | sort -k3 -rn | head -10

echo ""
echo "--- Recent Crashes ---"
ceph crash ls | head -10

echo ""
echo "==================== End of Report ===================="

Further Reading

  1. Ceph monitoring documentation
  2. Prometheus Ceph exporter
  3. Grafana Ceph dashboards
  4. Prometheus Alertmanager

Next chapter: 10 - Performance Tuning, covering PG count optimization, OSD tuning, BlueStore configuration, and network optimization strategies.