
Memcached Complete Guide

Chapter 12: Monitoring and Alerting

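12.1 Metrics Overview

Key metric categories

| Category | Metric | Alert threshold | Meaning |
|---|---|---|---|
| Health | uptime | < 60s | Instance has just restarted |
| Hit rate | get_hits / (get_hits + get_misses) | < 80% | Cache is ineffective |
| Memory | bytes / limit_maxbytes | > 90% | Memory is nearly exhausted |
| Connections | curr_connections / max_connections | > 80% | Connection count is approaching the limit |
| Evictions | evictions | > 0 and still growing | Items are being evicted because memory is short |
| QPS | rate of cmd_get + cmd_set | Depends on the baseline | Abnormal traffic |
| Latency | P99 response time | > 5ms | Performance degradation |
| Slab | slab_automove | Abnormal | Slab allocation problem |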

12.2 The stats Command in Detail

General statistics

echo "stats" | nc localhost 11211
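| Stat | Meaning | Importance |
|---|---|---|
| pid | Process ID | |
| uptime | Seconds since the server started | ★★★ |
| time | Current Unix timestamp | |
| version | Version string | ★★ |
| libevent | libevent version | |
| pointer_size | Pointer size in bits | |
| rusage_user | User-mode CPU time | ★★ |
| rusage_system | Kernel-mode CPU time | ★★ |
| curr_connections | Currently open connections | ★★★★★ |
| total_connections | Connections opened since startup | ★★ |
| connection_structures | Connection structures allocated | ★★ |
| rejected_connections | Connections rejected | ★★★★ |
| cmd_get | GET requests | ★★★★ |
| cmd_set | SET requests | ★★★★ |
| cmd_flush | FLUSH requests | ★★★ |
| cmd_touch | TOUCH requests | ★★ |
| get_hits | GET hits | ★★★★★ |
| get_misses | GET misses | ★★★★★ |
| get_expired | GETs that found an expired item | ★★★ |
| get_flushed | GETs that found an item invalidated by a flush | ★★ |
| delete_misses | DELETE misses | ★★ |
| delete_hits | DELETE hits | ★★ |
| incr_misses | INCR misses | ★★ |
| incr_hits | INCR hits | ★★ |
| decr_misses | DECR misses | ★★ |
| decr_hits | DECR hits | ★★ |
| cas_misses | CAS misses | ★★ |
| cas_hits | CAS hits | ★★ |
| cas_badval | CAS requests whose value did not match | ★★ |
| bytes | Bytes currently used to store items | ★★★★★ |
| limit_maxbytes | Memory limit in bytes | ★★★★★ |
| curr_items | Items currently stored | ★★★★ |
| total_items | Items stored since startup | ★★★ |
| evictions | Items evicted to make room for new ones | ★★★★★ |
| bytes_read | Bytes read from the network | ★★ |
| bytes_written | Bytes written to the network | ★★ |
| threads | Worker threads | ★★ |
| hash_power_level | Hash table size as a power of two | ★★ |
| hash_bytes | Bytes used by the hash table | ★★ |
| hash_is_expanding | Whether the hash table is being expanded | ★★ |
| slab_reassign_running | Whether a slab page move is in progress | ★★ |
| slabs_moved | Slab pages moved | ★★ |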

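Calculating the hit rate

#!/bin/bash
# Compute the Memcached hit rate (cumulative since the server started)
# tr strips the \r from the protocol's CRLF line endings, which would otherwise break $(( ))
STATS=$(echo "stats" | nc localhost 11211 | tr -d '\r')
# Match the stat name exactly instead of grepping for a substring
HITS=$(echo "$STATS" | awk '$2 == "get_hits" {print $3}')
MISSES=$(echo "$STATS" | awk '$2 == "get_misses" {print $3}')
# Default to 0 so an empty reply does not cause an arithmetic syntax error
TOTAL=$(( ${HITS:-0} + ${MISSES:-0} ))

if [ "$TOTAL" -gt 0 ]; then
    HIT_RATE=$(echo "scale=2; $HITS * 100 / $TOTAL" | bc)
    echo "Hit rate: ${HIT_RATE}%"
    echo "Hits: $HITS, misses: $MISSES, total: $TOTAL"
else
    echo "No request data yet"
fi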
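
The counters returned by `stats` are cumulative, so the script above reports the lifetime hit rate, which reacts slowly to a recent drop. The sketch below shows one way to derive the hit rate and QPS for a single interval from two snapshots. The function name `interval_metrics` and the sample numbers are illustrative assumptions, not part of Memcached.

```python
def interval_metrics(prev, curr, seconds):
    """Compute per-interval hit rate and QPS from two cumulative `stats` snapshots.

    `prev` and `curr` map stat names to values (strings or ints), as parsed
    from the `stats` command. Returns None if any counter went backwards,
    which is what a restart between the two snapshots looks like.
    """
    def delta(name):
        return int(curr.get(name, 0)) - int(prev.get(name, 0))

    hits, misses = delta("get_hits"), delta("get_misses")
    gets, sets = delta("cmd_get"), delta("cmd_set")
    if seconds <= 0 or min(hits, misses, gets, sets) < 0:
        return None  # counter reset or bad interval; skip this sample

    lookups = hits + misses
    return {
        # None means "no GET traffic in this interval", not "0% hit rate"
        "hit_rate_pct": hits * 100 / lookups if lookups else None,
        "qps": (gets + sets) / seconds,
    }


if __name__ == "__main__":
    # Hypothetical snapshots taken 60 seconds apart
    before = {"get_hits": "900", "get_misses": "100", "cmd_get": "1000", "cmd_set": "200"}
    after = {"get_hits": "1050", "get_misses": "150", "cmd_get": "1200", "cmd_set": "260"}
    print(interval_metrics(before, after, 60))
```

Returning None on a counter reset keeps a restart from showing up as a large negative rate.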

Item statistics

echo "stats items" | nc localhost 11211
# STAT items:1:number 523
# STAT items:1:number_hot 100
# STAT items:1:number_warm 150
# STAT items:1:number_cold 250
# STAT items:1:number_temp 23
# STAT items:1:age 1234
# STAT items:1:evicted 50
# STAT items:1:evicted_nonzero 40
# STAT items:1:evicted_time 300
# STAT items:1:outofmemory 5
# STAT items:1:tailrepairs 10
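
Each `stats items` line encodes the slab class ID and the field name in the key (`items:<class_id>:<field>`). As a rough sketch, the output can be grouped per slab class like this; the function name `parse_items_stats` is an assumption made for illustration, and the sample reuses values from the output above.

```python
from collections import defaultdict


def parse_items_stats(text):
    """Group `stats items` output by slab class ID.

    Each line has the form `STAT items:<class_id>:<field> <value>`.
    Returns {class_id: {field: int_value}}.
    """
    per_class = defaultdict(dict)
    for line in text.splitlines():  # splitlines() also handles the \r\n endings
        parts = line.strip().split()
        if len(parts) != 3 or parts[0] != "STAT":
            continue  # skips END and blank lines
        key = parts[1].split(":")
        if len(key) != 3 or key[0] != "items":
            continue
        _, class_id, field = key
        per_class[int(class_id)][field] = int(parts[2])
    return dict(per_class)


if __name__ == "__main__":
    sample = (
        "STAT items:1:number 523\r\n"
        "STAT items:1:age 1234\r\n"
        "STAT items:1:evicted 50\r\n"
        "STAT items:1:outofmemory 5\r\n"
        "END\r\n"
    )
    for class_id, fields in parse_items_stats(sample).items():
        print(class_id, fields["number"], fields["evicted"])
```

Grouping by class makes it easy to spot which slab class is evicting or running out of memory.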

Slab statistics

echo "stats slabs" | nc localhost 11211

Settings statistics

echo "stats settings" | nc localhost 11211
# STAT maxbytes 134217728
# STAT maxconns 1024
# STAT tcpport 11211
# STAT udpport 0
# STAT inter 127.0.0.1
# STAT verbosity 0
# STAT oldest 0
# STAT evictions on
# STAT domain_socket NULL
# STAT umask 700
# STAT growth_factor 1.25
# STAT chunk_size 48
# STAT num_threads 4
# STAT num_threads_per_udp 4
# STAT stat_key_prefix :
# STAT detail_enabled no
# STAT reqs_per_event 20
# STAT cas_enabled yes
# STAT tcp_backlog 1024
# STAT binding_protocol auto-negotiate
# STAT auth_enabled_sasl no
# STAT item_size_max 1048576
# STAT maxconns_fast yes
# STAT hashpower_init 0
# STAT slab_reassign yes
# STAT slab_automove 1
# STAT lru_maintainer_thread yes
# STAT lru_crawler no
# STAT lru_crawler_sleep 100
# STAT lru_crawler_tocrawl 0
# STAT tail_repair_time 0
# STAT flush_enabled yes
# STAT dump_enabled yes
# STAT hash_algorithm murmur3
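
The settings output is the authoritative source for configured limits such as `maxbytes` and `maxconns`. A minimal sketch of reading those limits follows; the helper names `parse_settings` and `connection_usage_pct` are hypothetical, and values are kept as strings because the settings mix numbers, flags, and text.

```python
def parse_settings(text):
    """Parse `stats settings` output into {name: value}, keeping values as strings."""
    settings = {}
    for line in text.splitlines():
        # Split into at most three parts so a value containing spaces stays intact
        parts = line.strip().split(None, 2)
        if len(parts) == 3 and parts[0] == "STAT":
            settings[parts[1]] = parts[2]
    return settings


def connection_usage_pct(curr_connections, settings):
    """Percentage of the configured connection limit (`maxconns`) in use."""
    maxconns = int(settings["maxconns"])
    return curr_connections / maxconns * 100


if __name__ == "__main__":
    sample = (
        "STAT maxbytes 134217728\r\n"
        "STAT maxconns 1024\r\n"
        "STAT growth_factor 1.25\r\n"
        "END\r\n"
    )
    settings = parse_settings(sample)
    print(int(settings["maxbytes"]) // 2**20, "MB memory limit")
    print(connection_usage_pct(256, settings), "% of maxconns in use")
```

Here `curr_connections` would come from the general `stats` output; pairing it with `maxconns` gives the connection usage ratio listed in 12.1.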

12.3 Monitoring with Prometheus + Grafana

Architecture

┌──────────────┐     ┌─────────────────────┐     ┌──────────┐
│  Memcached   │────▶│  Exporter           │────▶│Prometheus│
│  :11211      │stats│  (memcached_exporter)│     │  :9090   │
└──────────────┘     │  :9150              │     └────┬─────┘
                     └─────────────────────┘          │
                                                      ▼
                                                ┌──────────┐
                                                │ Grafana  │
                                                │  :3000   │
                                                └──────────┘

Deploying the Memcached Exporter

# Using Docker
docker run -d --name memcached-exporter \
    -p 9150:9150 \
    prom/memcached-exporter \
    --memcached.address=memcached:11211

# Or use the release binary
wget https://github.com/prometheus/memcached_exporter/releases/download/v0.14.4/memcached_exporter-0.14.4.linux-amd64.tar.gz
tar xzf memcached_exporter-0.14.4.linux-amd64.tar.gz
# The archive unpacks into a directory named after the release
./memcached_exporter-0.14.4.linux-amd64/memcached_exporter --memcached.address=localhost:11211

Prometheus configuration

# prometheus.yml
scrape_configs:
  - job_name: 'memcached'
    static_configs:
      - targets:
          - 'mc-exporter1:9150'
          - 'mc-exporter2:9150'
          - 'mc-exporter3:9150'
    scrape_interval: 15s
    scrape_timeout: 10s

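Core exporter metrics

| Prometheus metric | Meaning | Type |
|---|---|---|
| memcached_up | Whether the instance is reachable | gauge |
| memcached_items_total | Items stored since startup | counter |
| memcached_current_bytes | Bytes currently used | gauge |
| memcached_limit_bytes | Memory limit | gauge |
| memcached_commands_total | Commands processed, labeled by command and status | counter |
| memcached_connections_total | Connections opened since startup | counter |
| memcached_current_items | Items currently stored | gauge |
| memcached_items_evicted_total | Items evicted | counter |
| memcached_slab_chunk_size_bytes | Slab chunk size | gauge |
| memcached_slab_chunks_free | Free chunks per slab class | gauge |
| memcached_slab_chunks_used | Used chunks per slab class | gauge |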
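
The exporter serves these metrics in the Prometheus text exposition format, one sample per line. As a quick sanity check without a running Prometheus, the text can be parsed directly. The sketch below is deliberately simplified: it assumes samples carry no trailing timestamp, it keeps any label set as part of the series key, and the sample values are made up.

```python
def parse_exposition(text):
    """Parse Prometheus text-format samples into {series: float}.

    Simplification: assumes `<series> <value>` with no trailing timestamp.
    The series key keeps its label set, e.g. 'metric{label="x"}'.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and # HELP / # TYPE comments
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


if __name__ == "__main__":
    # Made-up sample in the shape of the exporter's output
    sample = """\
# HELP memcached_up Could the memcached server be reached.
# TYPE memcached_up gauge
memcached_up 1
memcached_current_bytes 67108864
memcached_limit_bytes 134217728
memcached_commands_total{command="get",status="hit"} 900
memcached_commands_total{command="get",status="miss"} 100
"""
    m = parse_exposition(sample)
    usage = m["memcached_current_bytes"] / m["memcached_limit_bytes"] * 100
    print(f"up={m['memcached_up']:.0f} memory usage={usage:.1f}%")
```

For production use, a maintained parser such as the one in the official Python client library is a safer choice than this hand-rolled version.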

Grafana dashboards

Community dashboard templates are a good starting point:

# Import Grafana dashboard ID: 11987 (Memcached Overview)
# or ID: 2279 (Memcached Full)

Common PromQL queries

# Hit rate (%)
sum(rate(memcached_commands_total{command="get",status="hit"}[5m]))
/
sum(rate(memcached_commands_total{command="get"}[5m]))
* 100

# QPS
sum(rate(memcached_commands_total[5m]))

# Memory usage (%)
memcached_current_bytes / memcached_limit_bytes * 100

# Connection usage (%)
memcached_current_connections / memcached_max_connections * 100

# Eviction rate (items/s)
rate(memcached_items_evicted_total[5m])

# QPS per command
sum by (command) (rate(memcached_commands_total[5m]))
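
Outside Grafana, the same queries can be run against the Prometheus HTTP API instant-query endpoint, `/api/v1/query`. The sketch below builds the request URL and extracts values from the JSON response. The base URL `http://localhost:9090` matches the port in the architecture diagram but is otherwise an assumption, as are the helper names.

```python
import json
from urllib.parse import parse_qs, urlencode, urlsplit
from urllib.request import urlopen


def build_query_url(base_url, promql):
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def extract_values(payload):
    """Return [(labels, float_value)] from an instant-vector query response."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    # Each result's "value" is a [timestamp, "value-as-string"] pair
    return [(r["metric"], float(r["value"][1])) for r in payload["data"]["result"]]


def run_query(base_url, promql, timeout=5):
    """Execute the query over HTTP (needs a reachable Prometheus server)."""
    with urlopen(build_query_url(base_url, promql), timeout=timeout) as resp:
        return extract_values(json.load(resp))


if __name__ == "__main__":
    url = build_query_url(
        "http://localhost:9090",  # assumed Prometheus address
        "memcached_current_bytes / memcached_limit_bytes * 100",
    )
    print(url)
    print(parse_qs(urlsplit(url).query)["query"][0])  # decodes back to the PromQL text
```

`run_query` is not called in the demo because it needs a live server; `urlencode` handles the escaping of spaces, slashes, and braces in the expression.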

12.4 Alerting Rules

Prometheus alerting rules (for Alertmanager)

# memcached_alerts.yml
groups:
  - name: memcached
    rules:
      # Instance down
      - alert: MemcachedDown
        expr: memcached_up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Memcached instance is down"
          description: "{{ $labels.instance }} has been down for more than 1 minute"

      # Low hit rate
      - alert: MemcachedHitRateLow
        expr: |
          sum(rate(memcached_commands_total{command="get",status="hit"}[5m]))
          / sum(rate(memcached_commands_total{command="get"}[5m]))
          * 100 < 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Memcached hit rate is below 80%"
          description: "Current hit rate: {{ $value }}%"

      # High memory usage
      - alert: MemcachedMemoryHigh
        expr: memcached_current_bytes / memcached_limit_bytes * 100 > 90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Memcached memory usage is above 90%"

      # Connection count approaching the limit
      - alert: MemcachedConnectionsHigh
        expr: memcached_current_connections / memcached_max_connections * 100 > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Memcached connection usage is above 80%"

      # Sustained evictions
      - alert: MemcachedEvictions
        expr: rate(memcached_items_evicted_total[5m]) > 10
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Memcached is continuously evicting items"
          description: "Eviction rate: {{ $value }}/s"
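
Each rule above carries a `for:` clause, which delays firing until the expression has been true on every evaluation for that long; a single false evaluation resets the timer. The following is a simplified model of that pending/firing behavior, not Prometheus code. The function name `alert_states` and the fixed evaluation interval are assumptions for illustration.

```python
def alert_states(condition_samples, for_seconds, interval_seconds):
    """Model the effect of `for:` on an alert rule.

    `condition_samples` holds one boolean per rule evaluation (True when the
    expression matched), with evaluations `interval_seconds` apart.
    Returns 'inactive', 'pending', or 'firing' for each evaluation.
    """
    states = []
    active_since = None  # time at which the condition most recently became true
    for i, is_true in enumerate(condition_samples):
        now = i * interval_seconds
        if not is_true:
            active_since = None  # any false evaluation resets the timer
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        held_for = now - active_since
        states.append("firing" if held_for >= for_seconds else "pending")
    return states


if __name__ == "__main__":
    # MemcachedDown uses `for: 1m`; assume a 15s evaluation interval
    print(alert_states([True] * 5, for_seconds=60, interval_seconds=15))
    # A brief recovery in the middle restarts the wait
    print(alert_states([True, True, False, True], for_seconds=30, interval_seconds=15))
```

This is why a short `for:` suits `MemcachedDown`, while noisier ratios such as hit rate use 5 to 10 minutes to avoid flapping.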

12.5 Custom Monitoring Scripts

Complete monitoring script

#!/usr/bin/env python3
"""Memcached monitoring script."""
import socket
import sys

def get_stats(host='localhost', port=11211):
    """Send `stats` and return the reply as a dict of name -> string value."""
    data = b''
    with socket.create_connection((host, port), timeout=5) as s:
        s.sendall(b'stats\r\n')
        # Check the whole buffer: the END marker can be split across two recv() calls
        while not data.endswith(b'END\r\n'):
            chunk = s.recv(4096)
            if not chunk:  # server closed the connection early
                break
            data += chunk

    stats = {}
    for line in data.decode().split('\r\n'):
        if line.startswith('STAT '):
            parts = line.split()
            stats[parts[1]] = parts[2]
    return stats

def check_health(stats):
    alerts = []

    # Hit rate
    hits = int(stats.get('get_hits', 0))
    misses = int(stats.get('get_misses', 0))
    total = hits + misses
    if total > 0:
        hit_rate = hits / total * 100
        if hit_rate < 80:
            alerts.append(f"Hit rate too low: {hit_rate:.1f}%")

    # Memory usage
    used = int(stats.get('bytes', 0))
    limit = int(stats.get('limit_maxbytes', 0))
    if limit > 0:
        mem_pct = used / limit * 100
        if mem_pct > 90:
            alerts.append(f"Memory usage too high: {mem_pct:.1f}%")

    # Connections (skipped if the server does not report max_connections)
    curr_conn = int(stats.get('curr_connections', 0))
    max_conn = int(stats.get('max_connections', 0))
    if max_conn > 0:
        conn_pct = curr_conn / max_conn * 100
        if conn_pct > 80:
            alerts.append(f"Connection usage too high: {conn_pct:.1f}%")

    # Evictions (cumulative since startup, so this stays set after the first eviction)
    evictions = int(stats.get('evictions', 0))
    if evictions > 0:
        alerts.append(f"Evictions present: {evictions}")

    # Rejected connections (also cumulative)
    rejected = int(stats.get('rejected_connections', 0))
    if rejected > 0:
        alerts.append(f"Rejected connections present: {rejected}")

    return alerts

def print_report(stats):
    hits = int(stats.get('get_hits', 0))
    misses = int(stats.get('get_misses', 0))
    total = hits + misses
    hit_rate = (hits / total * 100) if total > 0 else 0

    # stats has no cmd_delete/cmd_incr/cmd_decr counters; sum hits and misses instead
    def count(op):
        return int(stats.get(f'{op}_hits', 0)) + int(stats.get(f'{op}_misses', 0))

    print(f"""
Memcached Monitoring Report
═══════════════════════════════════
Version:     {stats.get('version', 'N/A')}
Uptime:      {int(stats.get('uptime', 0)) // 3600} hours
Threads:     {stats.get('threads', 'N/A')}

━━ Hit rate ━━━━━━━━━━━━━━━━━━━━
Hit rate:    {hit_rate:.2f}%
Hits:        {hits}
Misses:      {misses}

━━ Memory ━━━━━━━━━━━━━━━━━━━━━━
Used:        {int(stats.get('bytes', 0)) / 1048576:.1f} MB
Limit:       {int(stats.get('limit_maxbytes', 0)) / 1048576:.1f} MB
Usage:       {int(stats.get('bytes', 0)) / max(int(stats.get('limit_maxbytes', 1)), 1) * 100:.1f}%
Items:       {stats.get('curr_items', 'N/A')}

━━ Traffic ━━━━━━━━━━━━━━━━━━━━━
GET:         {stats.get('cmd_get', 'N/A')}
SET:         {stats.get('cmd_set', 'N/A')}
DELETE:      {count('delete')}
INCR:        {count('incr')}
DECR:        {count('decr')}

━━ Connections ━━━━━━━━━━━━━━━━━
Current:     {stats.get('curr_connections', 'N/A')}
Maximum:     {stats.get('max_connections', 'N/A')}
Rejected:    {stats.get('rejected_connections', 'N/A')}

━━ Evictions ━━━━━━━━━━━━━━━━━━━
Evictions:   {stats.get('evictions', 'N/A')}
""")

if __name__ == '__main__':
    host = sys.argv[1] if len(sys.argv) > 1 else 'localhost'
    port = int(sys.argv[2]) if len(sys.argv) > 2 else 11211

    stats = get_stats(host, port)
    print_report(stats)

    alerts = check_health(stats)
    if alerts:
        print("⚠️  Alerts:")
        for a in alerts:
            print(f"  - {a}")
    else:
        print("✅ Status OK")

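12.6 Log Analysis

Enabling verbose logging

# Set the log level at startup
memcached -v    # verbose: errors and warnings
memcached -vv   # very verbose: also client commands and responses
memcached -vvv  # extremely verbose: also internal state transitions (debugging)

# Adjust at runtime
echo "verbosity 2" | nc localhost 11211

Log levels

| Level | Flag | Output |
|---|---|---|
| 0 | (none) | No verbose output (default) |
| 1 | -v | Errors and warnings |
| 2 | -vv | Adds client commands and responses |
| 3 | -vvv | Adds internal state transitions |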

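Summary

| Topic | Key points |
|---|---|
| Core metrics | Hit rate, memory usage, connection count, evictions |
| Hit rate | get_hits / (get_hits + get_misses); keep it above 80% |
| Recommended stack | Prometheus + memcached_exporter + Grafana |
| Alert thresholds | Memory > 90%, connections > 80%, hit rate < 80%, evictions > 0 |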