
Chapter 13: Troubleshooting

This chapter covers how to diagnose and fix the most common rqlite problems.

13.1 Troubleshooting workflow

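Problem detected
    │
    ▼
Check node status ─────────────────────────────────────┐
    │                                                  │
    ├── Node online?                                   │
    │   ├── Yes → check cluster status                 │
    │   └── No  → check process and logs (13.2)        │
    │                                                  │
    ├── Leader present?                                │
    │   ├── Yes → check data consistency               │
    │   └── No  → check Raft election (13.3)           │
    │                                                  │
    ├── Writes succeed?                                │
    │   ├── Yes → problem is likely client-side        │
    │   └── No  → check the error message (13.4)       │
    │                                                  │
    └── Reads correct?                                 │
        ├── Yes → problem resolved                     │
        └── No  → check consistency level (13.5)       │
                                                       ▼
                                            Review logs and metrics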

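13.2 Node fails to start

Problem: rqlited will not start

# Check the process state and the most recent log lines
systemctl status rqlited
journalctl -u rqlited --no-pager -n 50

# Check ownership and permissions of the data directory
ls -la /var/lib/rqlite/
ls -la /var/lib/rqlite/data/

# Check whether the HTTP (4001) and Raft (4002) ports are already in use
ss -tlnp | grep 4001
ss -tlnp | grep 4002

Error message	Cause	Fix
`permission denied`	Wrong ownership or permissions on the data directory	`chown -R rqlite:rqlite /var/lib/rqlite`
`address already in use`	Port is already taken	Use a different port or stop the process holding it
`cannot open database`	Data file is corrupted	Restore from a backup, or clear the data directory and re-sync from the cluster
`join failed: node already part of cluster`	Node was already joined	Clear the node's data, then rejoin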
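
The permission and port checks above can also be scripted. The following is a minimal sketch, assuming the data directory and ports used in this chapter. The helper names `check_data_dir` and `port_is_free` are illustrative and not part of rqlite, and the permission check applies to the user running the script, so run it as the rqlite service user for a meaningful result.

```python
import os
import socket


def check_data_dir(path):
    """Return a short diagnosis string for the data directory at `path`."""
    if not os.path.isdir(path):
        return "missing"
    # The process needs read, write, and traverse rights on the directory.
    if not os.access(path, os.R_OK | os.W_OK | os.X_OK):
        return "permission denied"
    return "ok"


def port_is_free(port, host="127.0.0.1"):
    """Return True if `port` can be bound on `host`, i.e. nothing is listening on it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError:
            # Typically "address already in use".
            return False
        return True
```

For example, `check_data_dir("/var/lib/rqlite/data")` and `port_is_free(4001)` cover the first two rows of the table before the service is restarted.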

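Problem: recovering from a corrupted data file

# Stop the service
systemctl stop rqlited

# Check the integrity of the SQLite database
sqlite3 /var/lib/rqlite/data/db.sqlite "PRAGMA integrity_check;"

# Output of "ok" means the database is intact.
# Any other output means the file is damaged; use one of the methods below.

# Method 1: clear the data and re-sync from the cluster
# (the service must be configured with the -join option so the node rejoins on start)
rm -rf /var/lib/rqlite/data/*
systemctl start rqlited

# Method 2: restore from a backup (an SQL dump; the node must be running to accept it)
curl -XPOST 'localhost:4001/db/load' \
    -H 'Content-Type: text/plain' \
    --data-binary @backup.sql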
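
The integrity check can also be run from Python's standard `sqlite3` module, which is convenient inside a larger recovery script. This is a sketch under the same assumption as above, namely that the node is stopped and the database file path is known. The function names are illustrative.

```python
import sqlite3


def integrity_check(db_path):
    """Run PRAGMA integrity_check and return the list of reported messages."""
    # Open read-only so that the check cannot modify the file.
    conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
    try:
        rows = conn.execute("PRAGMA integrity_check;").fetchall()
    finally:
        conn.close()
    return [row[0] for row in rows]


def is_healthy(db_path):
    """Return True when SQLite reports exactly one row containing 'ok'."""
    try:
        return integrity_check(db_path) == ["ok"]
    except sqlite3.DatabaseError:
        # Raised when the file is missing or is not a readable SQLite database.
        return False
```

A script could call `is_healthy("/var/lib/rqlite/data/db.sqlite")` and only proceed to Method 1 or Method 2 when it returns `False`.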

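13.3 Cluster election problems

Problem: the cluster has no Leader

# Show the Raft state of every node
for port in 4001 4011 4021; do
    echo "--- Port $port ---"
    curl -s "localhost:$port/status" | python3 -c "
import json, sys
try:
    d = json.load(sys.stdin)
    store = d.get('store', {})
    # Raft fields are usually nested under store.raft; the layout can vary by version
    raft = store.get('raft', {})
    print(f'  Node ID: {store.get(\"node_id\")}')
    print(f'  State: {raft.get(\"state\", store.get(\"raft_state\"))}')
    print(f'  Term: {raft.get(\"term\", store.get(\"term\"))}')
    print(f'  Applied Index: {raft.get(\"applied_index\", store.get(\"applied_index\"))}')
    print(f'  Last Contact: {raft.get(\"last_contact\", store.get(\"last_contact\"))}')
except Exception:
    print('  Node unreachable')
" 2>/dev/null
done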

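Possible causes and fixes:

Cause	Symptom	Fix
Network partition	Some nodes are unreachable	Check network connectivity between nodes
Majority of nodes down	More than half of the nodes are unavailable	Bring back enough nodes to restore a majority (quorum)
Unsuitable election settings	Frequent election timeouts	Adjust the Raft heartbeat and election timeouts (`-raft-timeout`, `-raft-election-timeout`)
Disk full	Writes fail	Free up disk space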
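# Manually check that each node's Raft port accepts connections
# (this tests from the local host only; repeat on every node to cover all directions)
for raft_port in 4002 4012 4022; do
    echo "Testing localhost -> $raft_port"
    nc -z -w 2 localhost "$raft_port" && echo "  OK" || echo "  FAIL"
done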
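
Where `nc` is not installed, the same reachability test can be written with Python's standard `socket` module. This is a minimal sketch; the host names and ports passed in are whatever the cluster actually uses, and the function names are illustrative.

```python
import socket


def can_connect(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def check_raft_ports(targets, timeout=2.0):
    """Map 'host:port' to reachability for every (host, port) pair in `targets`."""
    return {f"{host}:{port}": can_connect(host, port, timeout) for host, port in targets}
```

For the local three-node layout in this chapter, `check_raft_ports([("localhost", 4002), ("localhost", 4012), ("localhost", 4022)])` returns one boolean per Raft port.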

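Problem: the Leader changes frequently

# Look for Leader changes in the recent logs
journalctl -u rqlited --since "1 hour ago" | grep -i "leader"

Cause	Fix
Network jitter	Check network quality; consider raising the Raft timeouts
Slow disk I/O	Check disk latency; use SSDs
High load	Optimize queries or add resources
GC pauses	Tune the Go garbage collector (for example `GOGC` or `GOMEMLIMIT`) or give the node more memory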

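13.4 Write failures

Problem: a write returns an error

# Test a write
curl -s -XPOST 'localhost:4001/db/execute' \
    -H 'Content-Type: application/json' \
    -d '[["INSERT INTO test VALUES (?, ?)", 1, "hello"]]' | python3 -m json.tool

HTTP status	Error message	Cause	Fix
200	`UNIQUE constraint failed`	Unique constraint violation, reported in the `error` field of the result	Check the data or use INSERT OR IGNORE
200	`no such table`	Table does not exist, reported in the `error` field of the result	Create the table first
401	`unauthorized`	Authentication failed	Check the username and password
301	`not the leader`	Node is not the Leader and redirects the client (newer versions usually forward the write to the Leader instead)	Follow the redirect or send the request to the Leader
500	`store is closed`	Storage engine is closed	Check the node state
503	`no leader`	Cluster has no Leader	Wait for the election to finish, then retry

Status codes and messages differ between rqlite versions, so always inspect the response body as well as the status code.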
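
Because SQL-level failures can arrive inside an otherwise successful HTTP response, a client should look at each statement result. The sketch below assumes the response body has a top-level `results` array whose entries carry an `error` field on failure; the function names and category labels are illustrative.

```python
import json


def statement_errors(body):
    """Return (index, message) pairs for every failed statement in a response body."""
    doc = json.loads(body)
    errors = []
    for i, result in enumerate(doc.get("results", [])):
        if "error" in result:
            errors.append((i, result["error"]))
    return errors


def classify(message):
    """Map an error message to a coarse category using substring matching."""
    lowered = message.lower()
    if "unique constraint failed" in lowered:
        return "constraint"
    if "no such table" in lowered:
        return "schema"
    return "other"
```

A caller would typically retry only on transport-level failures such as 503 and treat `constraint` and `schema` errors as permanent.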

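Problem: writes time out

# Test a write with a client-side timeout and report the status code and elapsed time
curl -s --max-time 30 -XPOST 'localhost:4001/db/execute' \
    -H 'Content-Type: application/json' \
    -d '[["INSERT INTO test VALUES (?, ?)", 1, "hello"]]' \
    -w "\nHTTP Code: %{http_code}\nTime: %{time_total}s\n"

Timeout cause	How to investigate	Fix
Network latency	`ping` and `traceroute`	Improve the network topology
Disk I/O	`iostat -x 1`	Use faster disks
Large transaction	Check the number of statements per request	Split into smaller batches
Lock contention	Check concurrent writes	Reduce concurrency or merge writes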
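
Splitting a large request into smaller batches is straightforward to do on the client. The sketch below is one way to do it; the batch size of 100 is an arbitrary starting point rather than a recommended value, and the function names are illustrative.

```python
import json


def split_batches(statements, batch_size=100):
    """Yield consecutive slices of `statements`, each at most `batch_size` long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(statements), batch_size):
        yield statements[start:start + batch_size]


def request_bodies(statements, batch_size=100):
    """Return one JSON request body per batch, ready to POST to /db/execute."""
    return [json.dumps(batch) for batch in split_batches(statements, batch_size)]
```

Each body can then be sent as a separate request, so a single slow batch no longer holds up the entire write.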

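13.5 Data consistency problems

Problem: a Follower returns stale data

# Write on the Leader
curl -s -XPOST 'localhost:4001/db/execute' \
    -H 'Content-Type: application/json' \
    -d '[["INSERT INTO test VALUES (?, ?)", 99, "new data"]]'

# Immediately read from a Follower with the none level
curl -s -G 'localhost:4011/db/query' \
    --data-urlencode 'q=SELECT * FROM test WHERE id=99' \
    --data-urlencode 'level=none'
# May return an empty result because the Follower has not caught up yet

# Use the strong level to guarantee an up-to-date read
curl -s -G 'localhost:4001/db/query' \
    --data-urlencode 'q=SELECT * FROM test WHERE id=99' \
    --data-urlencode 'level=strong'

Problem	Cause	Fix
Stale reads	The `none` consistency level is in use	Switch to `weak` or `strong`
Read-after-write inconsistency	Reading from a Follower with `none`	Use `strong` for reads that must see the preceding write
Write appears lost	The write was sent to a node on the minority side of a partition and was never committed	Check the write response for errors and retry through the Leader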
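
When queries are issued from application code, it helps to make the consistency level an explicit argument so that it cannot silently default to something unexpected. This is a minimal sketch using only the three levels named in this section; the function name and the default of `weak` are illustrative choices.

```python
from urllib.parse import urlencode

# Only the levels discussed in this chapter; other versions may accept more.
VALID_LEVELS = ("none", "weak", "strong")


def query_url(host, sql, level="weak"):
    """Build a /db/query URL with an explicit consistency level."""
    if level not in VALID_LEVELS:
        raise ValueError(f"unknown consistency level: {level!r}")
    return f"http://{host}/db/query?" + urlencode({"q": sql, "level": level})
```

A read that must observe a write just made would use `query_url("localhost:4001", "SELECT * FROM test WHERE id=99", level="strong")`.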

13.6 Performance problems

Problem: writes are slow

# Diagnostic script
echo "=== Write performance diagnostics ==="

# 1. Measure the latency of a single write
time curl -s -XPOST 'localhost:4001/db/execute' \
    -H 'Content-Type: application/json' \
    -d '[["INSERT INTO bench (data) VALUES (?)", "test"]]' > /dev/null

# 2. Measure the latency of a batch of 100 parameterized writes
stmts=$(python3 -c "
import json
stmts = []
for i in range(100):
    stmts.append([\"INSERT INTO bench (data) VALUES (?)\", f\"item-{i}\"])
print(json.dumps(stmts))
")

time curl -s -XPOST 'localhost:4001/db/execute' \
    -H 'Content-Type: application/json' \
    -d "$stmts" > /dev/null

# 3. Check disk I/O
iostat -x 1 3

# 4. Check the size of the Raft log
# (the file location depends on the rqlite version; adjust the path if needed)
ls -lh /var/lib/rqlite/data/raft/logs.db

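Problem: queries are slow

# 1. Inspect the query plan
curl -s -G 'localhost:4001/db/query' \
    --data-urlencode 'q=EXPLAIN QUERY PLAN SELECT * FROM orders WHERE user_id = 1' \
    | python3 -m json.tool

# 2. List the existing indexes
curl -s -G 'localhost:4001/db/query' \
    --data-urlencode "q=SELECT name, sql FROM sqlite_master WHERE type='index'"

# 3. Check the row count
curl -s -G 'localhost:4001/db/query' \
    --data-urlencode 'q=SELECT COUNT(*) FROM orders'

Performance problem	How to diagnose	Optimization
Query without an index	EXPLAIN QUERY PLAN	Create an index
SELECT *	Review the query	Select only the columns that are needed
Full scan of a large table	Review the WHERE clause	Use LIMIT and an index
Lock waits	Check concurrent operations	Reduce concurrent writes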
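
The effect of an index on the query plan can be reproduced locally with Python's built-in `sqlite3` module, without touching the cluster. The sketch below uses an in-memory database with a hypothetical `orders` table and index name, and prints the plan before and after the index is created.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return the human-readable detail column of EXPLAIN QUERY PLAN for `sql`."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # Each row is (id, parent, notused, detail).
    return [row[3] for row in rows]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL)")
sql = "SELECT * FROM orders WHERE user_id = ?"

before = query_plan(conn, sql, (1,))   # full table scan expected
conn.execute("CREATE INDEX idx_orders_user_id ON orders (user_id)")
after = query_plan(conn, sql, (1,))    # index search expected

print("before:", before)
print("after: ", after)
```

The exact wording of the plan differs between SQLite versions, but a plan that mentions SCAN before the index and USING INDEX afterwards confirms that the index is being picked up.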

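13.7 Network and connection problems

Problem: nodes cannot communicate with each other

# Check that the Raft port is reachable on every node
for node in node1 node2 node3; do
    echo "Testing $node Raft port..."
    nc -zv "$node" 4002
done

# Check firewall rules
sudo iptables -L -n | grep 400
sudo ufw status | grep 400

# Check DNS resolution
nslookup node1
nslookup node2

Problem: client connections are refused

# 1. Check that rqlite is listening
ss -tlnp | grep 4001

# 2. Check the number of open connections
curl -s 'localhost:4001/status' | python3 -c "
import json, sys
d = json.load(sys.stdin)
store = d.get('store', {})
# The field name may differ between rqlite versions
print('Open connections:', store.get('num_open_connections', 'n/a'))
"

# 3. Check system connection limits
ulimit -n
cat /proc/sys/net/core/somaxconn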

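13.8 Disk space problems

Problem: running out of disk space

# Check disk usage
df -h /var/lib/rqlite

# Check the size of the data directory
du -sh /var/lib/rqlite/data/*
du -sh /var/lib/rqlite/data/raft/*

# Check the size of the database and WAL files
ls -lh /var/lib/rqlite/data/db.sqlite*

File	Description	How to reclaim space
`db.sqlite`	Main database file	Must not be deleted directly
`db.sqlite-wal`	WAL file	Cleaned up automatically, or run a checkpoint
`raft/logs.db`	Raft log	Truncated automatically, or rebuild the node
`raft/snapshots/`	Snapshot directory	Old snapshots are removed automatically

# Run VACUUM to reclaim space (send it to the Leader)
curl -s -XPOST 'localhost:4001/db/execute' \
    -H 'Content-Type: application/json' \
    -d '[["VACUUM"]]'

# If the Raft log has grown too large, the node can be rebuilt
# 1. Remove the node from the cluster (node removal uses the DELETE method)
curl -XDELETE 'localhost:4001/remove' -d '{"id": "node3"}'
# 2. Clear the node's data
rm -rf /var/lib/rqlite/data3/*
# 3. Rejoin (the address format expected by -join depends on the rqlite version)
rqlited -node-id=node3 -join=http://localhost:4001 /var/lib/rqlite/data3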
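
Disk checks like these are easy to automate so that a warning is raised before writes start failing. This is a minimal sketch using only the standard library; the 90% threshold and the function names are illustrative assumptions.

```python
import os
import shutil


def dir_size_bytes(path):
    """Total size in bytes of all regular files under `path`."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            if os.path.islink(file_path):
                continue
            try:
                total += os.path.getsize(file_path)
            except OSError:
                # The file disappeared between listing and stat; skip it.
                pass
    return total


def disk_report(path, warn_ratio=0.9):
    """Summarize filesystem usage for `path` and flag low free space."""
    usage = shutil.disk_usage(path)
    used_ratio = usage.used / usage.total
    return {
        "used_ratio": used_ratio,
        "low_space": used_ratio >= warn_ratio,
        "dir_bytes": dir_size_bytes(path),
    }
```

Calling `disk_report("/var/lib/rqlite")` from a periodic job gives the same information as `df` and `du` in a form that is easy to alert on.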

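13.9 Common problems at a glance

Problem	Symptom	Quick fix
Node will not start	Process exits immediately	Check logs, ports, and permissions
No Leader	Writes return 503	Check that a majority of nodes is online
Write timeouts	Requests hang for a long time	Check network, disk, and batch size
Inconsistent data	Follower returns stale data	Use an appropriate consistency level
Node cannot join	Join request fails	Clear the data directory and retry
Disk full	Writes fail	Run VACUUM and clean up old snapshots
High memory usage	RSS keeps growing	Use `-on-disk` mode (where the rqlite version offers it)
Authentication failure	HTTP 401	Check auth.json and the authentication options
TLS errors	Handshake fails	Check the certificates and their SAN entries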

13.10 Diagnostic command reference

# Node status
curl -s 'localhost:4001/status?pretty'

# Cluster node list
curl -s 'localhost:4001/nodes?pretty'

# Leader check (/readyz returns 200 only when the node is ready and knows a Leader)
curl -s -o /dev/null -w "%{http_code}\n" 'localhost:4001/readyz'

# Readiness check for the node itself, ignoring Leader state
curl -s -o /dev/null -w "%{http_code}\n" 'localhost:4001/readyz?noleader'

# Process state
systemctl status rqlited

# Recent logs
journalctl -u rqlited --since "10 minutes ago" --no-pager

# Data directory size
du -sh /var/lib/rqlite/data/*

# Port check
ss -tlnp | grep 400

# Disk space
df -h /var/lib/rqlite

# SQLite integrity (run with the node stopped)
sqlite3 /var/lib/rqlite/data/db.sqlite "PRAGMA integrity_check;"
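
The output of the node list command can be reduced to a one-line health summary. The sketch below assumes the response is a JSON object keyed by node ID, with boolean `leader` and `reachable` fields on each entry; that layout is an assumption and may differ between rqlite versions. The function name is illustrative.

```python
import json


def summarize_nodes(body):
    """Return the Leader's node ID (or None) and the sorted list of unreachable nodes."""
    nodes = json.loads(body)
    leader = None
    unreachable = []
    for node_id, info in nodes.items():
        if info.get("leader"):
            leader = node_id
        if not info.get("reachable"):
            unreachable.append(node_id)
    return {"leader": leader, "unreachable": sorted(unreachable)}
```

Piping the response of the `/nodes` endpoint into this function gives a quick answer to the first two questions of the troubleshooting workflow: which nodes are online, and whether there is a Leader.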

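13.11 Chapter summary

Topic	Key points
Troubleshooting workflow	Node status → cluster status → writes and reads → logs
Most common problems	Node offline, no Leader, ports unreachable
Write failures	Check authentication, Leader state, and the error message
Performance problems	Indexes, batching, consistency level
Data consistency	Choose an appropriate consistency level
Prevention	Monitoring, backups, regular health checks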
