强曰为道
与天地相似,故不违。知周乎万物,而道济天下,故不过。旁行而不流,乐天知命,故不忧.
文档目录

AgensGraph 完全指南 / 第 14 章:故障排查与常见问题

第 14 章:故障排查与常见问题

14.1 故障排查方法论

14.1.1 排查五步法

故障排查流程:

  1. 收集信息 → 错误消息、日志、监控指标
       │
  2. 定位问题 → 确定故障范围和影响
       │
  3. 分析原因 → 根因分析(Root Cause Analysis)
       │
  4. 实施修复 → 应用解决方案
       │
  5. 验证恢复 → 确认问题已解决
       │
  6. 总结复盘 → 记录经验教训

14.1.2 常见问题分类

分类典型问题紧急程度
连接问题连接被拒绝、连接超时🔴 高
语法错误Cypher/SQL 语法错误🟡 中
性能问题慢查询、内存溢出🟠 高
数据问题数据丢失、不一致🔴 高
存储问题磁盘满、WAL 堆积🔴 高
复制问题主从延迟、复制断开🟠 高
锁问题死锁、锁等待超时🟠 高

14.2 连接问题

14.2.1 连接被拒绝

错误消息:
  psql: error: connection to server at "localhost" (127.0.0.1), port 5432 failed:
  Connection refused
  Is the server running on that host and accepting TCP/IP connections?

排查步骤

# 1. 检查服务是否运行
ps aux | grep agens
# 或
pg_isready -h localhost -p 5432

# 2. 检查端口监听
netstat -tlnp | grep 5432
# 或
ss -tlnp | grep 5432

# 3. 检查日志
tail -50 $PGDATA/log/agens-*.log

# 4. 检查配置(listen_addresses)
grep listen_addresses $PGDATA/agens.conf

常见原因与解决方案

原因解决方案
服务未启动ag_ctl -D $PGDATA start
端口配置错误检查 agens.conf 中的 port
listen_addresses 未配置设置为 '*' 或具体 IP
防火墙阻拦开放 5432 端口
Docker 端口未映射检查 -p 5432:5432

14.2.2 认证失败

错误消息:
  FATAL: password authentication failed for user "agens"
  FATAL: no pg_hba.conf entry for host "192.168.1.100", user "agens", database "agens"

排查与解决

# 查看 pg_hba.conf 配置
cat $PGDATA/pg_hba.conf

# 确保允许来自客户端 IP 的连接
# 格式: TYPE DATABASE USER ADDRESS METHOD
# host  all      all  0.0.0.0/0  md5

# 重新加载配置
ag_ctl -D $PGDATA reload

14.2.3 连接数超限

错误消息:
  FATAL: sorry, too many clients already
-- 查看当前连接数
SELECT count(*) FROM pg_stat_activity;

-- 查看最大连接数
SHOW max_connections;

-- 查看各状态的连接数
SELECT state, count(*)
FROM pg_stat_activity
GROUP BY state;

-- 终止空闲连接
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle'
  AND query_start < now() - interval '10 minutes';

-- 或终止所有非自身连接
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE pid != pg_backend_pid()
  AND datname = current_database();

14.3 Cypher 语法错误

14.3.1 常见语法错误

错误 1: graph_path 未设置
  ERROR: graph_path is not set
  解决: SET graph_path = your_graph_name;

错误 2: 标签不存在
  ERROR: label "Personn" does not exist
  (注意拼写错误: Personn → Person)

错误 3: 未绑定变量
  ERROR: variable 'p' is not defined
  解决: 确保 MATCH 中定义了变量

错误 4: 语法解析失败
  ERROR: syntax error at or near "RETRUN"
  (注意拼写错误: RETRUN → RETURN)

14.3.2 MERGE 陷阱

-- 陷阱:MERGE 创建了意外的重复数据
-- 原因:MERGE 检查整个模式是否完全匹配

-- ❌ 错误用法(可能创建重复的 Alice)
MATCH (a:Person {name: 'Alice'})
MERGE (a)-[:KNOWS]->(b:Person {name: 'Bob'})
-- 如果 Alice 有多条记录,每条都会创建一个到 Bob 的关系

-- ✅ 正确用法
MERGE (a:Person {name: 'Alice'})
MERGE (b:Person {name: 'Bob'})
MERGE (a)-[:KNOWS]->(b)
-- 每个 MERGE 独立检查,确保不重复

14.4 性能问题

14.4.1 慢查询诊断

-- 查找当前运行的慢查询
SELECT
    pid,
    now() - query_start AS duration,
    state,
    wait_event_type,
    wait_event,
    left(query, 100) AS query_preview
FROM pg_stat_activity
WHERE state = 'active'
  AND query_start < now() - interval '5 seconds'
ORDER BY duration DESC;
-- 分析慢查询的执行计划
EXPLAIN (ANALYZE, BUFFERS, VERBOSE)
MATCH (p:Person)-[:KNOWS*1..3]->(fof:Person)
WHERE p.name = 'Alice'
RETURN DISTINCT fof.name;

14.4.2 常见性能问题清单

症状可能原因解决方案
全表扫描 (Seq Scan)缺少索引创建合适的索引
排序溢出磁盘work_mem 不足增大 work_mem
缓冲区命中率低shared_buffers 不足增大缓冲区
CPU 使用率高查询效率低优化查询 / 加索引
I/O 等待高数据不在缓存增加内存 / SSD
锁等待并发冲突优化事务粒度
OOM Killer内存超限降低连接数或 work_mem

14.4.3 内存问题排查

# 查看系统内存使用
free -h

# 查看 AgensGraph 进程内存
ps aux | grep agens | sort -k6 -rn

# 查看操作系统日志中的 OOM 事件
dmesg | grep -i oom
journalctl -k | grep -i oom
-- 查看共享内存使用
SELECT
    setting AS shared_buffers,
    pg_size_pretty(setting::bigint * 8192) AS shared_buffers_size
FROM pg_settings
WHERE name = 'shared_buffers';

-- 查看临时文件使用
SELECT
    datname,
    temp_files,
    pg_size_pretty(temp_bytes) AS temp_bytes
FROM pg_stat_database
WHERE datname = current_database();

14.5 存储问题

14.5.1 磁盘空间不足

# 检查磁盘使用率
df -h

# 检查数据目录大小
du -sh $PGDATA/
du -sh $PGDATA/base/
du -sh $PGDATA/pg_wal/
-- 查看数据库大小
SELECT
    datname,
    pg_size_pretty(pg_database_size(datname)) AS size
FROM pg_database
ORDER BY pg_database_size(datname) DESC;

-- 查看表大小
SELECT
    schemaname,
    relname,
    pg_size_pretty(pg_total_relation_size(relid)) AS total_size,
    pg_size_pretty(pg_relation_size(relid)) AS table_size,
    pg_size_pretty(pg_indexes_size(relid)) AS index_size
FROM pg_stat_user_tables
ORDER BY pg_total_relation_size(relid) DESC
LIMIT 20;

-- 查看 WAL 大小
SELECT
    count(*) AS wal_files,
    pg_size_pretty(sum(size)) AS total_wal_size
FROM pg_ls_waldir();

14.5.2 清理磁盘空间

-- 手动 VACUUM(清理死元组)
VACUUM (VERBOSE, ANALYZE) person;

-- VACUUM FULL(回收空间,需要排他锁)
VACUMM FULL person;

-- 清理无用的 WAL 文件
-- PostgreSQL 自动管理,但检查点配置影响 WAL 积累
CHECKPOINT;

-- 删除不需要的表
DROP TABLE IF EXISTS old_data;

14.5.3 WAL 堆积问题

-- 检查复制槽(可能导致 WAL 堆积)
SELECT
    slot_name,
    slot_type,
    active,
    pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;

-- 删除不活跃的复制槽
-- SELECT pg_drop_replication_slot('inactive_slot_name');

-- 检查长事务(可能阻止 WAL 清理)
SELECT
    pid,
    usename,
    now() - xact_start AS xact_duration,
    now() - query_start AS query_duration,
    state,
    left(query, 80) AS query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start;

14.6 锁与并发问题

14.6.1 锁等待诊断

-- 查看锁等待关系
SELECT
    blocked.pid AS blocked_pid,
    blocked.usename AS blocked_user,
    blocking.pid AS blocking_pid,
    blocking.usename AS blocking_user,
    blocked.query AS blocked_query,
    blocking.query AS blocking_query,
    now() - blocked.query_start AS blocked_duration
FROM pg_stat_activity blocked
JOIN pg_locks bl ON bl.pid = blocked.pid AND NOT bl.granted
JOIN pg_locks kl ON kl.locktype = bl.locktype
    AND kl.database IS NOT DISTINCT FROM bl.database
    AND kl.relation IS NOT DISTINCT FROM bl.relation
    AND kl.page IS NOT DISTINCT FROM bl.page
    AND kl.tuple IS NOT DISTINCT FROM bl.tuple
    AND kl.transactionid IS NOT DISTINCT FROM bl.transactionid
    AND kl.pid != bl.pid
    AND kl.granted
JOIN pg_stat_activity blocking ON blocking.pid = kl.pid;

14.6.2 死锁排查

-- 查看死锁相关配置
SHOW deadlock_timeout;   -- 默认 1 秒
SHOW log_lock_waits;     -- 是否记录锁等待

-- 开启锁等待日志
ALTER SYSTEM SET log_lock_waits = on;
ALTER SYSTEM SET deadlock_timeout = '1s';
SELECT pg_reload_conf();

-- 之后在日志中搜索死锁信息
-- grep "deadlock detected" $PGDATA/log/*.log

14.6.3 长事务处理

-- 查找长事务
SELECT
    pid,
    usename,
    application_name,
    now() - xact_start AS xact_duration,
    state,
    left(query, 100) AS current_query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
  AND state != 'idle'
ORDER BY xact_start
LIMIT 10;

-- 终止超时事务(谨慎操作!)
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE xact_start < now() - interval '30 minutes'
  AND pid != pg_backend_pid();

14.7 数据一致性问题

14.7.1 数据损坏检测

-- 使用 amcheck 扩展检查索引一致性
CREATE EXTENSION IF NOT EXISTS amcheck;

-- 检查 B-Tree 索引
SELECT bt_index_check(c.oid)
FROM pg_class c
JOIN pg_index i ON i.indexrelid = c.oid
WHERE c.relkind = 'i'
  AND c.relname LIKE '%person%';

-- 检查页面校验和
-- 需要在初始化时开启 data_checksums
SHOW data_checksums;

14.7.2 恢复到某个时间点

# PITR (Point-in-Time Recovery)
# 1. 停止服务
ag_ctl -D $PGDATA stop

# 2. 恢复基础备份
rm -rf $PGDATA/*
cp -r /backup/base/* $PGDATA/

# 3. 配置恢复
cat > $PGDATA/postgresql.auto.conf <<EOF
restore_command = 'cp /archive/%f %p'
recovery_target_time = '2024-01-15 10:30:00'
recovery_target_action = 'promote'
EOF

# 4. 创建恢复信号文件
touch $PGDATA/recovery.signal

# 5. 启动恢复
ag_ctl -D $PGDATA start

14.8 日志分析

14.8.1 日志配置

# agens.conf 日志配置
logging_collector = on
log_directory = 'log'
log_filename = 'agens-%Y-%m-%d.log'
log_rotation_age = 1d
log_rotation_size = 100MB

# 日志详细程度
log_min_messages = warning        # 最低记录级别
log_min_error_statement = error   # 记录错误语句
log_statement = 'ddl'             # 记录 DDL 语句
log_min_duration_statement = 1000 # 记录超过 1 秒的语句

# 日志格式
log_line_prefix = '%t [%p-%l] %q%u@%d '
log_lock_waits = on
log_temp_files = 0

14.8.2 常用日志分析命令

# 查找错误信息
grep -i "error\|fatal\|panic" $PGDATA/log/agens-$(date +%Y-%m-%d).log

# 查找慢查询
grep "duration:" $PGDATA/log/agens-*.log | sort -t: -k2 -rn | head -20

# 查找死锁
grep "deadlock" $PGDATA/log/agens-*.log

# 统计错误频率
grep -c "ERROR" $PGDATA/log/agens-*.log

# 查找特定查询
grep "MATCH.*Person" $PGDATA/log/agens-*.log

14.8.3 结构化日志分析

# 使用 pgBadger 分析日志
# 安装
sudo apt-get install pgbadger

# 生成分析报告
pgbadger $PGDATA/log/agens-*.log -o report.html

# 按时间段分析
pgbadger -b "2024-01-15 00:00:00" -e "2024-01-15 23:59:59" \
  $PGDATA/log/agens-*.log -o daily_report.html

14.9 复制问题

14.9.1 复制延迟诊断

-- 主库:查看复制状态
SELECT
    client_addr,
    state,
    sent_lsn,
    write_lsn,
    flush_lsn,
    replay_lsn,
    pg_wal_lsn_diff(sent_lsn, replay_lsn) AS replication_lag_bytes,
    replay_lag
FROM pg_stat_replication;

-- 备库:查看恢复状态
SELECT
    pg_is_in_recovery() AS is_replica,
    pg_last_wal_receive_lsn() AS last_receive,
    pg_last_wal_replay_lsn() AS last_replay,
    pg_last_xact_replay_timestamp() AS last_replay_time;

14.9.2 复制断开恢复

# 备库重新同步(重建备库)
# 1. 停止备库
ag_ctl -D $PGDATA stop

# 2. 使用 pg_basebackup 重新同步
rm -rf $PGDATA/*
pg_basebackup -h primary_host -U replicator -D $PGDATA -Fp -Xs -P

# 3. 配置 standby.signal
touch $PGDATA/standby.signal

# 4. 启动备库
ag_ctl -D $PGDATA start

14.10 诊断脚本

14.10.1 健康检查脚本

#!/bin/bash
# health_check.sh - AgensGraph 健康检查

echo "=== AgensGraph Health Check ==="
echo "Timestamp: $(date)"
echo ""

# 1. 服务状态
echo "--- Service Status ---"
if pg_isready -h localhost -p 5432 > /dev/null 2>&1; then
    echo "✅ Service: Running"
else
    echo "❌ Service: Down"
    exit 1
fi

# 2. 连接状态
echo ""
echo "--- Connections ---"
psql -U agens -t -c "
SELECT
    'Total: ' || count(*),
    'Active: ' || count(*) FILTER (WHERE state = 'active'),
    'Idle: ' || count(*) FILTER (WHERE state = 'idle')
FROM pg_stat_activity;
"

# 3. 数据库大小
echo ""
echo "--- Database Size ---"
psql -U agens -t -c "
SELECT pg_size_pretty(pg_database_size(current_database()));
"

# 4. 缓冲区命中率
echo ""
echo "--- Buffer Hit Ratio ---"
psql -U agens -t -c "
SELECT round(100.0 * sum(blks_hit) / (sum(blks_hit) + sum(blks_read)), 2) || '%'
FROM pg_stat_database WHERE datname = current_database();
"

# 5. 长事务
echo ""
echo "--- Long Transactions ---"
psql -U agens -t -c "
SELECT count(*) || ' transactions running > 5 min'
FROM pg_stat_activity
WHERE xact_start < now() - interval '5 minutes'
  AND state = 'active';
"

# 6. 锁等待
echo ""
echo "--- Lock Waits ---"
psql -U agens -t -c "
SELECT count(*) || ' sessions waiting for locks'
FROM pg_locks WHERE NOT granted;
"

# 7. 磁盘使用
echo ""
echo "--- Disk Usage ---"
df -h / | tail -1 | awk '{print "Usage: " $5 " of " $2}'

echo ""
echo "=== Health Check Complete ==="

14.11 本章小结

问题类型关键诊断命令预防措施
连接问题pg_isready, pg_stat_activity正确配置 pg_hba.conf
语法错误仔细阅读错误信息使用 IDE 语法检查
性能问题EXPLAIN ANALYZE, pg_stat_statements索引 + 参数调优
存储问题df, pg_database_size监控磁盘使用
锁问题pg_locks, pg_stat_activity缩短事务、固定加锁顺序
复制问题pg_stat_replication监控复制延迟

14.12 练习

  1. 编写一个完整的健康检查脚本并设置定时运行。
  2. 模拟一个锁等待场景,使用诊断查询定位阻塞源。
  3. 分析慢查询日志,找出 Top 5 最耗时的查询并优化。
  4. 编写一个自动清理空闲连接的脚本。

14.13 扩展阅读