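Chapter 09: High Availability and Failover
9.1 Why High Availability Matters
DNS and DHCP are core network infrastructure services:
| Service | Impact of failure |
|---|---|
| DNS | All name resolution fails, so internet access is effectively down |
| DHCP | New devices cannot obtain an IP address, and existing devices drop off the network when their leases expire |
SLA targets:
| Scenario | Availability target | Allowed downtime per year |
|---|---|---|
| Home network | 99% | 3.65 days |
| Small business | 99.9% | 8.76 hours |
| Medium-sized business | 99.99% | 52.56 minutes |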
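The downtime figures follow from the availability percentage: allowed downtime = (100% - availability) x one year. A minimal sketch of that arithmetic, assuming a 365-day year; the function name `downtime_minutes_per_year` is made up for illustration:

```bash
#!/bin/bash
# Allowed downtime per year, in minutes, for a given availability percentage.
# Assumes a 365-day year (525600 minutes).
downtime_minutes_per_year() {
    LC_ALL=C awk -v pct="$1" 'BEGIN { printf "%.2f\n", (100 - pct) / 100 * 365 * 24 * 60 }'
}

downtime_minutes_per_year 99      # home network
downtime_minutes_per_year 99.9    # small business
downtime_minutes_per_year 99.99   # medium-sized business
```

The three calls print 5256.00, 525.60 and 52.56 minutes, which correspond to the 3.65 days, 8.76 hours and 52.56 minutes in the table.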
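9.2 High-Availability Architectures
Option 1: Active-standby with Keepalived
┌────────────────┐               ┌────────────────┐
│ Dnsmasq master │←──heartbeat──→│ Dnsmasq backup │
│  192.168.1.1   │               │  192.168.1.2   │
└───────┬────────┘               └───────┬────────┘
        │                                │
        └───────────────┬────────────────┘
                        │
               VIP: 192.168.1.254
                        │
                 ┌──────┴──────┐
                 │   Clients   │
                 └─────────────┘
Option 2: Active-active with load balancing
Client DNS configuration:
nameserver 192.168.1.1
nameserver 192.168.1.2
Both Dnsmasq instances serve queries at the same time. By default the client resolver tries the servers in order and falls back to the second one when the first times out; adding `options rotate` to resolv.conf spreads queries across both.
Option 3: DHCP split scope
Primary server: 192.168.1.100 - 192.168.1.149 (50%)
Secondary server: 192.168.1.150 - 192.168.1.199 (50%)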
9.3 Deploying Keepalived
9.3.1 Installing Keepalived
# Debian/Ubuntu
sudo apt install keepalived
# CentOS/RHEL
sudo yum install keepalived
# Verify the installation
keepalived --version
9.3.2 Keepalived Configuration on the Master Node
# /etc/keepalived/keepalived.conf (master node)
global_defs {
    router_id DNS_MASTER
    script_user root
    enable_script_security
}
vrrp_script chk_dnsmasq {
    script "/usr/bin/killall -0 dnsmasq"
    interval 2    # check every 2 seconds
    weight -20    # lower the priority by 20 while the check is failing
    fall 3        # 3 consecutive failures mark the check as failed
    rise 2        # 2 consecutive successes mark it as recovered
}
vrrp_instance VI_DNS {
    state MASTER
    interface eth1
    virtual_router_id 51
    priority 100
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass dns_ha_secret    # Keepalived uses only the first 8 characters
    }
    virtual_ipaddress {
        192.168.1.254/24 dev eth1
    }
    track_script {
        chk_dnsmasq
    }
    # Scripts run on state transitions
    notify_master "/etc/keepalived/scripts/notify.sh MASTER"
    notify_backup "/etc/keepalived/scripts/notify.sh BACKUP"
    notify_fault "/etc/keepalived/scripts/notify.sh FAULT"
}
9.3.3 Keepalived Configuration on the Backup Node
# /etc/keepalived/keepalived.conf (backup node)
global_defs {
    router_id DNS_BACKUP
    script_user root
    enable_script_security
}
vrrp_script chk_dnsmasq {
    script "/usr/bin/killall -0 dnsmasq"
    interval 2
    weight -20
    fall 3
    rise 2
}
vrrp_instance VI_DNS {
    state BACKUP
    interface eth1
    virtual_router_id 51    # must match the master
    priority 90             # lower than the master
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass dns_ha_secret
    }
    virtual_ipaddress {
        192.168.1.254/24 dev eth1
    }
    track_script {
        chk_dnsmasq
    }
    notify_master "/etc/keepalived/scripts/notify.sh MASTER"
    notify_backup "/etc/keepalived/scripts/notify.sh BACKUP"
    notify_fault "/etc/keepalived/scripts/notify.sh FAULT"
}
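How the `weight -20` check interacts with the two priorities is easy to misread, so here is a small sketch of the arithmetic. It assumes Keepalived's behavior for a negative weight (the priority is reduced while the tracked script is failing) and that the node with the higher effective priority holds the VIP. The function names `effective_priority` and `vip_holder` are hypothetical and exist only for this illustration.

```bash
#!/bin/bash
# effective_priority BASE WEIGHT CHECK_OK
# Models a negative weight only: the weight is added to the base priority
# while the tracked script is failing (CHECK_OK=0).
effective_priority() {
    local base=$1 weight=$2 check_ok=$3
    if [ "$check_ok" -eq 0 ] && [ "$weight" -lt 0 ]; then
        echo $(( base + weight ))
    else
        echo "$base"
    fi
}

# vip_holder MASTER_CHECK_OK BACKUP_CHECK_OK
# Uses the values from the two configs: priorities 100 and 90, weight -20.
# Ties are not modeled; they cannot occur with these numbers.
vip_holder() {
    local m b
    m=$(effective_priority 100 -20 "$1")
    b=$(effective_priority 90 -20 "$2")
    if [ "$m" -ge "$b" ]; then
        echo "master ($m vs $b)"
    else
        echo "backup ($b vs $m)"
    fi
}

vip_holder 1 1   # both healthy
vip_holder 0 1   # dnsmasq down on the master
vip_holder 0 0   # dnsmasq down on both nodes
```

The second call shows why the weight has to be larger than the priority gap (20 > 10): with a smaller weight, a failed check on the master would not hand the VIP to the backup.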
9.3.4 State-Change Notification Script
sudo mkdir -p /etc/keepalived/scripts
sudo tee /etc/keepalived/scripts/notify.sh > /dev/null <<'SCRIPT'
#!/bin/bash
STATE=$1
TIMESTAMP=$(date '+%Y-%m-%d %H:%M:%S')
LOGFILE="/var/log/keepalived-state.log"
echo "[$TIMESTAMP] State changed to: $STATE" >> "$LOGFILE"
case $STATE in
    MASTER)
        # Became master: make sure Dnsmasq is running
        systemctl start dnsmasq
        # Send an alert (email/DingTalk/Slack)
        # curl -X POST "https://hooks.slack.com/..." -d '{"text":"DNS Master activated"}'
        ;;
    BACKUP)
        # Became backup: keep Dnsmasq running on standby
        systemctl start dnsmasq
        ;;
    FAULT)
        # Fault state: write a log entry
        logger -t keepalived "DNS HA entered FAULT state"
        ;;
esac
SCRIPT
sudo chmod +x /etc/keepalived/scripts/notify.sh
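9.3.5 Synchronizing Dnsmasq Configuration Between the Two Nodes
# The Dnsmasq configuration on the two servers must be kept identical
# Method 1: manual sync (simple, but not real-time)
rsync -avz /etc/dnsmasq.d/ backup-server:/etc/dnsmasq.d/
# Method 2: automatic sync with inotifywait
sudo apt install inotify-tools
#!/bin/bash
# /usr/local/bin/sync-dnsmasq.sh
REMOTE="backup-server"
inotifywait -m -r -e modify,create,delete /etc/dnsmasq.d/ |
while read -r path action file; do
    # --delete propagates file removals to the backup as well
    rsync -avz --delete /etc/dnsmasq.d/ "$REMOTE":/etc/dnsmasq.d/
    # Dnsmasq does not re-read its configuration on reload (SIGHUP), so restart it
    ssh "$REMOTE" "systemctl restart dnsmasq"
done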
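Since both nodes are expected to carry the same configuration, it can help to verify that after a sync. A minimal sketch, assuming GNU `find`, `sort`, `xargs` and `sha256sum` are available; `config_fingerprint` is a hypothetical helper that hashes the relative paths and contents of a directory so that two trees can be compared with a single string. The two temporary directories stand in for the two servers.

```bash
#!/bin/bash
# config_fingerprint DIR
# Prints one SHA-256 digest covering the relative paths and contents of all
# regular files under DIR. Identical trees give identical digests.
config_fingerprint() {
    local dir=$1
    ( cd "$dir" &&
      find . -type f -print0 | LC_ALL=C sort -z |
      xargs -0 -r sha256sum | sha256sum | cut -d' ' -f1 )
}

# Demo: two temporary directories standing in for the two servers
a=$(mktemp -d); b=$(mktemp -d)
printf 'server=8.8.8.8\n' > "$a/upstream.conf"
printf 'server=8.8.8.8\n' > "$b/upstream.conf"
if [ "$(config_fingerprint "$a")" = "$(config_fingerprint "$b")" ]; then
    echo "in sync"
fi
printf 'server=1.1.1.1\n' >> "$b/upstream.conf"
if [ "$(config_fingerprint "$a")" != "$(config_fingerprint "$b")" ]; then
    echo "drift detected"
fi
rm -rf "$a" "$b"
```

In a real deployment the second fingerprint would have to come from the other node, for instance by running the same function there over SSH and comparing the two strings.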
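9.3.6 Synchronizing DHCP Leases
# The DHCP lease file should also be kept in sync to avoid address conflicts after a failover
# Method 1: shared storage (NFS)
# Both servers mount the same NFS directory and keep the lease file there.
# Caution: each running Dnsmasq rewrites the file from its own in-memory state,
# so only one instance should be serving DHCP from it at any time.
# dhcp-leasefile=/nfs/shared/dnsmasq.leases
# Method 2: non-overlapping address ranges + static bindings
# Primary server uses 192.168.1.100-149
# Secondary server uses 192.168.1.150-199
# The ranges do not overlap, so the two servers cannot hand out the same address
# Method 3: lease file sync script
#!/bin/bash
# /usr/local/bin/sync-leases.sh
REMOTE="backup-server"
scp /var/lib/misc/dnsmasq.leases "$REMOTE":/var/lib/misc/dnsmasq.leases
# Dnsmasq reads the lease file only at startup, so restart rather than reload
ssh "$REMOTE" "systemctl restart dnsmasq"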
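To see whether the lease files of two nodes disagree, they can be compared offline. The sketch below assumes the usual dnsmasq lease line layout (expiry time, MAC address, IP address, hostname, client ID, separated by spaces) and reports IPv4 addresses that are leased to different MAC addresses on the two nodes. `lease_conflicts` is a hypothetical helper, not part of dnsmasq, and the lease lines are made-up sample data.

```bash
#!/bin/bash
# lease_conflicts FILE_A FILE_B
# Prints "IP MAC_A MAC_B" for every address that appears in both lease files
# with different MAC addresses.
lease_conflicts() {
    awk '
        NF < 3 { next }                              # skip lines that are not leases
        FILENAME == ARGV[1] { mac[$3] = $2; next }   # first file: remember MAC per IP
        ($3 in mac) && mac[$3] != $2 { print $3, mac[$3], $2 }
    ' "$1" "$2"
}

# Demo with sample data
a=$(mktemp); b=$(mktemp)
cat > "$a" <<'EOF'
1700000000 aa:bb:cc:00:00:01 192.168.1.100 laptop 01:aa:bb:cc:00:00:01
1700000000 aa:bb:cc:00:00:02 192.168.1.101 phone *
EOF
cat > "$b" <<'EOF'
1700000500 aa:bb:cc:00:00:01 192.168.1.100 laptop 01:aa:bb:cc:00:00:01
1700000500 aa:bb:cc:00:00:09 192.168.1.101 tablet *
EOF
lease_conflicts "$a" "$b"
rm -f "$a" "$b"
```

With the sample data, only 192.168.1.101 is reported, because it is held by different MAC addresses in the two files.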
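9.4 Active-Active DNS Configuration
9.4.1 Simple Active-Active (Two DNS Servers Configured on Clients)
# No Keepalived needed: clients are configured with both DNS servers directly
# Primary server (192.168.1.1)
listen-address=192.168.1.1
bind-interfaces
# Secondary server (192.168.1.2)
listen-address=192.168.1.2
bind-interfaces
# Hand out both DNS servers via DHCP
dhcp-option=option:dns-server,192.168.1.1,192.168.1.2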
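9.4.2 Data Consistency for Active-Active DNS
# Apart from per-host settings such as listen-address, both servers should use the same:
# - /etc/dnsmasq.hosts (local records)
# - /etc/dnsmasq.conf + /etc/dnsmasq.d/ (configuration files)
# - upstream DNS settings
# Sync options
# Option A: manage the configuration in a Git repository
# /etc/dnsmasq.d/ is a Git working copy; both servers pull the latest configuration
# (restart rather than reload: Dnsmasq does not re-read its configuration on SIGHUP)
cd /etc/dnsmasq.d && git pull && systemctl restart dnsmasq
# Option B: share the configuration over NFS
# mount -t nfs config-server:/export/dnsmasq /etc/dnsmasq.d
# Option C: manage both servers with Ansible
# ansible-playbook deploy-dnsmasq.yml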
9.4.3 DHCP Split Scope
# Primary server: first half of the address pool
# /etc/dnsmasq.d/dhcp-primary.conf
interface=eth1
dhcp-range=set:primary,192.168.1.100,192.168.1.149,255.255.255.0,24h
dhcp-option=tag:primary,option:router,192.168.1.1
dhcp-option=tag:primary,option:dns-server,192.168.1.1,192.168.1.2
# dhcp-authoritative is left disabled: the dnsmasq man page reserves it for
# networks where dnsmasq is the only DHCP server
# Secondary server: second half of the address pool
# /etc/dnsmasq.d/dhcp-secondary.conf
interface=eth1
dhcp-range=set:secondary,192.168.1.150,192.168.1.199,255.255.255.0,24h
dhcp-option=tag:secondary,option:router,192.168.1.1
dhcp-option=tag:secondary,option:dns-server,192.168.1.1,192.168.1.2
# dhcp-authoritative is left disabled here as well, for the same reason
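A split scope only works if the two pools never overlap. A minimal sketch that checks this before deploying; `ip_to_int` and `ranges_overlap` are hypothetical helpers, and the addresses are the ones from the two `dhcp-range` lines above.

```bash
#!/bin/bash
# Convert a dotted-quad IPv4 address to an integer so ranges can be compared.
ip_to_int() {
    local a b c d
    IFS=. read -r a b c d <<< "$1"
    echo $(( (a << 24) + (b << 16) + (c << 8) + d ))
}

# ranges_overlap START1 END1 START2 END2
# Returns 0 (true) if the two inclusive ranges share at least one address.
ranges_overlap() {
    local s1 e1 s2 e2
    s1=$(ip_to_int "$1"); e1=$(ip_to_int "$2")
    s2=$(ip_to_int "$3"); e2=$(ip_to_int "$4")
    [ "$s1" -le "$e2" ] && [ "$s2" -le "$e1" ]
}

if ranges_overlap 192.168.1.100 192.168.1.149 192.168.1.150 192.168.1.199; then
    echo "pools overlap"
else
    echo "pools are disjoint"
fi
```

For the two pools shown, the check reports that they are disjoint; moving the end of the first pool to 192.168.1.150 would make it report an overlap.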
9.5 Health Check Scripts
9.5.1 DNS Health Check
#!/bin/bash
# /usr/local/bin/check-dns-health.sh
DOMAIN="www.baidu.com"
DNS_SERVER="127.0.0.1"
MAX_RETRIES=3
TIMEOUT=5
for i in $(seq 1 "$MAX_RETRIES"); do
    # +tries=1 stops dig from retrying internally; this loop handles the retries.
    # dig exits non-zero when the server does not reply, so check the exit status too.
    if result=$(dig @"$DNS_SERVER" "$DOMAIN" +short +time="$TIMEOUT" +tries=1 2>/dev/null) \
        && [ -n "$result" ]; then
        echo "DNS OK: $DOMAIN resolved to $result"
        exit 0
    fi
    sleep 1
done
echo "DNS CRITICAL: Failed to resolve $DOMAIN after $MAX_RETRIES attempts"
exit 2
9.5.2 DHCP Health Check
#!/bin/bash
# /usr/local/bin/check-dhcp-health.sh
# Check the Dnsmasq process
if ! pidof dnsmasq > /dev/null; then
    echo "CRITICAL: dnsmasq process not running"
    exit 2
fi
# Check that the DHCP port is listening
if ! ss -uln | grep -q ":67 "; then
    echo "CRITICAL: DHCP port 67 not listening"
    exit 2
fi
# Check the lease file
LEASE_FILE="/var/lib/misc/dnsmasq.leases"
if [ ! -f "$LEASE_FILE" ]; then
    echo "WARNING: Lease file missing"
    exit 1
fi
# Count the leases currently on file
LEASE_COUNT=$(wc -l < "$LEASE_FILE")
echo "OK: dnsmasq running, $LEASE_COUNT active leases"
exit 0
9.5.3 Combined Monitoring Script
#!/bin/bash
# /usr/local/bin/monitor-dnsmasq.sh
LOG="/var/log/dnsmasq-monitor.log"
ALERT_EMAIL="admin@example.com"   # placeholder address, replace with your own
SLACK_WEBHOOK="https://hooks.slack.com/services/xxx"
send_alert() {
    local message="$1"
    local timestamp
    timestamp=$(date '+%Y-%m-%d %H:%M:%S')
    echo "[$timestamp] ALERT: $message" >> "$LOG"
    # Email notification
    echo "$message" | mail -s "Dnsmasq Alert" "$ALERT_EMAIL" 2>/dev/null
    # Slack notification
    curl -s -X POST "$SLACK_WEBHOOK" \
        -d "{\"text\":\"🚨 Dnsmasq Alert: $message\"}" \
        -H 'Content-Type: application/json' > /dev/null 2>&1
}
# Check the process
if ! pidof dnsmasq > /dev/null; then
    send_alert "Dnsmasq process not running!"
    # Restart automatically
    if systemctl restart dnsmasq; then
        send_alert "Dnsmasq restarted successfully"
    else
        send_alert "Dnsmasq restart FAILED"
    fi
fi
# Check DNS resolution (dig exits non-zero when the server does not reply)
if ! result=$(dig @127.0.0.1 www.baidu.com +short +time=3 +tries=1 2>/dev/null) \
    || [ -z "$result" ]; then
    send_alert "DNS resolution failed!"
fi
# Check memory usage (pidof -s returns a single PID)
PID=$(pidof -s dnsmasq)
RSS=$(ps -o rss= -p "$PID" 2>/dev/null | tr -d ' ')
if [ -n "$RSS" ] && [ "$RSS" -gt 102400 ]; then   # above 100 MB
    send_alert "Dnsmasq memory usage high: ${RSS}KB"
fi
# Cache statistics: SIGUSR1 makes Dnsmasq write them to the system log
[ -n "$PID" ] && kill -USR1 "$PID" 2>/dev/null
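The `send_alert` function places the message straight into a JSON string, so a message that contains a double quote or a backslash would produce an invalid Slack payload. A small sketch of escaping the message first, using only bash parameter expansion; `json_escape` is a hypothetical helper and it covers only backslashes, double quotes, newlines and tabs, not every character JSON requires.

```bash
#!/bin/bash
# json_escape STRING
# Escapes backslash, double quote, newline and tab for use inside a JSON string.
json_escape() {
    local s=$1
    s=${s//\\/\\\\}      # backslash    -> \\
    s=${s//\"/\\\"}      # double quote -> \"
    s=${s//$'\n'/\\n}    # newline      -> \n
    s=${s//$'\t'/\\t}    # tab          -> \t
    printf '%s' "$s"
}

msg='Dnsmasq said: "no upstream" on C:\temp'
payload="{\"text\":\"$(json_escape "$msg")\"}"
printf '%s\n' "$payload"
```

The backslash is escaped first on purpose: doing it later would double the backslashes introduced by the other replacements.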
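9.5.4 Automatic Restart with Systemd
# /etc/systemd/system/dnsmasq.service.d/restart.conf
[Unit]
# Start rate limiting belongs in [Unit]: give up after 5 starts within 300 seconds
StartLimitIntervalSec=300
StartLimitBurst=5
[Service]
Restart=always
RestartSec=5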
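9.6 Advanced Keepalived Configuration
9.6.1 Non-Preemptive Mode
# Avoid flapping: after it recovers, the original master does not take the VIP back automatically
vrrp_instance VI_DNS {
    state BACKUP    # set both nodes to BACKUP
    nopreempt       # enable non-preemptive mode
    priority 100    # the preferred master keeps the higher priority
    ...
}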
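9.6.2 Multiple VRRP Instances (Separating DNS and DHCP)
# DNS master  -> server A
# DHCP master -> server B
# The two servers back each other up
# Each instance needs its own virtual_router_id; interface, advert_int and
# authentication are omitted here and follow 9.3.2
# Keepalived configuration on server A
vrrp_instance VI_DNS {
    state MASTER
    virtual_router_id 51
    priority 100
    virtual_ipaddress {
        192.168.1.253/24
    }
}
vrrp_instance VI_DHCP {
    state BACKUP
    virtual_router_id 52
    priority 90
    virtual_ipaddress {
        192.168.1.252/24
    }
}
# Keepalived configuration on server B (roles reversed)
vrrp_instance VI_DNS {
    state BACKUP
    virtual_router_id 51
    priority 90
    virtual_ipaddress {
        192.168.1.253/24
    }
}
vrrp_instance VI_DHCP {
    state MASTER
    virtual_router_id 52
    priority 100
    virtual_ipaddress {
        192.168.1.252/24
    }
}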
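9.7 Testing Failover
9.7.1 Simulating Failures Manually
# Method 1: stop Dnsmasq
sudo systemctl stop dnsmasq
# Method 2: stop Keepalived
sudo systemctl stop keepalived
# Method 3: take the network interface down
sudo ip link set eth1 down
# Method 4: lower the priority
# Edit keepalived.conf, lower the priority value, then reload Keepalived
# Watch the failover as it happens
sudo tcpdump -i eth1 -n vrrp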
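9.7.2 Measuring Failover Time
# Query DNS continuously from a client and watch how long the switchover takes
#!/bin/bash
# test-failover.sh
while true; do
    timestamp=$(date '+%H:%M:%S')
    # dig exits non-zero when the server does not reply, so check the exit status too
    if result=$(dig @192.168.1.254 www.baidu.com +short +time=1 +tries=1 2>/dev/null) \
        && [ -n "$result" ]; then
        echo "[$timestamp] OK: ${result//$'\n'/ }"
    else
        echo "[$timestamp] FAIL"
    fi
    sleep 1
done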
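The output of the loop above can be reduced to a single number. A rough sketch, assuming the log format printed by `test-failover.sh` (`[HH:MM:SS] FAIL` or `[HH:MM:SS] OK: ...`, one line per probe); `longest_outage` is a hypothetical helper that reports the longest run of consecutive failed probes, and the log lines are made-up sample data.

```bash
#!/bin/bash
# longest_outage < LOGFILE
# Prints the length of the longest run of consecutive FAIL lines.
longest_outage() {
    awk '
        $2 == "FAIL" { run++; if (run > max) max = run; next }
        { run = 0 }
        END { print max + 0 }
    '
}

longest_outage <<'EOF'
[10:00:01] OK: 203.0.113.10
[10:00:02] FAIL
[10:00:04] FAIL
[10:00:06] FAIL
[10:00:07] OK: 203.0.113.10
[10:00:08] FAIL
[10:00:10] OK: 203.0.113.10
EOF
```

Because a failed query also waits for its 1-second timeout, each FAIL line covers roughly two seconds. Comparing the timestamp of the first FAIL with that of the next OK gives a more precise failover time than the probe count alone.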
9.8 Complete High-Availability Example
Master Node Checklist
Node details:
- Hostname: dns-master
- IP: 192.168.1.1
- VIP: 192.168.1.254
- Role: MASTER
Files to configure:
1. /etc/dnsmasq.d/*.conf (Dnsmasq configuration)
2. /etc/keepalived/keepalived.conf (Keepalived configuration)
3. /etc/keepalived/scripts/notify.sh (notification script)
4. /usr/local/bin/monitor-dnsmasq.sh (monitoring script)
Deployment Steps
# 1. Install the packages
sudo apt install dnsmasq keepalived
# 2. Configure Dnsmasq (same configuration on both servers)
sudo cp -r /path/to/dnsmasq-config/* /etc/dnsmasq.d/
# 3. Configure Keepalived (note that master and backup use different priority values)
sudo cp /path/to/keepalived-master.conf /etc/keepalived/keepalived.conf
# 4. Start the services
sudo systemctl enable --now dnsmasq
sudo systemctl enable --now keepalived
# 5. Verify the VIP
ip addr show eth1 | grep 192.168.1.254
# 6. Test DNS
dig @192.168.1.254 www.baidu.com
# 7. Test failover
sudo systemctl stop dnsmasq
# On the other server, verify that the VIP has moved over
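9.9 Summary
| Option | Pros | Cons | Best suited for |
|---|---|---|---|
| Keepalived active-standby | Automatic failover, VIP stays the same | Requires extra software | Enterprise networks |
| Active-active DNS | Simple, no extra software | Configuration must be synced manually | Small networks |
| DHCP split scope | No single point of failure | Each server gets only half the address pool | Large networks |
9.10 Further Reading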