Disaster Recovery Strategy
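Disaster Recovery Tiers
| Tier | RTO | RPO | Description |
|---|---|---|---|
| Cold standby | Hours | Days | Manual recovery |
| Warm standby | Minutes | Hours | Semi-automated recovery |
| Hot standby | Seconds | Minutes | Automatic failover |
| Active-active | ≈0 | ≈0 | All sites serve traffic concurrently |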
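RTO and RPO
- RTO (Recovery Time Objective): the maximum acceptable time to restore service after an outage
- RPO (Recovery Point Objective): the maximum acceptable window of data loss, measured in time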
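The two objectives above can be turned into simple pass/fail checks. The following is a minimal sketch, not a prescribed implementation: the function names `rpo_met` and `rto_met` are hypothetical, and all arguments are assumed to be Unix epoch seconds or durations in seconds.

```shell
# rpo_met LAST_BACKUP_EPOCH NOW_EPOCH RPO_SECONDS
# Succeeds (exit 0) when the newest recoverable copy is no older than the RPO.
rpo_met() {
    local last_backup=$1 now=$2 rpo=$3
    local age=$(( now - last_backup ))
    [ "$age" -le "$rpo" ]
}

# rto_met OUTAGE_START_EPOCH SERVICE_RESTORED_EPOCH RTO_SECONDS
# Succeeds (exit 0) when service came back within the RTO.
rto_met() {
    local start=$1 restored=$2 rto=$3
    local downtime=$(( restored - start ))
    [ "$downtime" -le "$rto" ]
}
```

For example, `rpo_met "$last_backup_epoch" "$(date +%s)" 3600` would check an hour-level RPO, where `last_backup_epoch` is whatever timestamp your backup tooling records.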
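Database Disaster Recovery
MySQL Primary-Replica Replication
-- Run on the replica: point it at the primary
-- MASTER_AUTO_POSITION=1 requires GTID-based replication (gtid_mode=ON)
-- Newer MySQL 8.0 releases prefer CHANGE REPLICATION SOURCE TO / START REPLICA / SHOW REPLICA STATUS
CHANGE MASTER TO
  MASTER_HOST='primary.example.com',
  MASTER_USER='repl',
  MASTER_PASSWORD='secret',
  MASTER_AUTO_POSITION=1;
-- Start replication
START SLAVE;
-- Check replication status
SHOW SLAVE STATUS\G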
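Reading the status output by eye does not scale, so it may help to reduce it to a single exit code. The sketch below assumes the usual `\G` layout of one `Field: value` pair per line and looks only at `Slave_IO_Running`, `Slave_SQL_Running`, and `Seconds_Behind_Master`. The function name `replication_healthy` is hypothetical.

```shell
# replication_healthy MAX_LAG_SECONDS
# Reads `SHOW SLAVE STATUS\G` output on stdin. Exits 0 only when both
# replication threads report "Yes" and the lag is a number within the limit.
# A lag of NULL (replication stopped) or empty input counts as unhealthy.
replication_healthy() {
    awk -v max="$1" -F': *' '
        { gsub(/^[ \t]+/, "", $1) }              # strip the left padding of \G output
        $1 == "Slave_IO_Running"      { io  = $2 }
        $1 == "Slave_SQL_Running"     { sql = $2 }
        $1 == "Seconds_Behind_Master" { lag = $2 }
        END {
            ok = (io == "Yes" && sql == "Yes" && lag ~ /^[0-9]+$/ && lag + 0 <= max + 0)
            exit (ok ? 0 : 1)
        }'
}
```

A possible invocation on the replica is `mysql -e 'SHOW SLAVE STATUS\G' | replication_healthy 30`, which mirrors a 30-second lag threshold.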
Automatic Failover
# Orchestrator configuration
{
  "MySQLTopologyUser": "orchestrator",
  "MySQLTopologyPassword": "secret",
  "RecoverMasterClusterFilters": ["*"],
  "RecoverIntermediateMasterClusterFilters": ["*"],
  "OnFailureDetectionProcesses": [
    "echo '{failureType}' >> /var/log/orchestrator/detection.log"
  ],
  "PreFailoverProcesses": [],
  "PostFailoverProcesses": [],
  "PostUnsuccessfulFailoverProcesses": []
}
Kubernetes Disaster Recovery
etcd Backup
# Back up etcd
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-$(date +%Y%m%d).db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key
# Restore etcd
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd.db \
  --data-dir=/var/lib/etcd-restore
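The backup command writes date-stamped files, while the restore command names a fixed path, so a restore needs some way to pick the snapshot. One option is sketched below, assuming snapshots follow the `etcd-YYYYMMDD.db` naming used by the backup command. The function name `latest_snapshot` is hypothetical.

```shell
# latest_snapshot DIR
# Prints the path of the newest etcd-YYYYMMDD.db file in DIR, judged by the
# date in the file name. Fails if no snapshot exists or the newest is empty.
latest_snapshot() {
    local dir=$1 newest="" f
    for f in "$dir"/etcd-[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9].db; do
        [ -f "$f" ] || continue   # the glob matched nothing
        newest=$f                 # globs expand in sorted order, so the last match is newest
    done
    [ -n "$newest" ] && [ -s "$newest" ] || return 1
    printf '%s\n' "$newest"
}
```

The result could then be passed to the restore command, for example `etcdctl snapshot restore "$(latest_snapshot /backup)" --data-dir=/var/lib/etcd-restore`. A non-empty file is only a basic sanity check and does not prove the snapshot is restorable.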
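Velero Backup
# Install Velero
# (Velero 1.2 and later also require --plugins with the AWS object-store plugin image)
velero install \
  --provider aws \
  --bucket my-k8s-backup \
  --secret-file ./credentials \
  --backup-location-config region=us-west-2
# Create a scheduled backup (daily at 02:00)
velero schedule create daily-backup --schedule="0 2 * * *"
# Restore from the most recent backup created by the schedule
velero restore create --from-schedule daily-backup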
In Practice: A Complete Disaster Recovery Plan
# 1. Scheduled Velero backups
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-backup
  namespace: velero
spec:
  schedule: "0 2 * * *"
  template:
    ttl: 720h
    includedNamespaces:
      - production
    storageLocation: default
    volumeSnapshotLocations:
      - default
---
# 2. Secondary backup storage location in another region
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: secondary
  namespace: velero
spec:
  provider: aws
  objectStorage:
    bucket: my-k8s-backup-secondary
  config:
    region: us-east-1
---
# 3. Restore policy
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: disaster-recovery
  namespace: velero
spec:
  backupName: daily-backup-20240101
  includedNamespaces:
    - production
  restorePVs: true
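The Restore manifest above hard-codes a `backupName`, so in a real incident someone has to choose which backup to restore. The sketch below picks the newest name from a list, under the assumption that backups created by a schedule are named `<schedule>-<timestamp>` with a fixed-width numeric timestamp, so that lexical order matches chronological order. The function name `latest_backup_for` is hypothetical, and how the names are listed is left to your tooling.

```shell
# latest_backup_for SCHEDULE
# Reads backup names on stdin, one per line, and prints the newest one of the
# form SCHEDULE-<digits>. Fails when no name matches.
latest_backup_for() {
    local schedule=$1 newest
    newest=$(awk -v p="$schedule-" \
        'index($0, p) == 1 && substr($0, length(p) + 1) ~ /^[0-9]+$/' \
        | sort | tail -n 1)
    [ -n "$newest" ] || return 1
    printf '%s\n' "$newest"
}
```

The printed name is what would go into the `backupName` field of the Restore manifest.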
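Database Backup Strategy
#!/bin/bash
# Backup script
set -euo pipefail  # abort if mysqldump, gzip, or the upload fails
BACKUP_DIR="/backup/mysql"
DATE=$(date +%Y%m%d_%H%M%S)
KEEP_DAYS=7
# Full backup (--single-transaction gives a consistent snapshot for InnoDB tables)
mysqldump --all-databases --single-transaction \
  --routines --triggers --events \
  | gzip > "$BACKUP_DIR/full_$DATE.sql.gz"
# Upload to S3
aws s3 cp "$BACKUP_DIR/full_$DATE.sql.gz" \
  "s3://my-backup/mysql/full_$DATE.sql.gz"
# Remove old local backups
find "$BACKUP_DIR" -name "*.gz" -mtime +"$KEEP_DAYS" -delete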
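A backup file that exists is not necessarily a usable backup, so it can be worth verifying the archive before uploading it. The sketch below checks that the file is non-empty, that the gzip stream is intact, and that the dump ends with the `-- Dump completed` trailer that `mysqldump` writes by default. That last check assumes comments were not disabled (for example with `--skip-comments`). The function name `verify_backup` is hypothetical.

```shell
# verify_backup FILE
# Exits 0 only when FILE is non-empty, passes gzip's integrity test, and its
# last line is mysqldump's completion trailer.
verify_backup() {
    local file=$1
    [ -s "$file" ] || { echo "missing or empty: $file" >&2; return 1; }
    gzip -t "$file" 2>/dev/null || { echo "corrupt gzip stream: $file" >&2; return 1; }
    gzip -dc "$file" | tail -n 1 | grep -q '^-- Dump completed' \
        || { echo "dump looks truncated: $file" >&2; return 1; }
}
```

In the script above this could run between the dump and the upload, for example `verify_backup "$BACKUP_DIR/full_$DATE.sql.gz"`. It does not replace a periodic test restore.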
Failover Drill
#!/bin/bash
# Failover drill script
echo "=== Failover drill ==="
# 1. Simulate a primary failure
echo "Simulating primary failure..."
kubectl delete pod mysql-primary-0
# 2. Wait for automatic failover
echo "Waiting for failover..."
sleep 30
# 3. Verify the new primary
echo "Verifying new primary..."
kubectl exec mysql-secondary-0 -- mysql -e "SELECT @@server_id"
# 4. Verify data consistency
echo "Verifying data consistency..."
kubectl exec mysql-secondary-0 -- mysql -e "SELECT COUNT(*) FROM users"
echo "=== Drill complete ==="
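A fixed `sleep 30` either wastes time or gives up too early, and it says nothing about how long failover actually took. A polling loop with a timeout is one alternative. This is a minimal sketch with a hypothetical helper named `wait_until`. The elapsed time counts only the sleeps, so the timeout is approximate.

```shell
# wait_until TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND until it succeeds. Returns 1 once roughly TIMEOUT_SECONDS
# have been spent sleeping without a success.
wait_until() {
    local timeout=$1 interval=$2 waited=0
    shift 2
    until "$@"; do
        [ "$waited" -ge "$timeout" ] && return 1
        sleep "$interval"
        waited=$(( waited + interval ))
    done
}
```

In the drill, step 2 could become something like `wait_until 120 5 kubectl exec mysql-secondary-0 -- mysql -e "SELECT 1"`, where the probe command is only a placeholder for whatever proves the promoted node is accepting writes. Recording the time before and after the call also gives a measured recovery time to compare against the RTO.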
Monitoring and Alerting
groups:
  - name: disaster-recovery
    rules:
      - alert: DatabaseReplicationLag
        expr: mysql_slave_status_seconds_behind_master > 30
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Database replication lag exceeds 30 seconds"
      - alert: BackupFailed
        # velero_backup_last_status is 1 after a successful backup and 0 after a failed one
        expr: velero_backup_last_status == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Velero backup failed"
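Best Practices
- Back up regularly and test restores
- Keep redundant copies in multiple regions
- Automate failover
- Monitor replication status
- Document recovery procedures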
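Summary
Disaster recovery is a core capability of enterprise operations. Layered backups, automated failover, and regular drills together help maintain business continuity when failures occur.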