Disaster Recovery: Ensuring Business Continuity
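What Is Disaster Recovery
Disaster recovery (DR) is the set of strategies and procedures for restoring IT systems and data after a catastrophic event. Its goals are to minimize business downtime and to preserve data integrity.
Key Metrics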
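Key disaster recovery metrics:
├── RPO (Recovery Point Objective)
│   └── Maximum acceptable data loss, expressed as a window of time
├── RTO (Recovery Time Objective)
│   └── Maximum acceptable time to bring the system back
└── MTTR (Mean Time To Recovery)
    └── Average time recovery actually takes, measured across incidents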
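To make the three metrics concrete, here is a minimal sketch that checks a backup interval against an RPO and derives MTTR from past recoveries. The function names and the sample numbers are illustrative assumptions, not part of any tool.

```python
from datetime import timedelta
from typing import List


def meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """Worst-case data loss is the gap between two consecutive backups."""
    return backup_interval <= rpo


def mean_time_to_recovery(durations: List[timedelta]) -> timedelta:
    """MTTR is the arithmetic mean of the observed recovery durations."""
    if not durations:
        raise ValueError("at least one recovery duration is required")
    return sum(durations, timedelta()) / len(durations)


def meets_rto(durations: List[timedelta], rto: timedelta) -> bool:
    """The RTO is a per-incident limit, so every recovery must fit within it."""
    return all(d <= rto for d in durations)
```

With recoveries of 30 and 90 minutes, the MTTR is 60 minutes, yet a 1-hour RTO was still missed once. That is why the RTO check looks at each recovery rather than at the average.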
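Disaster Recovery Strategies
Strategy Comparison
recovery_strategies:
  cold_site:
    description: "Cold standby site"
    rto: "24-48 hours"
    rpo: "24 hours"
    cost: "low"
    use_case: "Non-critical systems"
  warm_site:
    description: "Warm standby site"
    rto: "4-24 hours"
    rpo: "1-4 hours"
    cost: "medium"
    use_case: "Important systems"
  hot_site:
    description: "Hot standby site"
    rto: "15 minutes to 1 hour"
    rpo: "15 minutes"
    cost: "high"
    use_case: "Critical business systems"
  active_active:
    description: "Active-active"
    rto: "minutes"
    rpo: "near zero"
    cost: "very high"
    use_case: "Core business systems"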
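The comparison above can be turned into a simple selection rule: pick the cheapest strategy whose worst-case RTO and RPO both fit the requirement. The sketch below encodes the upper bounds from the comparison. The 5-minute RTO and zero RPO used for active_active are assumptions standing in for "minutes" and "near zero".

```python
from datetime import timedelta
from typing import List, NamedTuple


class Strategy(NamedTuple):
    name: str
    worst_rto: timedelta
    worst_rpo: timedelta


# Ordered from cheapest to most expensive, as in the comparison.
STRATEGIES: List[Strategy] = [
    Strategy("cold_site", timedelta(hours=48), timedelta(hours=24)),
    Strategy("warm_site", timedelta(hours=24), timedelta(hours=4)),
    Strategy("hot_site", timedelta(hours=1), timedelta(minutes=15)),
    # Assumed figures for "minutes" and "near zero".
    Strategy("active_active", timedelta(minutes=5), timedelta(0)),
]


def pick_strategy(required_rto: timedelta, required_rpo: timedelta) -> str:
    """Return the cheapest strategy that satisfies both objectives."""
    for strategy in STRATEGIES:
        if strategy.worst_rto <= required_rto and strategy.worst_rpo <= required_rpo:
            return strategy.name
    raise ValueError("no listed strategy meets the requested RTO/RPO")
```

For example, a system that tolerates 2 hours of downtime and 30 minutes of data loss lands on `hot_site`, because the warm site's worst-case RTO of 24 hours is too slow.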
Data Backup Strategy
Backup Configuration
# backup-strategy.yaml
backup:
  database:
    full_backup:
      schedule: "0 2 * * *"  # daily at 02:00
      retention: "30d"
      storage: "s3://backups/db/full/"
    incremental_backup:
      schedule: "0 */6 * * *"  # every 6 hours
      retention: "7d"
      storage: "s3://backups/db/incremental/"
    wal_archive:
      enabled: true
      retention: "7d"
      compression: true
  application:
    config_backup:
      schedule: "0 0 * * *"  # daily at midnight
      retention: "90d"
      paths:
        - "/etc/app/"
        - "/opt/app/config/"
    log_backup:
      schedule: "0 1 * * *"  # daily at 01:00
      retention: "30d"
      compression: true
  infrastructure:
    terraform_state:
      enabled: true
      storage: "s3://terraform-state/"
      versioning: true
    kubernetes_configs:
      schedule: "0 * * * *"  # hourly
      paths:
        - "/etc/kubernetes/"
        - "~/.kube/"
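With daily full backups and incrementals every 6 hours, a restore to a given moment needs the latest full backup at or before that moment plus every incremental taken after it, up to the target. The sketch below picks that chain from lists of backup timestamps. It assumes each incremental builds on the previous backup, and the function name is hypothetical.

```python
from datetime import datetime
from typing import List, Tuple


def restore_chain(
    fulls: List[datetime], incrementals: List[datetime], target: datetime
) -> Tuple[datetime, List[datetime]]:
    """Return the base full backup and the ordered incrementals to apply."""
    candidates = [t for t in fulls if t <= target]
    if not candidates:
        raise ValueError("no full backup at or before the target time")
    base = max(candidates)
    # Only incrementals taken after the base and no later than the target.
    chain = sorted(t for t in incrementals if base < t <= target)
    return base, chain
```

Any changes between the last incremental in the chain and the target time would have to come from the WAL archive, which is what keeps the effective RPO below the 6-hour incremental interval.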
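Automated Backup Script
#!/bin/bash
# backup.sh - automated backup script
set -euo pipefail  # abort on errors, unset variables, and failed pipeline stages

TODAY="$(date +%Y%m%d)"  # computed once so a run that crosses midnight stays consistent
BACKUP_DIR="/backup/${TODAY}"
S3_BUCKET="s3://my-backups"
RETENTION_DAYS=30

# Create the backup directory
mkdir -p "$BACKUP_DIR"

# Back up the database
echo "Backing up the database..."
pg_dump -U postgres -h localhost mydb | gzip > "$BACKUP_DIR/db_full_$(date +%H%M%S).sql.gz"

# Upload to S3
echo "Uploading to S3..."
aws s3 sync "$BACKUP_DIR" "$S3_BUCKET/$TODAY/"

# Remove old local backups: only date-named directories, never /backup itself
echo "Removing old backups..."
find /backup -mindepth 1 -maxdepth 1 -type d -name '20[0-9][0-9][0-9][0-9][0-9][0-9]' \
  -mtime +"$RETENTION_DAYS" -exec rm -rf {} +

# Report completion
echo "Backup complete: $BACKUP_DIR"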
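The cleanup step relies on directory modification times, which can shift if a directory is touched later. A hedged alternative is to decide from the date encoded in the directory name, as sketched below. The function name is hypothetical, and the `YYYYMMDD` naming follows the script above.

```python
from datetime import date, datetime, timedelta
from typing import Iterable, List


def expired_backups(
    dir_names: Iterable[str], today: date, retention_days: int = 30
) -> List[str]:
    """Return date-named backup directories older than the retention window."""
    cutoff = today - timedelta(days=retention_days)
    expired = []
    for name in dir_names:
        try:
            taken = datetime.strptime(name, "%Y%m%d").date()
        except ValueError:
            continue  # not a date-named directory, so leave it alone
        if taken < cutoff:
            expired.append(name)
    return sorted(expired)
```

Directories that do not parse as a date are skipped rather than deleted, so unrelated folders under the backup root are left untouched.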
Multi-Region Deployment
AWS Multi-Region Architecture
# multi-region.tf
# Primary region
provider "aws" {
  alias  = "us_east"
  region = "us-east-1"
}
# DR region
provider "aws" {
  alias  = "us_west"
  region = "us-west-2"
}
# Primary region resources
module "primary" {
  source = "./modules/app"
  providers = {
    aws = aws.us_east
  }
  environment = "production"
  is_primary  = true
}
# DR region resources
module "secondary" {
  source = "./modules/app"
  providers = {
    aws = aws.us_west
  }
  environment = "production"
  is_primary  = false
  # Replicate data from the primary region
  db_replication_source = module.primary.db_endpoint
}
# Route 53 failover
resource "aws_route53_record" "app" {
  zone_id        = "Z1234567890"
  name           = "app.example.com"
  type           = "A"
  set_identifier = "primary" # required when a routing policy is set
  failover_routing_policy {
    type = "PRIMARY"
  }
  alias {
    name                   = module.primary.alb_dns
    zone_id                = module.primary.alb_zone_id
    evaluate_target_health = true # lets Route 53 detect that the primary is down
  }
}
resource "aws_route53_record" "app_secondary" {
  zone_id        = "Z1234567890"
  name           = "app.example.com"
  type           = "A"
  set_identifier = "secondary"
  failover_routing_policy {
    type = "SECONDARY"
  }
  alias {
    name                   = module.secondary.alb_dns
    zone_id                = module.secondary.alb_zone_id
    evaluate_target_health = true
  }
}
Kubernetes Disaster Recovery
etcd Backup
#!/bin/bash
# etcd-backup.sh
set -euo pipefail
# Keep the path in a variable so later steps act on this snapshot only,
# not on every file matching /backup/etcd-*.db
SNAPSHOT="/backup/etcd-$(date +%Y%m%d-%H%M%S).db"
ETCDCTL_API=3 etcdctl snapshot save "$SNAPSHOT" \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key
# Verify the snapshot
ETCDCTL_API=3 etcdctl snapshot status "$SNAPSHOT" --write-out=table
# Upload to S3
aws s3 cp "$SNAPSHOT" s3://k8s-backups/etcd/
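Because every run writes a new timestamped file, a restore first has to decide which snapshot is the newest. A minimal sketch, assuming the `etcd-YYYYMMDD-HHMMSS.db` naming used by the script above (the function name is hypothetical):

```python
import re
from datetime import datetime
from typing import Iterable, Optional

_SNAPSHOT_NAME = re.compile(r"^etcd-(\d{8}-\d{6})\.db$")


def latest_snapshot(filenames: Iterable[str]) -> Optional[str]:
    """Return the snapshot with the newest timestamp in its name, if any."""
    dated = []
    for name in filenames:
        match = _SNAPSHOT_NAME.match(name)
        if match:
            taken = datetime.strptime(match.group(1), "%Y%m%d-%H%M%S")
            dated.append((taken, name))
    if not dated:
        return None
    return max(dated)[1]
```

Parsing the name instead of trusting file modification times keeps the choice stable after snapshots have been copied or downloaded.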
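Kubernetes Resource Backup
#!/bin/bash
# k8s-backup.sh
set -euo pipefail
BACKUP_DIR="/backup/k8s/$(date +%Y%m%d)"
mkdir -p "$BACKUP_DIR"
# Back up the listed resource types across all namespaces
# Note: exported secrets are only base64-encoded, so protect or encrypt the output
for resource in deployments services configmaps secrets namespaces; do
  kubectl get "$resource" --all-namespaces -o yaml > "$BACKUP_DIR/$resource.yaml"
done
# Back up PersistentVolume definitions (metadata only, not the volume contents)
kubectl get pv -o yaml > "$BACKUP_DIR/persistent-volumes.yaml"
# Upload
aws s3 sync "$BACKUP_DIR" s3://k8s-backups/resources/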
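Recovery Procedures
Database Restore
#!/bin/bash
# database-restore.sh <backup-file> [restore-point]
set -euo pipefail
BACKUP_FILE="${1:?usage: database-restore.sh <backup-file> [restore-point]}"
RESTORE_POINT="${2:-}"
# 1. Stop the application
echo "Stopping the application service..."
systemctl stop app
# 2. Restore the database (assumes mydb is empty; ON_ERROR_STOP aborts on the first SQL error)
echo "Restoring the database..."
gunzip -c "$BACKUP_FILE" | psql -v ON_ERROR_STOP=1 -U postgres -d mydb
# 3. Point-in-time recovery (only if WAL archiving is enabled)
if [ -n "$RESTORE_POINT" ]; then
  echo "Restore point requested: $RESTORE_POINT"
  # WAL cannot be replayed on top of a pg_dump restore; it needs a physical
  # base backup (for example from pg_basebackup) plus the archived WAL.
  # PostgreSQL 12+: set restore_command and recovery_target_time in
  # postgresql.conf and create recovery.signal; older versions use recovery.conf.
fi
# 4. Verify the data
echo "Verifying data integrity..."
psql -U postgres -d mydb -c "SELECT COUNT(*) FROM users;"
# 5. Start the application
echo "Starting the application service..."
systemctl start app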
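The verification step counts rows in a single table, which confirms the table exists but not that the restore is complete. A slightly stronger check, sketched below, compares per-table row counts recorded at backup time with the counts after the restore. The manifest of expected counts is an assumption, since the backup script does not produce one, and the function name is hypothetical.

```python
from typing import Dict, Optional, Tuple


def count_mismatches(
    expected: Dict[str, int], actual: Dict[str, int]
) -> Dict[str, Tuple[int, Optional[int]]]:
    """Map each table whose restored row count differs to (expected, actual)."""
    mismatches = {}
    for table, want in expected.items():
        got = actual.get(table)  # None means the table is missing after restore
        if got != want:
            mismatches[table] = (want, got)
    return mismatches
```

An empty result means every table in the manifest came back with the recorded number of rows. It does not prove the row contents are identical, so checksums or application-level tests are still worthwhile.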
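Full Environment Restore
#!/bin/bash
# full-restore.sh
set -euo pipefail
echo "Starting full environment restore..."
# 1. Restore infrastructure (-chdir leaves the working directory unchanged,
#    so the relative path in step 4 still resolves)
terraform -chdir=terraform apply -auto-approve
# 2. Restore the Kubernetes cluster
# Initialize a new cluster
kubeadm init
# Restore etcd into a fresh data directory (example path); etcd must then be
# reconfigured to use that directory and restarted
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd/latest.db --data-dir /var/lib/etcd-restore
# 3. Restore applications
kubectl apply -f /backup/k8s/resources/
# 4. Restore data
./database-restore.sh /backup/db/latest.sql.gz
# 5. Verify
echo "Verifying system state..."
kubectl get pods --all-namespaces
curl -fsS http://app/health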
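Failover Drills
Drill Plan
# dr-drill.yaml
drill:
  name: "Quarterly disaster recovery drill"
  schedule: "quarterly"
  scope:
    - "Database failover"
    - "Application service failover"
    - "Network failure simulation"
  steps:
    - name: "Simulate primary region failure"
      action: "Shut down primary region services"
      duration: "30 minutes"
    - name: "Verify DR region"
      action: "Check DR region service status"
      criteria:
        - "Services available"
        - "Data consistent"
        - "Performance normal"
    - name: "Execute failover"
      action: "Switch DNS to the DR region"
    - name: "Verify business continuity"
      action: "Run business verification tests"
    - name: "Restore primary region"
      action: "Restart primary region services"
    - name: "Data synchronization"
      action: "Sync data back to the primary region"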
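Drill Execution Script
#!/bin/bash
# dr-drill.sh
echo "Starting disaster recovery drill..."
# 1. Notify
curl -X POST https://hooks.slack.com/services/xxx \
  -H 'Content-Type: application/json' \
  -d '{"text": "Disaster recovery drill starting, expected duration 1 hour"}'
# 2. Simulate the failure
echo "Simulating primary region failure..."
kubectl scale deployment app --replicas=0 -n production
# 3. Fail over
echo "Executing failover..."
# Switch DNS
aws route53 change-resource-record-sets \
  --hosted-zone-id Z1234567890 \
  --change-batch file://failover-changeset.json
# 4. Verify (up to 10 attempts, 10 seconds apart)
echo "Verifying the DR region..."
VERIFIED=false
for i in {1..10}; do
  STATUS=$(curl -s -o /dev/null -w "%{http_code}" https://dr-app.example.com/health)
  if [ "$STATUS" = "200" ]; then
    echo "DR region verified"
    VERIFIED=true
    break
  fi
  sleep 10
done
if [ "$VERIFIED" != "true" ]; then
  echo "DR region failed verification after 10 attempts" >&2
fi
# 5. Restore
echo "Restoring the primary region..."
kubectl scale deployment app --replicas=3 -n production
# 6. Data synchronization
echo "Synchronizing data..."
# Run the data synchronization script here
# 7. Notify completion
curl -X POST https://hooks.slack.com/services/xxx \
  -H 'Content-Type: application/json' \
  -d '{"text": "Disaster recovery drill complete"}'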
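A drill is most useful when it records how long the failover actually took, since that figure can be compared with the RTO. The sketch below mirrors the verification loop in the script (10 attempts, 10 seconds apart) and returns the elapsed time until the first healthy response. The `check` callable is a stand-in for the HTTP health request, and all names are illustrative assumptions.

```python
import time
from typing import Callable, Optional


def wait_until_healthy(
    check: Callable[[], bool],
    attempts: int = 10,
    delay: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Optional[float]:
    """Poll check() until it succeeds; return elapsed seconds, or None on failure."""
    start = clock()
    for attempt in range(attempts):
        if check():
            return clock() - start
        if attempt < attempts - 1:
            sleep(delay)  # no need to wait after the final attempt
    return None
```

Passing `sleep` and `clock` as parameters is a design choice that keeps the function testable without real waiting. In a drill, the returned value would be logged and compared with the RTO.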
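Best Practices
- Test regularly: run a full disaster recovery drill at least once a year; the quarterly schedule in the drill plan above is a stronger baseline
- Document: keep the disaster recovery runbooks up to date
- Automate: automate as much of the recovery process as possible
- Monitor: monitor backup status and data consistency
- Go multi-region: deploy critical systems across multiple geographic regions
- Optimize cost: choose the strategy that matches your RPO/RTO requirements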