Disaster Recovery Strategy

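Disaster Recovery Tiers

Tier            RTO        RPO        Description
Cold standby    Hours      Days       Manual recovery
Warm standby    Minutes    Hours      Semi-automated recovery
Hot standby     Seconds    Minutes    Automatic failover
Active-active   Near zero  Near zero  Both sites serve traffic simultaneously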

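RTO and RPO

RTO (Recovery Time Objective) is the longest outage the business can tolerate before service must be restored. RPO (Recovery Point Objective) is the largest window of data loss it can tolerate, measured back from the moment of failure.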
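As a rough illustration of how a backup schedule translates into an RPO, the sketch below computes the worst-case data loss for periodic backups. It assumes each backup captures the state at its start time, so a failure just before the next backup finishes loses one full interval plus one backup duration. The function names and the sample numbers are hypothetical.

```bash
#!/bin/bash
# Worst-case data loss for periodic backups, in minutes.
#   $1 = minutes between backup starts
#   $2 = minutes one backup takes to complete
worst_case_rpo_minutes() {
  local interval_min=$1 duration_min=$2
  echo $(( interval_min + duration_min ))
}

# Succeeds (exit 0) when the worst case fits inside the RPO target ($3, minutes).
meets_rpo() {
  local interval_min=$1 duration_min=$2 target_min=$3
  [ "$(worst_case_rpo_minutes "$interval_min" "$duration_min")" -le "$target_min" ]
}

# Example: daily backups that take 20 minutes, checked against a 24-hour RPO.
if meets_rpo 1440 20 1440; then
  echo "RPO target met"
else
  echo "RPO target missed: worst case is $(worst_case_rpo_minutes 1440 20) minutes"
fi
```

With these sample values the check fails, because the worst case is 1460 minutes: a daily backup alone cannot guarantee a 24-hour RPO.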

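Database Disaster Recovery

MySQL Primary-Replica Replication

-- Run on the replica: point it at the primary.
-- MASTER_AUTO_POSITION=1 requires GTID mode (gtid_mode=ON) on both servers.
-- MySQL 8.0.23+ prefers CHANGE REPLICATION SOURCE TO, START REPLICA,
-- and SHOW REPLICA STATUS; the statements below are the older equivalents.
CHANGE MASTER TO
  MASTER_HOST='primary.example.com',
  MASTER_USER='repl',
  MASTER_PASSWORD='secret',
  MASTER_AUTO_POSITION=1;

-- Start replication
START SLAVE;

-- Check status: Slave_IO_Running and Slave_SQL_Running should both be Yes
SHOW SLAVE STATUS\G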
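The status output is long, and only a few fields matter for a quick health check. The following is a minimal sketch of a filter that reads `SHOW SLAVE STATUS\G` output on stdin and reports whether replication is healthy. The function name, the exit codes, and the 30-second default threshold are assumptions chosen for illustration.

```bash
#!/bin/bash
# Reads `SHOW SLAVE STATUS\G` output on stdin.
# Exit codes: 0 = healthy, 1 = lagging beyond max_lag, 2 = replication broken.
check_replica_status() {
  local max_lag=${1:-30}   # seconds; assumed default, tune to your RPO
  awk -v max_lag="$max_lag" '
    # Field names are right-aligned, so strip leading whitespace first.
    { sub(/^[ \t]+/, "") }
    /^Slave_IO_Running:/      { io  = $2 }
    /^Slave_SQL_Running:/     { sql = $2 }
    /^Seconds_Behind_Master:/ { lag = $2 }
    END {
      if (io != "Yes" || sql != "Yes") { print "BROKEN: replication threads not running"; exit 2 }
      if (lag == "NULL")               { print "BROKEN: lag unknown"; exit 2 }
      if (lag + 0 > max_lag + 0)       { print "LAGGING: " lag "s behind"; exit 1 }
      print "OK: " lag "s behind"
    }'
}

# Usage (on the replica):
#   mysql -e 'SHOW SLAVE STATUS\G' | check_replica_status 30
```

Returning distinct exit codes lets a cron job or monitoring probe tell a slow replica apart from a stopped one.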

Automatic Failover

Orchestrator configuration (JSON):
{
  "MySQLTopologyUser": "orchestrator",
  "MySQLTopologyPassword": "secret",
  "RecoverMasterClusterFilters": ["*"],
  "RecoverIntermediateMasterClusterFilters": ["*"],
  "OnFailureDetectionProcesses": [
    "echo '{failureType}' >> /var/log/orchestrator/detection.log"
  ],
  "PreFailoverProcesses": [],
  "PostFailoverProcesses": [],
  "PostUnsuccessfulFailoverProcesses": []
}
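The hook lists in this configuration are empty. As a sketch of what a post-failover hook might look like, the script below records which host failed and which was promoted. The script path, the log path, and the log format are all hypothetical. `{failureType}`, `{failedHost}`, and `{successorHost}` are placeholders that Orchestrator substitutes before running the command.

```bash
#!/bin/bash
# Hypothetical hook, e.g. saved as /usr/local/bin/failover-hook.sh and wired in as:
#   "PostFailoverProcesses": [
#     "/usr/local/bin/failover-hook.sh {failureType} {failedHost} {successorHost}"
#   ]

# Builds one log line; the timestamp is passed in so the function stays testable.
format_failover_event() {
  local timestamp=$1 failure_type=$2 failed_host=$3 successor_host=${4:-}
  printf '%s type=%s failed=%s promoted=%s\n' \
    "$timestamp" "$failure_type" "$failed_host" "${successor_host:-none}"
}

# Appends the event to a log file (path is an assumption; override via LOG_FILE).
log_failover_event() {
  local log_file=${LOG_FILE:-/var/log/orchestrator/failover.log}
  format_failover_event "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$@" >> "$log_file"
}

# Only act when invoked with the three hook arguments.
if [ "$#" -ge 3 ]; then
  log_failover_event "$1" "$2" "$3"
fi
```

A hook like this is also a natural place to update service discovery or notify an on-call channel, depending on how clients locate the primary.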

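Kubernetes Disaster Recovery

etcd Backup

# Back up etcd (ETCDCTL_API=3 is already the default since etcd 3.4)
ETCDCTL_API=3 etcdctl snapshot save "/backup/etcd-$(date +%Y%m%d).db" \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# Restore etcd into a fresh data directory. /backup/etcd.db stands for the
# dated snapshot file saved above. Afterwards, point etcd at the new
# directory (with kubeadm, via the hostPath in the etcd static pod manifest).
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd.db \
  --data-dir=/var/lib/etcd-restore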
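Daily snapshots accumulate quickly, so some retention policy is usually needed. The sketch below keeps only the newest N snapshots in a directory. It assumes the `etcd-YYYYMMDD.db` naming used by the backup command, so that alphabetical order equals chronological order. The function name is hypothetical.

```bash
#!/bin/bash
# Deletes all but the newest $2 snapshots in directory $1.
# Relies on etcd-YYYYMMDD.db names: glob results sort oldest first.
prune_snapshots() {
  local dir=$1 keep=$2
  local files=("$dir"/etcd-*.db)
  # When nothing matches, the glob stays literal; nothing to prune then.
  [ -e "${files[0]}" ] || return 0
  local remove_count=$(( ${#files[@]} - keep ))
  local i
  for (( i = 0; i < remove_count; i++ )); do
    rm -f -- "${files[i]}"
    echo "removed ${files[i]}"
  done
}

# Usage: keep the seven most recent snapshots.
#   prune_snapshots /backup 7
```

Pruning by count rather than by file age means a stalled backup job never causes the last good snapshots to be deleted.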

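Velero Backup

# Install Velero. Since v1.2 the provider plugin must be named explicitly;
# v1.9.0 is an example tag, pick the one that matches your Velero release.
velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.9.0 \
  --bucket my-k8s-backup \
  --secret-file ./credentials \
  --backup-location-config region=us-west-2

# Create a scheduled backup (daily at 02:00). --schedule belongs to
# "velero schedule create", not "velero backup create".
velero schedule create daily-backup --schedule="0 2 * * *"

# Restore from the most recent backup created by that schedule
velero restore create --from-schedule daily-backup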
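Before restoring, it is worth confirming that the chosen backup actually completed. The sketch below reads a backup's phase from its custom resource and maps it to an exit code. The function names and exit codes are illustrative. The phase names handled explicitly (`Completed`, `Failed`, `PartiallyFailed`, `FailedValidation`) are Velero backup phases, and anything else is treated as "not ready yet".

```bash
#!/bin/bash
# Maps a Velero backup phase to: 0 = usable, 1 = failed, 2 = not finished/unknown.
classify_backup_phase() {
  case "$1" in
    Completed) return 0 ;;
    Failed|PartiallyFailed|FailedValidation) return 1 ;;
    *) return 2 ;;
  esac
}

# Reads the phase of backup $1 from the cluster (assumes the velero namespace).
backup_phase() {
  kubectl -n velero get backups.velero.io "$1" -o jsonpath='{.status.phase}'
}

# Usage:
#   if classify_backup_phase "$(backup_phase my-backup-name)"; then
#     velero restore create --from-backup my-backup-name
#   fi
```

Whether a `PartiallyFailed` backup is acceptable for restore is a policy decision. This sketch treats it as a failure to stay on the safe side.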

In Practice: A Complete DR Setup

# 1. Scheduled Velero backups
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-backup
  namespace: velero
spec:
  schedule: "0 2 * * *"
  template:
    ttl: 720h
    includedNamespaces:
      - production
    storageLocation: default
    volumeSnapshotLocations:
      - default

---
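# 2. Secondary backup location in another region. Velero does not copy
#    backups between locations by itself: replicate the bucket (e.g. S3
#    Cross-Region Replication) or add a second Schedule that sets
#    storageLocation: secondary.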
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: secondary
  namespace: velero
spec:
  provider: aws
  objectStorage:
    bucket: my-k8s-backup-secondary
  config:
    region: us-east-1

---
# 3. Restore
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: disaster-recovery
  namespace: velero
spec:
  # Backups created by a Schedule are named <schedule>-<YYYYMMDDhhmmss> (UTC)
  backupName: daily-backup-20240101020000
  includedNamespaces:
    - production
  restorePVs: true

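Database Backup Strategy

#!/bin/bash
# Full MySQL backup: dump, upload to S3, prune old local copies.
# Without pipefail, a failed mysqldump would still upload a truncated file.
set -euo pipefail

BACKUP_DIR="/backup/mysql"
DATE=$(date +%Y%m%d_%H%M%S)
KEEP_DAYS=7
BACKUP_FILE="$BACKUP_DIR/full_$DATE.sql.gz"

mkdir -p "$BACKUP_DIR"

# Full backup (--single-transaction gives a consistent snapshot for InnoDB tables only)
mysqldump --all-databases --single-transaction \
  --routines --triggers --events \
  | gzip > "$BACKUP_FILE"

# Upload to S3
aws s3 cp "$BACKUP_FILE" "s3://my-backup/mysql/full_$DATE.sql.gz"

# Remove local backups older than KEEP_DAYS days
find "$BACKUP_DIR" -name "*.gz" -mtime +"$KEEP_DAYS" -delete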
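A backup file that exists is not necessarily a backup that restores. As a cheap first check before uploading, the sketch below verifies that the archive is valid gzip, is not suspiciously small, and ends with the `-- Dump completed` trailer that mysqldump writes when it finishes (unless comments are disabled). The function name and the 1024-byte default are assumptions.

```bash
#!/bin/bash
# Returns 0 when $1 looks like a complete, intact mysqldump archive.
#   $2 = minimum acceptable size in bytes (assumed default: 1024)
verify_backup() {
  local file=$1 min_bytes=${2:-1024}
  [ -f "$file" ] || { echo "missing: $file" >&2; return 1; }

  # gzip -t checks the compressed stream without writing any output.
  gzip -t "$file" 2>/dev/null || { echo "corrupt gzip: $file" >&2; return 1; }

  local size=$(( $(wc -c < "$file") ))
  [ "$size" -ge "$min_bytes" ] || { echo "too small ($size bytes): $file" >&2; return 1; }

  # mysqldump writes this trailer only after the dump has finished.
  gzip -dc "$file" | tail -n 1 | grep -q '^-- Dump completed' \
    || { echo "incomplete dump: $file" >&2; return 1; }
}

# Usage:
#   verify_backup /backup/mysql/full_20240101_020000.sql.gz && echo "backup looks intact"
```

This only shows the file is structurally complete. A periodic restore into a scratch database remains the real test.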

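Failover Drill

#!/bin/bash
# Failover drill script (run it against a test environment first)
set -euo pipefail

echo "=== Failover drill ==="

# 1. Simulate a primary failure
echo "Simulating primary failure..."
kubectl delete pod mysql-primary-0

# 2. Wait for automatic failover
echo "Waiting for failover..."
sleep 30

# 3. Verify the new primary: a promoted node should report read_only = 0
echo "Verifying new primary..."
kubectl exec mysql-secondary-0 -- mysql -e "SELECT @@server_id, @@read_only"

# 4. Spot-check data. A row count is only a rough consistency check;
#    qualify the table as <database>.users if no default database is set.
echo "Checking data..."
kubectl exec mysql-secondary-0 -- mysql -e "SELECT COUNT(*) FROM users"

echo "=== Drill complete ==="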
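A fixed 30-second sleep either wastes time or gives up too early, and it hides how long failover really took. One alternative is to poll until the replica is promoted, which also yields a measured recovery time to compare against the RTO target. In the sketch below, `wait_until` and `replica_is_writable` are hypothetical helper names, and the probe assumes a promoted node reports `read_only = 0`.

```bash
#!/bin/bash
# Runs the given command every $2 seconds until it succeeds or $1 seconds pass.
# On success, prints the elapsed seconds (an approximation of the observed RTO).
wait_until() {
  local timeout=$1 interval=$2
  shift 2
  local start=$SECONDS
  while ! "$@" >/dev/null 2>&1; do
    if (( SECONDS - start >= timeout )); then
      return 1
    fi
    sleep "$interval"
  done
  echo $(( SECONDS - start ))
}

# Hypothetical probe: succeeds once the replica pod accepts writes.
# The pod name follows the drill script above; adjust it to your deployment.
replica_is_writable() {
  [ "$(kubectl exec mysql-secondary-0 -- mysql -N -e 'SELECT @@read_only')" = "0" ]
}

# Usage inside the drill, in place of "sleep 30":
#   if elapsed=$(wait_until 120 2 replica_is_writable); then
#     echo "Failover completed in ${elapsed}s"
#   else
#     echo "Failover did not complete within 120s" >&2
#     exit 1
#   fi
```

Recording the elapsed time on every drill gives a trend line, so a regression in failover speed shows up before a real outage does.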

Monitoring and Alerting

groups:
  - name: disaster-recovery
    rules:
      - alert: DatabaseReplicationLag
        expr: mysql_slave_status_seconds_behind_master > 30
        for: 5m
        labels:
          severity: critical
        annotations:
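          summary: "Database replication lag exceeds 30 seconds"

      - alert: BackupFailed
        # Velero reports failed backups through the velero_backup_failure_total counter
        expr: increase(velero_backup_failure_total[1h]) > 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Velero backup failed"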

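Best Practices

  1. Back up regularly and test restores
  2. Keep redundant copies in multiple regions
  3. Automate failover
  4. Monitor replication status
  5. Document recovery procedures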

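Summary

Disaster recovery is a core capability of enterprise operations. Layered backups, automatic failover, and regular drills together help keep the business running when a failure occurs.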