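Capacity Planning: Resource Forecasting and Planning
Capacity Planning Framework
Capacity planning process:
├── 1. Collect historical data
├── 2. Analyze usage patterns
├── 3. Forecast future demand
├── 4. Draw up a scaling plan
└── 5. Validate and adjust
Data Collection
System Metrics Collection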
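#!/bin/bash
# collect-metrics.sh
# Samples CPU, memory, disk, and network figures every 5 seconds for 24 hours.
METRICS_DIR="/data/metrics"
METRICS_FILE="$METRICS_DIR/$(date +%Y%m%d).csv"
mkdir -p "$METRICS_DIR"
# Column names are kept as-is because analyze-usage.py reads them by name
# (time, CPU usage, memory usage, disk usage, network traffic).
echo "时间,CPU使用率,内存使用率,磁盘使用率,网络流量" > "$METRICS_FILE"
# 17280 samples x 5 s = 24 h
for i in $(seq 1 17280); do
    TIMESTAMP=$(date '+%Y-%m-%d %H:%M:%S')
    # First CPU field reported by top, typically user-space time only
    CPU=$(top -bn1 | grep "Cpu(s)" | awk '{print $2}' | cut -d'%' -f1)
    MEMORY=$(free | grep Mem | awk '{printf "%.2f", $3/$2 * 100.0}')
    DISK=$(df -h / | tail -1 | awk '{print $5}' | cut -d'%' -f1)
    # Cumulative bytes received on eth0 (a counter, not a rate)
    NETWORK=$(grep 'eth0:' /proc/net/dev | awk '{print $2}')
    echo "$TIMESTAMP,$CPU,$MEMORY,$DISK,$NETWORK" >> "$METRICS_FILE"
    sleep 5
done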
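
The network column written by the script above is a cumulative byte counter read from /proc/net/dev, so it has to be differenced before it says anything about traffic. A minimal sketch of that step follows; the function name counter_to_rates and the sample readings are illustrative assumptions, not part of the original tooling.

```python
def counter_to_rates(samples, interval_seconds=5.0):
    """Convert cumulative byte counters into bytes-per-second rates.

    A negative difference is treated as a counter reset (for example after
    a reboot) and yields None for that interval instead of a negative rate.
    """
    rates = []
    for previous, current in zip(samples, samples[1:]):
        delta = current - previous
        rates.append(delta / interval_seconds if delta >= 0 else None)
    return rates


if __name__ == "__main__":
    # Made-up counter readings taken 5 seconds apart; the drop to 500
    # simulates a counter reset.
    rx_bytes = [1_000, 6_000, 16_000, 500, 5_500]
    print(counter_to_rates(rx_bytes))
```

The interval defaults to 5 seconds to match the sleep in the collection loop; if the sampling period changes, the same value should be passed here.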
Prometheus Metrics
# prometheus-config.yaml
scrape_configs:
  - job_name: 'system-metrics'
    static_configs:
      - targets: ['node-exporter:9100']
    scrape_interval: 30s
  - job_name: 'app-metrics'
    static_configs:
      - targets: ['app:8080']
    metrics_path: '/metrics'
Usage Pattern Analysis
Python Analysis Script
#!/usr/bin/env python3
# analyze-usage.py
from datetime import datetime, timedelta

import pandas as pd


def analyze_usage_patterns(data_file):
    """Analyze usage patterns."""
    # Column names match the CSV header written by collect-metrics.sh
    # ('时间' = timestamp, 'CPU使用率' = CPU usage in percent).
    df = pd.read_csv(data_file, parse_dates=['时间']).set_index('时间')
    # Aggregate by hour ('H' is deprecated in pandas >= 2.2 in favor of 'h')
    hourly = df.resample('H').mean()
    # Identify the ten busiest hours
    peak_hours = hourly['CPU使用率'].nlargest(10)
    print("Peak hours:")
    print(peak_hours)
    # Mean day-over-day growth of CPU usage (one column, so the result is a scalar)
    daily_avg = df['CPU使用率'].resample('D').mean()
    growth_rate = daily_avg.pct_change().mean()
    print(f"\nAverage daily growth rate: {growth_rate:.2%}")
    return hourly, growth_rate


def predict_capacity(hourly_data, growth_rate, days_ahead=90):
    """Forecast future capacity demand by compounding the daily growth rate."""
    last_value = hourly_data['CPU使用率'].iloc[-1]
    predictions = []
    for day in range(1, days_ahead + 1):
        # Values above 100 mean demand would exceed the current capacity
        predicted = last_value * (1 + growth_rate) ** day
        predictions.append({
            'date': datetime.now() + timedelta(days=day),
            'predicted_cpu': predicted
        })
    return pd.DataFrame(predictions)


if __name__ == "__main__":
    hourly, growth = analyze_usage_patterns("/data/metrics/usage.csv")
    predictions = predict_capacity(hourly, growth)
    print("\nCapacity forecast:")
    print(predictions.tail(10))
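
Averaging day-over-day percentage changes, as the script above does, can be noisy when individual days swing a lot. One alternative, swapped in here purely for comparison, is to fit a straight line through the daily averages and ask when it crosses a utilization threshold. The sketch below assumes a least-squares fit with numpy.polyfit; the function name days_until_threshold and the 80% default are illustrative choices.

```python
import numpy as np


def days_until_threshold(daily_values, threshold=80.0):
    """Fit a straight line to daily averages and estimate days to threshold.

    Returns 0.0 if the fitted value for the latest day is already at or
    above the threshold, and None if the trend is flat or falling.
    """
    y = np.asarray(daily_values, dtype=float)
    if len(y) < 2:
        raise ValueError("need at least two daily values to fit a trend")
    x = np.arange(len(y))
    slope, intercept = np.polyfit(x, y, 1)  # least-squares line y = slope*x + intercept
    fitted_now = slope * x[-1] + intercept
    if fitted_now >= threshold:
        return 0.0
    if slope <= 1e-9:  # treat numerically flat trends as "never"
        return None
    return float((threshold - fitted_now) / slope)


if __name__ == "__main__":
    # Made-up daily CPU averages rising by one point per day
    print(days_until_threshold([50, 51, 52, 53, 54]))
```

A linear fit suits steady, additive growth; for growth that compounds, the exponential projection in predict_capacity is the closer model.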
Capacity Model
Capacity Calculation
#!/usr/bin/env python3
# capacity-model.py
import math


class CapacityPlanner:
    def __init__(self):
        self.sla_target = 99.9        # SLA target in percent (not used in the calculations below)
        self.headroom_percent = 30    # safety buffer on top of the forecast

    def calculate_required_capacity(self, current_usage, growth_rate, months):
        """Calculate the required capacity, compounding growth monthly."""
        required = current_usage * (1 + growth_rate) ** months
        # Add headroom
        required_with_headroom = required * (1 + self.headroom_percent / 100)
        return required_with_headroom

    def estimate_instances(self, required_capacity, instance_capacity):
        """Estimate the number of instances, rounding up."""
        return math.ceil(required_capacity / instance_capacity)

    def cost_estimate(self, instances, instance_cost):
        """Estimate the cost."""
        return instances * instance_cost


# Usage example
planner = CapacityPlanner()
required = planner.calculate_required_capacity(
    current_usage=1000,  # current QPS
    growth_rate=0.15,    # 15% monthly growth
    months=6             # forecast 6 months ahead
)
instances = planner.estimate_instances(required, 200)  # 200 QPS per instance
cost = planner.cost_estimate(instances, 50)  # $50 per instance per month
print(f"Required capacity in 6 months: {required:.0f} QPS")
print(f"Instances required: {instances}")
print(f"Estimated monthly cost: ${cost}")
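
The planner above answers how much capacity is needed after a given number of months. The inverse question, how long an already provisioned fleet will last, follows from the same compound-growth model by solving for the exponent. The sketch below is one way to do that; months_until_exhausted and its parameters are hypothetical names, and the 30% headroom default mirrors the planner.

```python
import math


def months_until_exhausted(current_usage, growth_rate, provisioned_capacity,
                           headroom_percent=30):
    """Months until usage plus headroom no longer fits the provisioned capacity.

    Solves current_usage * (1 + growth_rate) ** m * (1 + headroom) = capacity
    for m. Returns 0.0 if the headroom is already violated and math.inf if
    usage is not growing.
    """
    usable = provisioned_capacity / (1 + headroom_percent / 100)
    if current_usage >= usable:
        return 0.0
    if growth_rate <= 0:
        return math.inf
    return math.log(usable / current_usage) / math.log(1 + growth_rate)


if __name__ == "__main__":
    # 16 instances at 200 QPS each, starting from 1000 QPS and 15% monthly growth
    months = months_until_exhausted(1000, 0.15, 16 * 200)
    print(f"Headroom is used up after about {months:.1f} months")
```

With the figures from the usage example, the 16-instance fleet lasts a little over six months, which is consistent with it having been sized for a six-month horizon.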
Scaling Strategies
Horizontal Scaling
# horizontal-autoscaling.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: app
  minReplicas: 3
  maxReplicas: 50
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Percent
          value: 100
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 10
          periodSeconds: 60
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
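
To reason about how the autoscaler above reacts to load, it helps to work through the replica calculation documented for the Kubernetes HorizontalPodAutoscaler: the desired count is the current count scaled by the ratio of observed to target utilization, rounded up, with no change when the ratio is within a tolerance (0.1 by default). The sketch below reproduces only that core formula with the limits from the manifest; it deliberately ignores the behavior policies and stabilization windows, and the function name is an illustrative assumption.

```python
import math


def desired_replicas(current_replicas, current_utilization,
                     target_utilization=70, min_replicas=3, max_replicas=50,
                     tolerance=0.1):
    """Approximate the HPA replica calculation for a utilization target."""
    ratio = current_utilization / target_utilization
    # Within the tolerance band the autoscaler leaves the replica count alone
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured minReplicas/maxReplicas
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    for replicas, utilization in [(4, 90), (4, 72), (10, 20), (40, 140)]:
        print(replicas, utilization, desired_replicas(replicas, utilization))
```

In a real cluster the scaleUp and scaleDown policies further limit how fast the count may move toward this value, for example by at most 10% per minute on the way down in the manifest above.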
Vertical Scaling
# vertical-autoscaling.yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: app
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
      - containerName: app
        minAllowed:
          cpu: "250m"
          memory: "256Mi"
        maxAllowed:
          cpu: "4"
          memory: "8Gi"
Database Capacity
Connection Pool Configuration
# database-pool.yaml
database:
  connection_pool:
    min_connections: 5
    max_connections: 100
    idle_timeout: 30s
    max_lifetime: 3600s
  read_replicas:
    enabled: true
    count: 2
    lag_threshold: 100  # milliseconds
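
A value such as max_connections is easier to justify when it is tied to the expected load. One common rule of thumb is Little's law: the average number of connections in use equals the query rate multiplied by the average time each query holds a connection. The sketch below applies it with a safety factor; the function name, the 1.5 factor, and the example load are assumptions for illustration, not measurements.

```python
import math


def required_connections(queries_per_second, avg_query_ms, safety_factor=1.5):
    """Estimate pool size from Little's law: concurrency = rate * duration."""
    busy_connections = queries_per_second * avg_query_ms / 1000
    return math.ceil(busy_connections * safety_factor)


if __name__ == "__main__":
    # Hypothetical load: 2000 queries per second, 20 ms per query on average
    needed = required_connections(2000, 20)
    print(f"Estimated connections needed: {needed}")
```

If the estimate approaches the configured maximum, the usual options are to raise the limit, shorten query time, or move read traffic to the replicas.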
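Storage Expansion
#!/bin/bash
# expand-storage.sh
# Usage: expand-storage.sh <volume-id> <new-size-GiB>
set -euo pipefail
VOLUME_ID=${1:?usage: $0 <volume-id> <new-size-GiB>}
NEW_SIZE=${2:?usage: $0 <volume-id> <new-size-GiB>}
echo "Expanding volume: $VOLUME_ID"
# Expand the AWS EBS volume
aws ec2 modify-volume \
    --volume-id "$VOLUME_ID" \
    --size "$NEW_SIZE"
# Poll until the modification completes; abort if it fails
while true; do
    STATE=$(aws ec2 describe-volumes-modifications \
        --volume-ids "$VOLUME_ID" \
        --query 'VolumesModifications[0].ModificationState' \
        --output text)
    [ "$STATE" = "completed" ] && break
    if [ "$STATE" = "failed" ]; then
        echo "Volume modification failed" >&2
        exit 1
    fi
    echo "Waiting for the expansion to complete..."
    sleep 10
done
echo "Volume expansion complete"
# The partition and filesystem on the instance still need to be extended
# (for example with growpart and resize2fs or xfs_growfs) to use the new space.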
Monitoring and Alerting
Capacity Alerts
# prometheus-rules.yaml
groups:
  - name: capacity-alerts
    rules:
      - alert: HighCPUUtilization
        expr: 100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "High CPU utilization"
          description: "CPU utilization is above 80%; current value: {{ $value }}%"
      - alert: HighMemoryUtilization
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 85
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "High memory utilization"
          description: "Memory utilization is above 85%; current value: {{ $value }}%"
      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 < 20
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space"
          description: "Less than 20% of disk space remains"
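
Each rule above carries a 15-minute for clause, which means a single spike does not fire the alert: the expression has to stay true at every evaluation until 15 minutes have passed since it first became true, and any evaluation where it is false resets the clock. The sketch below imitates that behavior on a list of timestamped samples. It is a simplified model for reasoning about thresholds, not Prometheus code, and the function name is hypothetical.

```python
def first_firing_time(samples, threshold, for_seconds):
    """Return the timestamp at which a 'value > threshold' alert would fire.

    samples is an iterable of (timestamp_seconds, value) pairs in time order,
    one per rule evaluation. Returns None if the alert never fires.
    """
    active_since = None
    for timestamp, value in samples:
        if value > threshold:
            if active_since is None:
                active_since = timestamp  # condition just became true (pending)
            if timestamp - active_since >= for_seconds:
                return timestamp          # held long enough: firing
        else:
            active_since = None           # condition cleared: reset the clock
    return None


if __name__ == "__main__":
    # Made-up evaluations every 60 s with CPU steadily at 90%
    steady = [(t, 90.0) for t in range(0, 1000, 60)]
    print(first_firing_time(steady, threshold=80, for_seconds=900))
```

Replaying historical samples through a model like this is one way to check whether a threshold and duration would have paged too often or too late.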
Capacity Dashboard
{
  "title": "Capacity Planning Dashboard",
  "panels": [
    {
      "title": "CPU Utilization Trend",
      "type": "timeseries",
      "targets": [{
        "expr": "100 - (avg(rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)"
      }]
    },
    {
      "title": "Memory Utilization",
      "type": "gauge",
      "targets": [{
        "expr": "(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100"
      }]
    },
    {
      "title": "Storage Utilization",
      "type": "timeseries",
      "targets": [{
        "expr": "(1 - node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100"
      }]
    }
  ]
}
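Best Practices
- Continuous monitoring: collect and analyze system metrics continuously
- Regular reviews: review capacity usage every month
- Plan ahead: plan capacity 3-6 months in advance
- Automated scaling: put automated scaling mechanisms in place
- Cost balance: balance performance against cost
- Load testing: run load tests regularly to validate capacity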