Kubernetes Monitoring and Logging
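Monitoring Architecture
A Kubernetes monitoring stack typically consists of the following components:
- Prometheus: metrics collection and storage
- Grafana: visualization
- Alertmanager: alert routing and notification
- EFK/ELK: log collection and analysis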
Deploying Prometheus + Grafana
Deploying with Helm
# Add the chart repository
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
# Install kube-prometheus-stack as release "monitoring" in the "monitoring" namespace
helm install monitoring prometheus-community/kube-prometheus-stack \
  -n monitoring --create-namespace
# Check the deployment status
kubectl get pods -n monitoring
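Reading the Pod list by eye works for a first install, but the same check can be scripted. The following is a minimal sketch that assumes the output of `kubectl get pods -n monitoring -o json` has already been loaded into a dictionary; the function name `not_ready_pods` and the sample Pod names are hypothetical and exist only for illustration.

```python
def not_ready_pods(pod_list: dict) -> list[str]:
    """Return the names of Pods that are not Running with every container ready."""
    problems = []
    for pod in pod_list.get("items", []):
        name = pod["metadata"]["name"]
        status = pod.get("status", {})
        phase = status.get("phase")
        if phase == "Succeeded":
            # Completed one-off Pods (for example, from Jobs) are not a problem.
            continue
        containers = status.get("containerStatuses", [])
        all_ready = bool(containers) and all(c.get("ready") for c in containers)
        if phase != "Running" or not all_ready:
            problems.append(name)
    return problems


# Hypothetical sample in the shape returned by `kubectl get pods -o json`.
sample = {
    "items": [
        {"metadata": {"name": "prometheus-example-0"},
         "status": {"phase": "Running",
                    "containerStatuses": [{"ready": True}, {"ready": True}]}},
        {"metadata": {"name": "grafana-example"},
         "status": {"phase": "Pending"}},
        {"metadata": {"name": "setup-job-example"},
         "status": {"phase": "Succeeded",
                    "containerStatuses": [{"ready": False}]}},
    ]
}

print(not_ready_pods(sample))
```

In practice the dictionary would come from `json.load(sys.stdin)` with the kubectl output piped in; an empty result means every long-running Pod is up.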
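Accessing Grafana
# Retrieve the admin password from the Secret created by the chart
kubectl get secret monitoring-grafana -n monitoring -o jsonpath="{.data.admin-password}" | base64 -d
# Forward the Grafana service to a local port
kubectl port-forward svc/monitoring-grafana -n monitoring 3000:80
# Then open http://localhost:3000 in a browser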
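The `base64 -d` step is needed because Kubernetes stores Secret values base64-encoded, not because they are encrypted. A small sketch of the same decoding in Python is shown here; the password string is a made-up placeholder, not the chart's actual default.

```python
import base64


def decode_secret_value(encoded: str) -> str:
    """Decode one value from a Secret's `.data` map (base64 to UTF-8 text)."""
    return base64.b64decode(encoded).decode("utf-8")


# Hypothetical password, encoded the way it would appear under `.data.admin-password`.
example_password = "example-password"
encoded = base64.b64encode(example_password.encode("utf-8")).decode("ascii")

print(encoded)                       # what `kubectl get secret -o jsonpath=...` prints
print(decode_secret_value(encoded))  # what `| base64 -d` recovers
```

Anyone who can read the Secret can recover the password, so access to Secrets in the `monitoring` namespace should be restricted accordingly.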
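Prometheus Configuration
Custom ServiceMonitor
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app
  labels:
    # With the chart's default selector settings, this label must match the Helm release name
    release: monitoring
spec:
  selector:
    matchLabels:
      app: my-app        # selects Services that carry this label
  endpoints:
    - port: metrics      # name of the Service port that exposes metrics
      interval: 30s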
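A ServiceMonitor that selects nothing is a common cause of missing targets, so it helps to be precise about how `matchLabels` behaves: every key/value pair in the selector must be present on the Service, while extra labels on the Service are ignored. The sketch below illustrates that rule only; `matches_labels` is a hypothetical helper, not part of any Kubernetes client library.

```python
def matches_labels(match_labels: dict[str, str], labels: dict[str, str]) -> bool:
    """True if every selector pair is present, with the same value, in `labels`."""
    return all(labels.get(key) == value for key, value in match_labels.items())


selector = {"app": "my-app"}

# Extra labels on the Service do not prevent a match.
print(matches_labels(selector, {"app": "my-app", "tier": "backend"}))
# A different value for a selected key does.
print(matches_labels(selector, {"app": "other-app"}))
# A missing key does as well.
print(matches_labels(selector, {"tier": "backend"}))
```

The same rule applies one level up: with the chart's default settings, Prometheus picks up this ServiceMonitor only because its own `release: monitoring` label matches the selector configured by the Helm release.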
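Creating PrometheusRule Alerting Rules
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-app-alerts
  labels:
    # With the chart's default selector settings, rules without this label are not loaded
    release: monitoring
spec:
  groups:
    - name: my-app
      rules:
        - alert: HighErrorRate
          # Per-series rate of HTTP 5xx responses, in requests per second
          expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.1
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "High error rate detected"
            description: "Error rate is {{ $value }} per second"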
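Two mechanisms in this rule are easy to misread: `rate(...[5m])` turns a monotonically increasing counter into a per-second value, and `for: 5m` keeps the alert in a pending state until the condition has held continuously for five minutes. The sketch below models both in a simplified way. It assumes evenly spaced samples and omits the extrapolation that Prometheus applies at window boundaries, so treat it as an illustration of the idea rather than a reimplementation.

```python
def simple_rate(samples: list[tuple[float, float]]) -> float:
    """Per-second increase of a counter given (timestamp, value) samples.

    Simplified: counter resets are handled, window extrapolation is not.
    """
    if len(samples) < 2:
        return 0.0
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop in value means the counter restarted from zero.
        increase += (cur - prev) if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else 0.0


def alert_states(evaluations: list[tuple[float, bool]], for_seconds: float) -> list[str]:
    """State after each rule evaluation: 'inactive', 'pending' or 'firing'."""
    states = []
    active_since = None
    for timestamp, condition in evaluations:
        if not condition:
            active_since = None          # any false evaluation resets the timer
            states.append("inactive")
            continue
        if active_since is None:
            active_since = timestamp
        held = timestamp - active_since
        states.append("firing" if held >= for_seconds else "pending")
    return states


# Hypothetical counter of 5xx responses, sampled every 60 s: 12 errors per minute.
samples = [(t, t / 60 * 12) for t in range(0, 301, 60)]
error_rate = simple_rate(samples)
print(error_rate > 0.1)

# Condition true at every evaluation from t=0 to t=360 with `for: 5m`.
print(alert_states([(t, True) for t in range(0, 361, 60)], for_seconds=300))
```

The practical consequence is that a short spike of errors never reaches Alertmanager; only a sustained condition does.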
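Querying Metrics
# PromQL examples
# CPU utilization (%), averaged across all nodes and cores
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory utilization (%), per node
100 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100)
# Pod network receive throughput (bytes per second)
rate(container_network_receive_bytes_total{namespace="default"}[5m])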
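The CPU and memory expressions both work by subtraction: they measure the unused share and take it away from 100. The arithmetic can be checked by hand with a short sketch. The input numbers are invented for illustration, and the function names are hypothetical.

```python
def cpu_usage_percent(idle_rates: list[float]) -> float:
    """Mirror of 100 - avg(rate(idle)) * 100.

    Each entry is the fraction of each second one CPU core spent idle (0.0 to 1.0).
    """
    average_idle = sum(idle_rates) / len(idle_rates)
    return 100 - average_idle * 100


def memory_usage_percent(available_bytes: float, total_bytes: float) -> float:
    """Mirror of 100 - MemAvailable / MemTotal * 100."""
    return 100 - available_bytes / total_bytes * 100


GIB = 1024 ** 3

# Four cores idle 90%, 70%, 80% and 60% of the time: 75% idle on average.
print(round(cpu_usage_percent([0.9, 0.7, 0.8, 0.6]), 2))

# 4 GiB available out of 16 GiB total.
print(memory_usage_percent(4 * GIB, 16 * GIB))
```

Because the CPU expression averages over every series it matches, one saturated node can be hidden by several idle ones; adding `by (instance)` to the aggregation is the usual way to get a per-node figure.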
Log Collection
Deploying the EFK Stack
# Deploy Elasticsearch
kubectl apply -f elasticsearch.yaml
# Deploy Fluentd
kubectl apply -f fluentd.yaml
# Deploy Kibana
kubectl apply -f kibana.yaml
Fluentd DaemonSet Configuration
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd
  namespace: logging
spec:
  selector:
    matchLabels:
      app: fluentd
  template:
    metadata:
      labels:
        app: fluentd
    spec:
      containers:
        - name: fluentd
          # Pin a specific image tag instead of "latest" for reproducible deployments
          image: fluent/fluentd-kubernetes-daemonset:latest
          env:
            - name: FLUENT_ELASTICSEARCH_HOST
              value: "elasticsearch.logging.svc.cluster.local"
          volumeMounts:
            - name: varlog
              mountPath: /var/log
            - name: containers
              mountPath: /var/lib/docker/containers
              readOnly: true
      volumes:
        - name: varlog
          hostPath:
            path: /var/log
        # Docker's log directory; on containerd nodes the logs are under /var/log/pods,
        # which the varlog mount above already covers
        - name: containers
          hostPath:
            path: /var/lib/docker/containers
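The two host paths matter because the on-disk log format depends on the container runtime. Docker's json-file driver writes one JSON object per line, while CRI runtimes such as containerd write plain lines of the form `<timestamp> <stream> <P|F> <message>`. The following sketch shows how a collector might normalize both; it is an illustration of the formats, not Fluentd's actual parser, and the sample lines are invented.

```python
import json


def parse_container_log_line(line: str) -> dict:
    """Normalize one log line from either format into time, stream and message."""
    line = line.rstrip("\n")
    if line.startswith("{"):
        # Docker json-file format: {"log": "...", "stream": "...", "time": "..."}
        record = json.loads(line)
        return {
            "time": record["time"],
            "stream": record["stream"],
            "message": record["log"].rstrip("\n"),
        }
    # CRI format: timestamp, stream, partial/full tag, then the message itself.
    parts = line.split(" ", 3)
    timestamp, stream = parts[0], parts[1]
    message = parts[3] if len(parts) > 3 else ""
    return {"time": timestamp, "stream": stream, "message": message}


docker_line = '{"log":"request served\\n","stream":"stdout","time":"2024-01-01T00:00:00.000000000Z"}'
cri_line = "2024-01-01T00:00:00.000000000Z stderr F connection refused"

print(parse_container_log_line(docker_line))
print(parse_container_log_line(cri_line))
```

If Kibana shows each log record as an unparsed string, a mismatch between the runtime's format and the collector's configured parser is a likely cause.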
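Common Monitoring Commands
# Show Pod and node resource usage (requires metrics-server)
kubectl top pods
kubectl top nodes
# Open the Prometheus UI (the service name is prefixed with the Helm release name)
kubectl port-forward svc/monitoring-kube-prometheus-prometheus -n monitoring 9090:9090
# Open the Grafana dashboards
kubectl port-forward svc/monitoring-grafana -n monitoring 3000:80
# Tail application logs
kubectl logs -l app=my-app -n default --tail=100 -f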
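With the Prometheus port-forward running, metrics can also be read programmatically through the HTTP API endpoint `/api/v1/query`. The sketch below uses only the standard library and assumes Prometheus is reachable at `http://localhost:9090`; the helper names and the sample response values are illustrative.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL, percent-encoding the PromQL expression."""
    query_string = urllib.parse.urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{query_string}"


def extract_values(response: dict) -> dict[str, float]:
    """Map each series' `instance` label to its value in an instant-vector response."""
    if response.get("status") != "success":
        raise ValueError(f"query failed: {response.get('error', 'unknown error')}")
    return {
        series["metric"].get("instance", ""): float(series["value"][1])
        for series in response["data"]["result"]
    }


def query(base_url: str, promql: str) -> dict:
    """Run an instant query; needs a reachable Prometheus, so it is not called here."""
    with urllib.request.urlopen(build_query_url(base_url, promql), timeout=10) as resp:
        return json.load(resp)


print(build_query_url("http://localhost:9090", 'up{job="node-exporter"}'))

# Hypothetical response in the shape the API returns for an instant vector.
sample_response = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"instance": "10.0.0.1:9100"}, "value": [1700000000, "1"]},
            {"metric": {"instance": "10.0.0.2:9100"}, "value": [1700000000, "0"]},
        ],
    },
}
print(extract_values(sample_response))
```

Sample values arrive as strings in the JSON payload, which is why the sketch converts them with `float`.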
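Alerting Configuration
Alertmanager Configuration
global:
  resolve_timeout: 5m
route:
  group_by: ['alertname', 'severity']   # alerts sharing these labels are sent as one notification
  group_wait: 10s                       # delay before the first notification for a new group
  group_interval: 10s                   # delay before notifying about new alerts in an existing group
  repeat_interval: 1h                   # delay before re-sending a notification that is still firing
  receiver: 'web.hook'
receivers:
  - name: 'web.hook'
    webhook_configs:
      - url: 'http://webhook:5001/'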
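The configuration above sends every notification as an HTTP POST with a JSON body to `http://webhook:5001/`, but the receiving side is left open. The following is a minimal sketch of such a receiver using only the standard library. It assumes the standard Alertmanager webhook payload, where an `alerts` array holds each alert's `status`, `labels` and `annotations`; the handler, the one-line output format and the sample payload are all illustrative choices.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def summarize_alerts(payload: dict) -> list[str]:
    """Turn a webhook payload into one readable line per alert."""
    lines = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        status = alert.get("status", "unknown")
        name = labels.get("alertname", "?")
        severity = labels.get("severity", "-")
        summary = annotations.get("summary", "")
        lines.append(f"[{status}] {name} severity={severity}: {summary}")
    return lines


class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        for line in summarize_alerts(payload):
            print(line)
        # Any 2xx response tells Alertmanager that the notification was delivered.
        self.send_response(200)
        self.end_headers()


def serve(port: int = 5001) -> None:
    """Listen on the port referenced by the webhook URL; blocks until interrupted."""
    HTTPServer(("", port), WebhookHandler).serve_forever()


# Hypothetical payload for a single firing alert.
sample_payload = {
    "status": "firing",
    "alerts": [
        {
            "status": "firing",
            "labels": {"alertname": "HighErrorRate", "severity": "critical"},
            "annotations": {"summary": "High error rate detected"},
        }
    ],
}
print(summarize_alerts(sample_payload))
```

Calling `serve()` starts the listener. A real receiver would forward the summary to a chat or paging system rather than print it.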
Common Alerting Rules
groups:
  - name: node-alerts
    rules:
      - alert: NodeDown
        expr: up{job="node-exporter"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Node {{ $labels.instance }} is down"
      - alert: HighCPU
        # Average CPU utilization across all nodes above 80%
        expr: 100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
      - alert: HighMemory
        # Memory utilization of a node above 85%
        expr: 100 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100) > 85
        for: 5m
        labels:
          severity: warning
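Practical Examples
Monitoring a Spring Boot Application
# application.yml, after adding the micrometer-registry-prometheus dependency
management:
  endpoints:
    web:
      exposure:
        include: health,info,prometheus
  metrics:
    export:
      prometheus:
        # Spring Boot 2.x property; Spring Boot 3.x uses management.prometheus.metrics.export.enabled
        enabled: true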
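Once the endpoint is exposed, the application serves its metrics as plain text at `/actuator/prometheus`, one sample per line in the Prometheus exposition format. Checking that output is a quick way to confirm the configuration before involving Prometheus at all. The sketch below parses that format in a simplified way: it keeps escape sequences in label values as-is and ignores optional timestamps. The sample text is illustrative, not captured from a real application.

```python
import re

# name, optional {labels}, then the sample value
_SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
_LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text: str) -> list[tuple[str, dict[str, str], float]]:
    """Parse exposition-format text into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _SAMPLE_RE.match(line)
        if not match:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(_LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


sample_text = """\
# HELP jvm_memory_used_bytes The amount of used memory
# TYPE jvm_memory_used_bytes gauge
jvm_memory_used_bytes{area="heap",id="G1 Eden Space"} 1.2345E7
http_server_requests_seconds_count{method="GET",status="200",uri="/hello"} 42.0
"""

for name, labels, value in parse_metrics(sample_text):
    print(name, labels, value)
```

If the endpoint returns 404 instead of text like this, the `prometheus` entry is probably missing from the exposure list or the registry dependency is not on the classpath.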
Creating Grafana Dashboards
- Open Grafana
- Import prebuilt Kubernetes dashboards (IDs: 315, 641, 7249)
- Configure the Prometheus data source
Common Issues
Prometheus Cannot Scrape Metrics
# Check the ServiceMonitor
kubectl get servicemonitor
kubectl describe servicemonitor my-app
# Check the Endpoints
kubectl get endpoints -n monitoring
Grafana Shows No Data
# Check the data source configuration
kubectl get configmap -n monitoring
# Test the connection to Prometheus from inside the Grafana Pod
kubectl exec -it <grafana-pod> -n monitoring -- curl 'http://monitoring-kube-prometheus-prometheus:9090/api/v1/query?query=up'
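Best Practices
- Deploy the monitoring components with Helm so that installs and upgrades are repeatable
- Tune alert thresholds and `for` durations to the workload to avoid noisy alerts
- Check the health of the monitoring stack itself on a regular basis
- Retain metrics long enough for trend analysis and incident review
- Use labels to organize monitoring targets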
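Summary
A solid monitoring and logging system is the foundation of operations. With Prometheus and Grafana monitoring cluster state and the EFK stack collecting logs, problems can be detected and resolved promptly, keeping the system running reliably.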