Last updated: 2026-06-22 15:20:50
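This guide deploys a monitoring stack on a Kubernetes cluster. Its core components are cAdvisor (container resource collection), Node Exporter (node metrics collection), Prometheus (metrics aggregation and storage), and Grafana (visualization). DaemonSets ensure that the collectors run on every node, Services expose their ports, RBAC grants the required permissions, and a test Deployment verifies the result.
The role of each component and how it is deployed are described step by step below.
Prerequisites:
The Kubernetes cluster version must be 1.10 or later.
The kubectl command-line tool must be configured in advance.
The cluster must be able to pull external images, or the required images must be prepared beforehand.
All deployment operations are performed on the k8s01 node (10.0.5.43/24).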
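The version prerequisite can be checked with a small helper. The following is only a sketch: the function name version_ge is hypothetical, and it assumes a sort that supports -V (version sort, as in GNU coreutils). It compares plain dotted versions such as 1.24 without a leading "v".

```shell
# version_ge A B: succeeds when version A is greater than or equal to version B.
# Assumes `sort -V` (version sort) is available, as in GNU coreutils.
version_ge() {
    # After a version sort the smaller version comes first; A >= B when that is B.
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Example: compare a cluster version (typed in by hand here) with the minimum.
if version_ge "1.24" "1.10"; then
    echo "cluster version is new enough"
fi
```

In practice the first argument would be the server version reported by kubectl version, with any leading "v" removed.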
This walkthrough uses a three-node cluster with the following nodes:
k8s01: 10.0.5.43/24 (deployment node)
k8s02: 10.0.5.74/24
k8s03: 10.0.5.29/24
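Create the monitor namespace:
kubectl create ns monitor
Verify:
kubectl get ns | grep monitor
The official cAdvisor image is hosted in Google's container registry, which may be unreachable from mainland China, so a mirror image is used instead. This walkthrough uses cadvisordocker/cadvisor:v0.37.0.
docker pull cadvisordocker/cadvisor:v0.37.0
Note: there are many configuration files, so it is best to keep them together in a dedicated directory.
mkdir -p /opt/cadvisor_prome_gra
cd /opt/cadvisor_prome_gra
The deployment proceeds in the following steps:
Deploy cAdvisor (DaemonSet) to collect container resource metrics
Deploy Node Exporter (DaemonSet + Service) to collect node metrics
Deploy Prometheus (ConfigMap + Deployment + Service) to aggregate and store metrics
Configure RBAC permissions (ServiceAccount, ClusterRole, ClusterRoleBinding)
Deploy kube-state-metrics to collect cluster state metrics
Deploy Grafana, connect it to the Prometheus data source, and import dashboards
Create a test Deployment to verify the monitoring setup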
Create the DaemonSet manifest case1-daemonset-deploy-cadvisor.yaml:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: cadvisor
  namespace: monitor
spec:
  selector:
    matchLabels:
      app: cAdvisor
  template:
    metadata:
      labels:
        app: cAdvisor
    spec:
      # Allow scheduling onto master nodes as well
      tolerations:
      - effect: NoSchedule
        key: node-role.kubernetes.io/master
      hostNetwork: true
      restartPolicy: Always
      containers:
      - name: cadvisor
        image: cadvisordocker/cadvisor:v0.37.0
        imagePullPolicy: IfNotPresent
        ports:
        - containerPort: 8080
        volumeMounts:
        - name: root
          mountPath: /rootfs
        - name: run
          mountPath: /var/run
        - name: sys
          mountPath: /sys
        - name: docker
          mountPath: /var/lib/containerd
      volumes:
      - name: root
        hostPath:
          path: /
      - name: run
        hostPath:
          path: /var/run
      - name: sys
        hostPath:
          path: /sys
      - name: docker
        hostPath:
          path: /var/lib/containerd
Apply the manifest and verify:
kubectl apply -f case1-daemonset-deploy-cadvisor.yaml
kubectl get pod -n monitor -o wide
Expected result: the cluster has 3 nodes, so 3 Pods should be running, all in the Running state.
kubectl get pod -n monitor
# Expected output
NAME             READY   STATUS    RESTARTS   AGE
cadvisor-79g2l   1/1     Running   0          15m
cadvisor-q2qdz   1/1     Running   0          15m
cadvisor-sdbww   1/1     Running   0          15m
Test cAdvisor data collection: open <node IP>:8080 in a browser.
Note: the page loads slowly the first time, so please be patient.
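The "3 Pods, all Running" check above can also be scripted rather than read by eye. This is a minimal sketch, not part of the original procedure: count_running is a hypothetical helper that reads kubectl get pod output on standard input and counts the rows whose STATUS column is Running.

```shell
# count_running: reads `kubectl get pod` output on stdin and prints how many
# pods report STATUS "Running". STATUS is the third column; the header row is skipped.
count_running() {
    awk 'NR > 1 && $3 == "Running" { n++ } END { print n + 0 }'
}

# Demonstration with the expected output shown in this guide.
printf '%s\n' \
    'NAME             READY   STATUS    RESTARTS   AGE' \
    'cadvisor-79g2l   1/1     Running   0          15m' \
    'cadvisor-q2qdz   1/1     Running   0          15m' \
    'cadvisor-sdbww   1/1     Running   0          15m' | count_running
```

Against a live cluster the input would come from kubectl get pod -n monitor -l app=cAdvisor, and the result should equal the number of nodes.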
Create the DaemonSet and Service manifest case2-daemonset-deploy-node-exporter.yaml:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: monitor
  labels:
    k8s-app: node-exporter
spec:
  selector:
    matchLabels:
      k8s-app: node-exporter
  template:
    metadata:
      labels:
        k8s-app: node-exporter
    spec:
      tolerations:
      - effect: NoSchedule
        key: node-role.kubernetes.io/master
      containers:
      - image: prom/node-exporter:v1.3.1
        imagePullPolicy: IfNotPresent
        name: prometheus-node-exporter
        ports:
        - containerPort: 9100
          hostPort: 9100
          protocol: TCP
          name: metrics
        volumeMounts:
        - mountPath: /host/proc
          name: proc
        - mountPath: /host/sys
          name: sys
        - mountPath: /host
          name: rootfs
        # Point node-exporter at the host's filesystems mounted under /host
        args:
        - --path.procfs=/host/proc
        - --path.sysfs=/host/sys
        - --path.rootfs=/host
      volumes:
      - name: proc
        hostPath:
          path: /proc
      - name: sys
        hostPath:
          path: /sys
      - name: rootfs
        hostPath:
          path: /
      hostNetwork: true
      hostPID: true
---
apiVersion: v1
kind: Service
metadata:
  annotations:
    prometheus.io/scrape: "true"
  labels:
    k8s-app: node-exporter
  name: node-exporter
  namespace: monitor
spec:
  type: NodePort
  ports:
  - name: http
    port: 9100
    nodePort: 30000
    protocol: TCP
  selector:
    k8s-app: node-exporter
Apply the manifest and verify data collection:
kubectl apply -f case2-daemonset-deploy-node-exporter.yaml
kubectl get pod -n monitor
Open <node IP>:9100 in a browser and click Metrics to view the metric data.
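The Metrics page is plain text in the Prometheus exposition format, so a single metric can be picked out on the command line as well. The sketch below uses a hypothetical helper, metric_samples, and made-up sample values. node_load1 and node_cpu_seconds_total are real Node Exporter metric names, but the numbers are illustrative only.

```shell
# metric_samples NAME: reads Prometheus text-format metrics on stdin and prints
# every sample line of metric NAME, skipping "# HELP" and "# TYPE" comment lines.
metric_samples() {
    awk -v name="$1" '
        /^#/ { next }
        {
            # The metric name ends at the first "{" (labels) or space (value).
            split($0, parts, /[{ ]/)
            if (parts[1] == name) print
        }'
}

# Illustrative input; the values are invented for the demonstration.
printf '%s\n' \
    '# HELP node_load1 1m load average.' \
    '# TYPE node_load1 gauge' \
    'node_load1 0.42' \
    'node_load15 0.30' | metric_samples node_load1
```

With the DaemonSet above running, the same filter could be fed from curl -s http://<node IP>:9100/metrics.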
Create the ConfigMap case3-1-prometheus-cfg.yaml, which defines the scrape jobs:
kind: ConfigMap
apiVersion: v1
metadata:
  labels:
    app: prometheus
  name: prometheus-config
  namespace: monitor
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      scrape_timeout: 10s
      evaluation_interval: 1m
    scrape_configs:
    # Node Exporter: rewrite the kubelet port (10250) to the exporter port (9100)
    - job_name: 'kubernetes-node'
      kubernetes_sd_configs:
      - role: node
      relabel_configs:
      - source_labels: [__address__]
        regex: '(.*):10250'
        replacement: '${1}:9100'
        target_label: __address__
        action: replace
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
    # cAdvisor metrics exposed by the kubelet, reached through the API server proxy
    - job_name: 'kubernetes-node-cadvisor'
      kubernetes_sd_configs:
      - role: node
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - target_label: __address__
        replacement: kubernetes.default.svc:443
      - source_labels: [__meta_kubernetes_node_name]
        regex: (.+)
        target_label: __metrics_path__
        replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
    - job_name: 'kubernetes-apiserver'
      kubernetes_sd_configs:
      - role: endpoints
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      relabel_configs:
      - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
        action: keep
        regex: default;kubernetes;https
    # Any Service annotated with prometheus.io/scrape: "true"
    - job_name: 'kubernetes-service-endpoints'
      kubernetes_sd_configs:
      - role: endpoints
      relabel_configs:
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
        action: replace
        target_label: __scheme__
        regex: (https?)
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
      - action: labelmap
        regex: __meta_kubernetes_service_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_service_name]
        action: replace
        target_label: kubernetes_service_name
kubectl apply -f case3-1-prometheus-cfg.yaml
Prepare the data storage directory (on the k8s01 node). The owner 65534 is the nobody user that the Prometheus image runs as:
mkdir -p /data/prometheusdata
chmod 777 /data/prometheusdata/
chown -R 65534:65534 /data/prometheusdata/
Create the Deployment case3-2-prometheus-deployment.yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus-server
  namespace: monitor
  labels:
    app: prometheus
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
      component: server
  template:
    metadata:
      labels:
        app: prometheus
        component: server
      annotations:
        prometheus.io/scrape: 'false'
    spec:
      # Pin the Pod to k8s01, where the hostPath data directory was created
      nodeName: k8s01
      serviceAccountName: monitor
      containers:
      - name: prometheus
        image: prom/prometheus:v2.31.2
        imagePullPolicy: IfNotPresent
        command:
        - prometheus
        - --config.file=/etc/prometheus/prometheus.yml
        - --storage.tsdb.path=/prometheus
        # Keep 30 days of data (--storage.tsdb.retention is the deprecated spelling)
        - --storage.tsdb.retention.time=720h
        ports:
        - containerPort: 9090
          protocol: TCP
        volumeMounts:
        - mountPath: /etc/prometheus/prometheus.yml
          name: prometheus-config
          subPath: prometheus.yml
        - mountPath: /prometheus/
          name: prometheus-storage-volume
      volumes:
      - name: prometheus-config
        configMap:
          name: prometheus-config
          items:
          - key: prometheus.yml
            path: prometheus.yml
            mode: 0644
      - name: prometheus-storage-volume
        hostPath:
          path: /data/prometheusdata
          type: Directory
Create the ServiceAccount and ClusterRoleBinding, then apply the Deployment. A ClusterRoleBinding is cluster-scoped, so no namespace flag is needed when creating it:
kubectl create serviceaccount monitor -n monitor
kubectl create clusterrolebinding monitor-clusterrolebinding --clusterrole=cluster-admin --serviceaccount=monitor:monitor
kubectl apply -f case3-2-prometheus-deployment.yaml
Create the Service manifest case3-3-prometheus-svc.yaml and apply it:
apiVersion: v1
kind: Service
metadata:
  name: prometheus
  namespace: monitor
  labels:
    app: prometheus
spec:
  type: NodePort
  ports:
  - port: 9090
    targetPort: 9090
    nodePort: 30090
    protocol: TCP
  selector:
    app: prometheus
    component: server
kubectl apply -f case3-3-prometheus-svc.yaml
kubectl get svc -n monitor
Create the RBAC manifest case4-prom-rbac.yaml, which contains a ServiceAccount, Secret, ClusterRole, and ClusterRoleBinding:
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
  namespace: monitor
---
apiVersion: v1
kind: Secret
metadata:
  name: monitor-token
  namespace: monitor
  annotations:
    kubernetes.io/service-account.name: "prometheus"
type: kubernetes.io/service-account-token
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus
rules:
- apiGroups:
  - ""
  resources:
  - nodes
  - services
  - endpoints
  - pods
  - nodes/proxy
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - "extensions"
  # Ingress moved to networking.k8s.io; extensions was removed in Kubernetes 1.22
  - "networking.k8s.io"
  resources:
  - ingresses
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - configmaps
  - nodes/metrics
  verbs:
  - get
- nonResourceURLs:
  - /metrics
  verbs:
  - get
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
- kind: ServiceAccount
  name: prometheus
  namespace: monitor
kubectl apply -f case4-prom-rbac.yaml
Note: the Prometheus Deployment created earlier runs under the monitor ServiceAccount, which is bound to cluster-admin. To run Prometheus with the narrower permissions defined here, set serviceAccountName: prometheus in that Deployment.
Create the combined manifest case5-kube-state-metrics-deploy.yaml, which contains a Deployment, ServiceAccount, ClusterRole, ClusterRoleBinding, and Service:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kube-state-metrics
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: kube-state-metrics
  template:
    metadata:
      labels:
        app: kube-state-metrics
    spec:
      serviceAccountName: kube-state-metrics
      containers:
      - name: kube-state-metrics
        image: registry.cn-beijing.ksyuncs.com/zbl/kube-state-metrics:v2.6.0
        ports:
        - containerPort: 8080
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: kube-state-metrics
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kube-state-metrics
rules:
- apiGroups: [""]
  resources: ["nodes", "pods", "services", "resourcequotas", "replicationcontrollers", "limitranges", "persistentvolumeclaims", "persistentvolumes", "namespaces", "endpoints"]
  verbs: ["list", "watch"]
# kube-state-metrics v2.x reads workloads from the apps API group, not extensions
- apiGroups: ["apps"]
  resources: ["daemonsets", "deployments", "replicasets", "statefulsets"]
  verbs: ["list", "watch"]
- apiGroups: ["batch"]
  resources: ["cronjobs", "jobs"]
  verbs: ["list", "watch"]
- apiGroups: ["autoscaling"]
  resources: ["horizontalpodautoscalers"]
  verbs: ["list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: kube-state-metrics
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kube-state-metrics
subjects:
- kind: ServiceAccount
  name: kube-state-metrics
  namespace: kube-system
---
apiVersion: v1
kind: Service
metadata:
  annotations:
    prometheus.io/scrape: 'true'
  name: kube-state-metrics
  namespace: kube-system
  labels:
    app: kube-state-metrics
spec:
  type: NodePort
  ports:
  - name: kube-state-metrics
    port: 8080
    targetPort: 8080
    nodePort: 31666
    protocol: TCP
  selector:
    app: kube-state-metrics
kubectl apply -f case5-kube-state-metrics-deploy.yaml
Verify the deployment. ClusterRole and ClusterRoleBinding are cluster-scoped, so only the ServiceAccount lookup needs a namespace:
kubectl get sa -n kube-system | grep kube-state-metrics
kubectl get clusterrole | grep kube-state-metrics
kubectl get clusterrolebinding | grep kube-state-metrics
Deploy Grafana. Create grafana-enterprise.yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: grafana-enterprise
  namespace: monitor
spec:
  replicas: 1
  selector:
    matchLabels:
      app: grafana-enterprise
  template:
    metadata:
      labels:
        app: grafana-enterprise
    spec:
      containers:
      - image: grafana/grafana
        imagePullPolicy: Always
        securityContext:
          allowPrivilegeEscalation: false
          runAsUser: 0
        name: grafana
        ports:
        - containerPort: 3000
          protocol: TCP
        volumeMounts:
        - mountPath: "/var/lib/grafana"
          name: data
        resources:
          requests:
            cpu: 100m
            memory: 100Mi
          limits:
            cpu: 500m
            memory: 2500Mi
      volumes:
      # emptyDir: dashboards and settings are lost when the Pod is rescheduled
      - name: data
        emptyDir: {}
---
apiVersion: v1
kind: Service
metadata:
  name: grafana
  namespace: monitor
spec:
  type: NodePort
  ports:
  - port: 80
    targetPort: 3000
    nodePort: 31000
  selector:
    app: grafana-enterprise
kubectl apply -f grafana-enterprise.yaml
Access Grafana: log in at http://<node IP>:31000. This manifest does not set an admin password, so Grafana's built-in defaults apply: username admin, password admin, with a prompt to change the password on first login.
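Add the Prometheus data source: go to Configuration > Data Sources > Add data source, select Prometheus, name it prometheus, and enter the address of the Prometheus Service.
Note: the URL can use a cluster-internal address or an external EIP. From outside the cluster, the Prometheus Service is reached on NodePort 30090; inside the cluster, the Service port is 9090.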
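Because Grafana runs inside the cluster, the data source URL can be built from the Service name and namespace rather than from a node IP. The helper below is a hypothetical sketch. It assumes the cluster DNS domain is cluster.local, which is the common default but may differ in a given cluster.

```shell
# svc_url NAME NAMESPACE PORT [DOMAIN]: prints the in-cluster HTTP URL of a Service.
# DOMAIN defaults to cluster.local (an assumption; check your cluster's DNS domain).
svc_url() {
    printf 'http://%s.%s.svc.%s:%s\n' "$1" "$2" "${4:-cluster.local}" "$3"
}

# The Prometheus Service defined in this guide: name prometheus, namespace monitor, port 9090.
svc_url prometheus monitor 9090
```

Note that the port here is the Service port (9090), not the NodePort (30090), which applies only when connecting through a node address.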
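Import the monitoring dashboards:
Click + > Import.
Enter a dashboard ID: Node Exporter Full (11074), Kubernetes Cluster (6417), or Kubernetes Pods (6336).
For each dashboard, select the Prometheus data source created in the previous step.
Click Import to finish.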
Create the test Deployment nginx01.yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx01
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nginx01
  template:
    metadata:
      labels:
        app: nginx01
    spec:
      containers:
      - name: nginx
        image: nginx:1.7.9
kubectl apply -f nginx01.yaml
Verify the Deployment status:
kubectl get deployments.apps
# Expected output
NAME      READY   UP-TO-DATE   AVAILABLE   AGE
nginx01   2/2     2            2           55s
In the Grafana dashboards, look up the container resource metrics for nginx01 and confirm that the data is displayed correctly.
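The same check can be made against the Prometheus HTTP API instead of a dashboard. The sketch below only builds a PromQL expression; pod_memory_query is a hypothetical helper. It assumes the cAdvisor series carry namespace and pod labels, which is the case for kubelet-exposed cAdvisor metrics on recent Kubernetes versions (older versions used different label names).

```shell
# pod_memory_query PREFIX [NAMESPACE]: prints a PromQL expression that sums
# working-set memory per pod, for pods whose names start with "PREFIX-".
# NAMESPACE defaults to "default", where nginx01 is created in this guide.
pod_memory_query() {
    printf 'sum by (pod) (container_memory_working_set_bytes{namespace="%s",pod=~"%s-.*"})\n' \
        "${2:-default}" "$1"
}

pod_memory_query nginx01
```

The expression can be pasted into the Prometheus web UI, or sent to the query endpoint with curl -sG "http://<node IP>:30090/api/v1/query" --data-urlencode "query=$(pod_memory_query nginx01)". A non-empty result with one series per nginx01 Pod indicates that container metrics are being collected.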
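Troubleshooting:
cAdvisor Pod fails to start: check the Pod logs with kubectl logs -n monitor -l app=cAdvisor (label values are case-sensitive, so the selector must match the DaemonSet label exactly). A common cause is an incompatible kernel version; try an older image version.
Prometheus Target status is DOWN: check, in order, the Prometheus configuration (kubectl describe configmap -n monitor prometheus-config) and the Service endpoints (kubectl get endpoints -n monitor), and confirm that the Pod labels match the Service selector.
Grafana dashboard shows no data:
Confirm that the data source URL is correct (http://prometheus.monitor.svc.cluster.local:9090 is recommended).
Wait about 5 minutes and refresh the page, so that Prometheus has had time to collect data.
Set the time range to Last 5 minutes.
Check that the dashboard variables (instance/node) match the actual environment.
Image pull fails: in mainland China, the images can be replaced with the following mirrors: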
Component | Alternative image |
cAdvisor | |
Node Exporter | |
Prometheus | |
Grafana | |
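When a target is DOWN (see the Prometheus Target item above), it can help to replay the address rewriting from the scrape configuration outside Prometheus. The sketch below imitates two relabel rules from prometheus.yml with sed. The function names are hypothetical, and sed only approximates Prometheus: its regex dialect lacks (?:...) and \d, so the second pattern is respelled.

```shell
# Prometheus anchors relabel regexes at both ends, hence the explicit ^...$ here.

# node_exporter_addr: mirrors regex '(.*):10250' with replacement '${1}:9100'
# from the kubernetes-node job. Input that does not match is printed unchanged.
node_exporter_addr() {
    printf '%s\n' "$1" | sed -E 's/^(.*):10250$/\1:9100/'
}

# annotated_addr: takes "<address>;<prometheus.io/port value>", the joined source
# labels of the kubernetes-service-endpoints job, and prints "<host>:<port>".
# Caveat: on a non-match this helper echoes its input, whereas Prometheus would
# simply leave __address__ as it was.
annotated_addr() {
    printf '%s\n' "$1" | sed -E 's/^([^:]+)(:[0-9]+)?;([0-9]+)$/\1:\3/'
}

node_exporter_addr "10.0.5.43:10250"    # kubelet address of k8s01
annotated_addr "10.0.5.74:8080;9100"    # illustrative endpoint and annotation
```

If the rewritten address is not the one the exporter actually listens on, the relabel rule or the Service annotation is the likely culprit rather than the exporter itself.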
