Kingsoft Cloud Documentation Center: Practical Tutorial for Deploying Open-Source Monitoring on a K8S Cluster


Last updated: 2026-06-22 15:20:50



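Background

This tutorial deploys a monitoring system on a Kubernetes cluster. Its core components are cAdvisor (container resource collection), Node Exporter (node metrics collection), Prometheus (metrics aggregation and storage), and Grafana (visualization). DaemonSets keep the collectors running on every node, Services expose their ports, RBAC grants the required permissions, and a test Deployment verifies the monitoring result.

The role and deployment method of each component are as follows:

cAdvisor: collects container resource metrics; deployed as a DaemonSet.
Node Exporter: collects node metrics; deployed as a DaemonSet with a Service.
Prometheus: aggregates and stores metrics; deployed as a ConfigMap, a Deployment, and a Service.
kube-state-metrics: collects cluster state metrics; deployed as a Deployment with a Service.
Grafana: visualizes the metrics stored in Prometheus; deployed as a Deployment with a Service.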

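Constraints

The Kubernetes cluster version must be 1.10 or later.
The kubectl command-line tool must be configured in advance.
The cluster must be able to pull external images, or the required images must be prepared in advance.
All deployment operations are performed on node k8s01 (10.0.5.43/24).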

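Preparation

This practice is based on a three-node cluster with the following nodes:

k8s01: 10.0.5.43/24 (node used for deployment operations)
k8s02: 10.0.5.74/24
k8s03: 10.0.5.29/24

Create the namespace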

kubectl create ns monitor

Verify:

kubectl get ns | grep monitor

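Pull the cAdvisor image

Because the official image is hosted in Google's container registry, an image source reachable from mainland China is needed. This practice uses cadvisordocker/cadvisor:v0.37.0.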

docker pull cadvisordocker/cadvisor:v0.37.0

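Create the working directory

Note: This practice uses many configuration files, so create a dedicated directory to manage them in one place.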

mkdir -p /opt/cadvisor_prome_gra
cd /opt/cadvisor_prome_gra

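Workflow

Deploy cAdvisor (DaemonSet) to collect container resource metrics
Deploy Node Exporter (DaemonSet + Service) to collect node metrics
Deploy Prometheus (ConfigMap + Deployment + Service) to aggregate and store metrics
Configure RBAC permissions (ServiceAccount, ClusterRole, ClusterRoleBinding)
Deploy kube-state-metrics to collect cluster state metrics
Deploy Grafana, connect it to the Prometheus data source, and import dashboards
Create a test Deployment to verify the monitoring result

Procedure

Step 1: Deploy cAdvisor

Create the DaemonSet configuration file case1-daemonset-deploy-cadvisor.yaml: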

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: cadvisor
  namespace: monitor
spec:
  selector:
    matchLabels:
      app: cAdvisor
  template:
    metadata:
      labels:
        app: cAdvisor
    spec:
      tolerations:
        - effect: NoSchedule
          key: node-role.kubernetes.io/master
      hostNetwork: true
      restartPolicy: Always
      containers:
        - name: cadvisor
          image: cadvisordocker/cadvisor:v0.37.0
          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: 8080
          volumeMounts:
            - name: root
              mountPath: /rootfs
            - name: run
              mountPath: /var/run
            - name: sys
              mountPath: /sys
            - name: docker
              mountPath: /var/lib/containerd
      volumes:
        - name: root
          hostPath:
            path: /
        - name: run
          hostPath:
            path: /var/run
        - name: sys
          hostPath:
            path: /sys
        - name: docker
          hostPath:
            path: /var/lib/containerd

Apply the configuration and verify:

kubectl apply -f case1-daemonset-deploy-cadvisor.yaml
kubectl get pod -n monitor -o wide

Verification result: the cluster has 3 nodes, so 3 Pods are expected, all in the Running state.

kubectl get pod -n monitor
# Expected output
NAME            READY   STATUS    RESTARTS   AGE
cadvisor-79g2l  1/1     Running   0          15m
cadvisor-q2qdz  1/1     Running   0          15m
cadvisor-sdbww  1/1     Running   0          15m

Test cAdvisor data collection: open <node IP>:8080 in a browser.

Note: The page loads slowly the first time it is opened; please wait.
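
The Pod check above can also be scripted. The following is a minimal sketch, not part of the original procedure: the helper names count_running and all_running are hypothetical, and it assumes the default column layout of kubectl get pod, where STATUS is the third column.

```shell
# Count pods whose STATUS column is "Running".
# Reads `kubectl get pod` output (including the header row) from stdin.
count_running() {
  # NR > 1 skips the header row; STATUS is the third column.
  awk 'NR > 1 && $3 == "Running" { n++ } END { print n + 0 }'
}

# Succeed only when the number of Running pods equals the expected count.
# $1 = expected count (3 in this practice, one cAdvisor Pod per node).
all_running() {
  local expected="$1" actual
  actual="$(count_running)"
  [ "$actual" -eq "$expected" ]
}
```

A possible invocation on k8s01 is kubectl get pod -n monitor -l app=cAdvisor | all_running 3 && echo "cAdvisor is running on every node". The label value is case-sensitive and must match the DaemonSet manifest.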

Step 2: Deploy Node Exporter

Create the DaemonSet and Service configuration file case2-daemonset-deploy-node-exporter.yaml:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: monitor
  labels:
    k8s-app: node-exporter
spec:
  selector:
    matchLabels:
      k8s-app: node-exporter
  template:
    metadata:
      labels:
        k8s-app: node-exporter
    spec:
      tolerations:
        - effect: NoSchedule
          key: node-role.kubernetes.io/master
      containers:
        - image: prom/node-exporter:v1.3.1
          imagePullPolicy: IfNotPresent
          name: prometheus-node-exporter
          ports:
            - containerPort: 9100
              hostPort: 9100
              protocol: TCP
              name: metrics
          volumeMounts:
            - mountPath: /host/proc
              name: proc
            - mountPath: /host/sys
              name: sys
            - mountPath: /host
              name: rootfs
          args:
            - --path.procfs=/host/proc
            - --path.sysfs=/host/sys
            - --path.rootfs=/host
      volumes:
        - name: proc
          hostPath:
            path: /proc
        - name: sys
          hostPath:
            path: /sys
        - name: rootfs
          hostPath:
            path: /
      hostNetwork: true
      hostPID: true
---
apiVersion: v1
kind: Service
metadata:
  annotations:
    prometheus.io/scrape: "true"
  labels:
    k8s-app: node-exporter
  name: node-exporter
  namespace: monitor
spec:
  type: NodePort
  ports:
    - name: http
      port: 9100
      nodePort: 30000
      protocol: TCP
  selector:
    k8s-app: node-exporter

Apply the configuration and verify data collection:

kubectl apply -f case2-daemonset-deploy-node-exporter.yaml
kubectl get pod -n monitor

Open <node IP>:9100 in a browser and click Metrics to view the metric data.
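
The same check can be done from the command line. The sketch below is an illustration only: the helper name metric_value is hypothetical, and it assumes samples carry no trailing timestamp (so the value is the last field), which is the usual shape of Node Exporter output.

```shell
# Print the value of the first sample of metric $1.
# Reads Prometheus text-format metrics from stdin.
metric_value() {
  awk -v name="$1" '
    /^#/ { next }                    # skip HELP and TYPE comment lines
    {
      key = $1
      sub(/\{.*/, "", key)           # drop the label set, if any
      if (key == name && !found) {   # report only the first matching sample
        print $NF
        found = 1
      }
    }'
}
```

For example, curl -s http://10.0.5.43:9100/metrics | metric_value node_load1 should print the 1-minute load average of k8s01 if Node Exporter is serving metrics.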

Step 3: Deploy Prometheus

Create the ConfigMap case3-1-prometheus-cfg.yaml to configure the scrape jobs:

kind: ConfigMap
apiVersion: v1
metadata:
  labels:
    app: prometheus
  name: prometheus-config
  namespace: monitor
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      scrape_timeout: 10s
      evaluation_interval: 1m
    scrape_configs:
      - job_name: 'kubernetes-node'
        kubernetes_sd_configs:
          - role: node
        relabel_configs:
          - source_labels: [__address__]
            regex: '(.*):10250'
            replacement: '${1}:9100'
            target_label: __address__
            action: replace
          - action: labelmap
            regex: __meta_kubernetes_node_label_(.+)
      - job_name: 'kubernetes-node-cadvisor'
        kubernetes_sd_configs:
          - role: node
        scheme: https
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        relabel_configs:
          - action: labelmap
            regex: __meta_kubernetes_node_label_(.+)
          - target_label: __address__
            replacement: kubernetes.default.svc:443
          - source_labels: [__meta_kubernetes_node_name]
            regex: (.+)
            target_label: __metrics_path__
            replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
      - job_name: 'kubernetes-apiserver'
        kubernetes_sd_configs:
          - role: endpoints
        scheme: https
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        relabel_configs:
          - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
            action: keep
            regex: default;kubernetes;https
      - job_name: 'kubernetes-service-endpoints'
        kubernetes_sd_configs:
          - role: endpoints
        relabel_configs:
          - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
            action: keep
            regex: true
          - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
            action: replace
            target_label: __scheme__
            regex: (https?)
          - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
            action: replace
            target_label: __metrics_path__
            regex: (.+)
          - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
            action: replace
            target_label: __address__
            regex: ([^:]+)(?::\d+)?;(\d+)
            replacement: $1:$2
          - action: labelmap
            regex: __meta_kubernetes_service_label_(.+)
          - source_labels: [__meta_kubernetes_namespace]
            action: replace
            target_label: kubernetes_namespace
          - source_labels: [__meta_kubernetes_service_name]
            action: replace
            target_label: kubernetes_service_name

kubectl apply -f case3-1-prometheus-cfg.yaml
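
The two address-rewriting relabel rules in this ConfigMap can be tried out locally before Prometheus is running. The sketch below emulates them with sed and is only an approximation: the function names are hypothetical, and because sed -E has no \d or (?:...) syntax, the second pattern is rewritten with equivalent POSIX constructs. Prometheus anchors relabel regexes at both ends, which the explicit ^ and $ reproduce.

```shell
# Emulate the kubernetes-node rule: the discovered kubelet address
# (port 10250) is rewritten to the Node Exporter port 9100.
rewrite_node_address() {
  sed -E 's/^(.*):10250$/\1:9100/'
}

# Emulate the kubernetes-service-endpoints rule that swaps in the port from
# the prometheus.io/port annotation. The input is "<address>;<port>", that is,
# the two source labels joined by the default ";" separator.
rewrite_annotated_port() {
  sed -E 's/^([^:]+)(:[0-9]+)?;([0-9]+)$/\1:\3/'
}
```

For instance, echo '10.0.5.43:10250' | rewrite_node_address prints 10.0.5.43:9100, which is the address the kubernetes-node job ends up scraping on k8s01.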

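Prepare the data storage directory (on node k8s01). The prom/prometheus image runs as user nobody (UID 65534), so the directory must be writable by that user:

mkdir -p /data/prometheusdata
chmod 777 /data/prometheusdata/
chown -R 65534:65534 /data/prometheusdata/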

Create the Deployment case3-2-prometheus-deployment.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus-server
  namespace: monitor
  labels:
    app: prometheus
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
      component: server
  template:
    metadata:
      labels:
        app: prometheus
        component: server
      annotations:
        prometheus.io/scrape: 'false'
    spec:
      nodeName: k8s01
      serviceAccountName: monitor
      containers:
        - name: prometheus
          image: prom/prometheus:v2.31.2
          imagePullPolicy: IfNotPresent
          command:
            - prometheus
            - --config.file=/etc/prometheus/prometheus.yml
            - --storage.tsdb.path=/prometheus
            - --storage.tsdb.retention.time=720h
          ports:
            - containerPort: 9090
              protocol: TCP
          volumeMounts:
            - mountPath: /etc/prometheus/prometheus.yml
              name: prometheus-config
              subPath: prometheus.yml
            - mountPath: /prometheus/
              name: prometheus-storage-volume
      volumes:
        - name: prometheus-config
          configMap:
            name: prometheus-config
            items:
              - key: prometheus.yml
                path: prometheus.yml
                mode: 0644
        - name: prometheus-storage-volume
          hostPath:
            path: /data/prometheusdata
            type: Directory

Create the ServiceAccount and ClusterRoleBinding, then apply the Deployment:

kubectl create serviceaccount monitor -n monitor
kubectl create clusterrolebinding monitor-clusterrolebinding --clusterrole=cluster-admin --serviceaccount=monitor:monitor
kubectl apply -f case3-2-prometheus-deployment.yaml

Create the Service configuration file case3-3-prometheus-svc.yaml and apply it:

apiVersion: v1
kind: Service
metadata:
  name: prometheus
  namespace: monitor
  labels:
    app: prometheus
spec:
  type: NodePort
  ports:
    - port: 9090
      targetPort: 9090
      nodePort: 30090
      protocol: TCP
  selector:
    app: prometheus
    component: server

kubectl apply -f case3-3-prometheus-svc.yaml
kubectl get svc -n monitor
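
Once the Service is up, target health can be checked through the Prometheus HTTP API at /api/v1/targets. The following sketch is a rough check rather than a full JSON parser: the helper name count_up_targets is hypothetical, and it assumes the API returns compact JSON in which each healthy target contains the literal text "health":"up".

```shell
# Count targets reported as healthy.
# Reads the JSON response of GET /api/v1/targets from stdin.
count_up_targets() {
  # grep -o prints every match on its own line, so wc -l yields the count.
  grep -o '"health":"up"' | wc -l | tr -d ' '
}
```

A possible invocation is curl -s http://10.0.5.43:30090/api/v1/targets | count_up_targets, using the NodePort defined in the Service above.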

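Step 4: Configure RBAC permissions

Create the RBAC configuration file case4-prom-rbac.yaml, which contains a ServiceAccount, Secret, ClusterRole, and ClusterRoleBinding. The Deployment in step 3 runs under the monitor ServiceAccount, which is bound to cluster-admin; the narrower prometheus ClusterRole defined here applies only to workloads whose serviceAccountName is set to prometheus.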

apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
  namespace: monitor
---
apiVersion: v1
kind: Secret
metadata:
  name: monitor-token
  namespace: monitor
  annotations:
    kubernetes.io/service-account.name: "prometheus"
type: kubernetes.io/service-account-token
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus
rules:
  - apiGroups:
      - ""
    resources:
      - nodes
      - services
      - endpoints
      - pods
      - nodes/proxy
    verbs:
      - get
      - list
      - watch
  - apiGroups:
      - "extensions"
    resources:
      - ingresses
    verbs:
      - get
      - list
      - watch
  - apiGroups:
      - ""
    resources:
      - configmaps
      - nodes/metrics
    verbs:
      - get
  - nonResourceURLs:
      - /metrics
    verbs:
      - get
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
  - kind: ServiceAccount
    name: prometheus
    namespace: monitor

kubectl apply -f case4-prom-rbac.yaml
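
To confirm that the binding grants what Prometheus needs, the permissions of the ServiceAccount can be queried with kubectl auth can-i and impersonation. This is a sketch under assumptions: the helper names sa_user and sa_can_i are hypothetical, and the current kubectl context must itself be allowed to impersonate users.

```shell
# Build the user name Kubernetes assigns to a ServiceAccount:
# system:serviceaccount:<namespace>:<name>.
# $1 = namespace, $2 = ServiceAccount name.
sa_user() {
  printf 'system:serviceaccount:%s:%s\n' "$1" "$2"
}

# Ask the API server whether the ServiceAccount may perform a verb on a resource.
# $1 = namespace, $2 = ServiceAccount name, $3 = verb, $4 = resource.
sa_can_i() {
  kubectl auth can-i "$3" "$4" --as="$(sa_user "$1" "$2")"
}
```

For example, sa_can_i monitor prometheus list nodes should print yes after the manifest is applied, while a verb outside the ClusterRole, such as delete, should print no.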

Step 5: Deploy kube-state-metrics

Create the combined configuration file case5-kube-state-metrics-deploy.yaml, which contains a Deployment, ServiceAccount, ClusterRole, ClusterRoleBinding, and Service.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: kube-state-metrics
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: kube-state-metrics
  template:
    metadata:
      labels:
        app: kube-state-metrics
    spec:
      serviceAccountName: kube-state-metrics
      containers:
        - name: kube-state-metrics
          image: registry.cn-beijing.ksyuncs.com/zbl/kube-state-metrics:v2.6.0
          ports:
            - containerPort: 8080
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: kube-state-metrics
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kube-state-metrics
rules:
  - apiGroups: [""]
    resources: ["nodes", "pods", "services", "resourcequotas", "replicationcontrollers", "limitranges", "persistentvolumeclaims", "persistentvolumes", "namespaces", "endpoints"]
    verbs: ["list", "watch"]
  - apiGroups: ["extensions", "apps"]
    resources: ["daemonsets", "deployments", "replicasets"]
    verbs: ["list", "watch"]
  - apiGroups: ["apps"]
    resources: ["statefulsets"]
    verbs: ["list", "watch"]
  - apiGroups: ["batch"]
    resources: ["cronjobs", "jobs"]
    verbs: ["list", "watch"]
  - apiGroups: ["autoscaling"]
    resources: ["horizontalpodautoscalers"]
    verbs: ["list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: kube-state-metrics
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kube-state-metrics
subjects:
  - kind: ServiceAccount
    name: kube-state-metrics
    namespace: kube-system
---
apiVersion: v1
kind: Service
metadata:
  annotations:
    prometheus.io/scrape: 'true'
  name: kube-state-metrics
  namespace: kube-system
  labels:
    app: kube-state-metrics
spec:
  type: NodePort
  ports:
    - name: kube-state-metrics
      port: 8080
      targetPort: 8080
      nodePort: 31666
      protocol: TCP
  selector:
    app: kube-state-metrics

kubectl apply -f case5-kube-state-metrics-deploy.yaml

Verify the deployment:

kubectl get sa -n kube-system | grep kube-state-metrics
kubectl get clusterrole | grep kube-state-metrics
kubectl get clusterrolebinding | grep kube-state-metrics

Step 6: Deploy Grafana and configure the data source and dashboards

To deploy Grafana, create grafana-enterprise.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: grafana-enterprise
  namespace: monitor
spec:
  replicas: 1
  selector:
    matchLabels:
      app: grafana-enterprise
  template:
    metadata:
      labels:
        app: grafana-enterprise
    spec:
      containers:
        - image: grafana/grafana
          imagePullPolicy: Always
          securityContext:
            allowPrivilegeEscalation: false
            runAsUser: 0
          name: grafana
          ports:
            - containerPort: 3000
              protocol: TCP
          volumeMounts:
            - mountPath: "/var/lib/grafana"
              name: data
          resources:
            requests:
              cpu: 100m
              memory: 100Mi
            limits:
              cpu: 500m
              memory: 2500Mi
      volumes:
        - name: data
          emptyDir: {}
---
apiVersion: v1
kind: Service
metadata:
  name: grafana
  namespace: monitor
spec:
  type: NodePort
  ports:
    - port: 80
      targetPort: 3000
      nodePort: 31000
  selector:
    app: grafana-enterprise

kubectl apply -f grafana-enterprise.yaml

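Access Grafana: log in at http://<node IP>:31000. Because the manifest above does not set an admin password, the default credentials are username admin and password admin, and Grafana prompts for a new password on first login.

Add the Prometheus data source: go to Configuration > Data Sources > Add data source, select Prometheus, name it prometheus, and enter the address of the Prometheus Service.

Note: From outside the cluster, the URL can use a node's internal IP or an external EIP together with NodePort 30090. From inside the cluster, the Service listens on port 9090, for example http://prometheus.monitor.svc.cluster.local:9090.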
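
The data source can also be created through Grafana's HTTP API (POST /api/datasources) instead of the UI. The sketch below only builds the request body: the helper name datasource_payload is hypothetical, and it assumes that neither the name nor the URL contains characters that need JSON escaping.

```shell
# Print the JSON body for Grafana's POST /api/datasources endpoint.
# $1 = data source name, $2 = Prometheus URL as seen from the Grafana Pod.
datasource_payload() {
  printf '{"name":"%s","type":"prometheus","access":"proxy","url":"%s","isDefault":true}\n' "$1" "$2"
}
```

One way to send it is curl -s -u admin:<password> -H 'Content-Type: application/json' -X POST -d "$(datasource_payload prometheus http://prometheus.monitor.svc.cluster.local:9090)" http://<node IP>:31000/api/datasources, where <password> is the current admin password.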

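Import the monitoring dashboards:
- Click + > Import.
- Enter a dashboard ID: Node Exporter Full (11074), Kubernetes Cluster (6417), or Kubernetes Pods (6336).
- For each dashboard, select the Prometheus data source created in the previous step.
- Click Import to finish.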

Step 7: Test the monitoring result

Create the test Deployment nginx01.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx01
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nginx01
  template:
    metadata:
      labels:
        app: nginx01
    spec:
      containers:
        - name: nginx
          image: nginx:1.7.9

kubectl apply -f nginx01.yaml

Verify the Deployment status:

kubectl get deployments.apps
# Expected output
NAME      READY   UP-TO-DATE   AVAILABLE   AGE
nginx01   2/2     2            2           55s

In the Grafana dashboards, view the container resource monitoring data for nginx01 and confirm that it is displayed correctly.
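
Before looking for nginx01 in Grafana, it can help to confirm from a script that the Deployment is fully ready. The following sketch is illustrative only: the helper name deployment_ready is hypothetical, and it assumes the default kubectl get deployments.apps layout, in which READY is the second column and has the form ready/desired.

```shell
# Succeed when Deployment $1 reports all desired replicas as ready.
# Reads `kubectl get deployments.apps` output from stdin.
deployment_ready() {
  awk -v name="$1" '
    $1 == name {
      split($2, r, "/")              # READY column looks like "2/2"
      ok = (r[1] == r[2] && r[2] > 0)
    }
    END { exit (ok ? 0 : 1) }'       # non-zero when missing or not ready
}
```

A possible invocation is kubectl get deployments.apps | deployment_ready nginx01 && echo "nginx01 is ready".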

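Appendix: Troubleshooting

cAdvisor Pod fails to start: check the Pod logs with kubectl logs -n monitor -l app=cAdvisor (the label value is case-sensitive and must match the DaemonSet). A common cause is an incompatible kernel version; try downgrading the image version.
Prometheus target is DOWN: check, in order, the Prometheus configuration with kubectl describe configmap -n monitor prometheus-config and the Service endpoints with kubectl get endpoints -n monitor, then confirm that the Pod labels match the Service selector.
Grafana dashboard shows no data:
- Confirm that the data source URL is correct (http://prometheus.monitor.svc.cluster.local:9090 is recommended).
- Wait about 5 minutes and refresh the page so that Prometheus has had time to collect data.
- Set the time range to Last 5 minutes.
- Check that the dashboard variables (instance/node) match the actual environment.
Image pull fails: in mainland China environments, the following alternative images can be used: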

Component	Alternative image
cAdvisor	`docker.io/google/cadvisor:v0.47.2`
Node Exporter	`quay.io/prometheus/node-exporter:v1.7.0`
Prometheus	`docker.io/prom/prometheus:v2.52.0`
Grafana	`docker.io/grafana/grafana:10.4.0`
