
Hands-On Tutorial: Deploying Open-Source Monitoring on a K8s Cluster

Last updated: 2026-06-22 15:20:50

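Background

This tutorial deploys a monitoring system on a Kubernetes cluster. The core components are cAdvisor (container resource metrics), Node Exporter (node metrics), kube-state-metrics (cluster object state metrics), Prometheus (metric aggregation and storage), and Grafana (visualization). DaemonSets keep the collectors running on every node, Services expose their ports, and RBAC grants the required API permissions. A test Deployment is created at the end to verify that monitoring works.

Each component is deployed as follows:

  • cAdvisor: DaemonSet

  • Node Exporter: DaemonSet + Service

  • Prometheus: ConfigMap + Deployment + Service

  • kube-state-metrics: Deployment + Service

  • Grafana: Deployment + Service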

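Constraints

  • The Kubernetes cluster must be version 1.10 or later.

  • kubectl must be installed and configured in advance.

  • The cluster must be able to pull images from external registries, or the required images must be prepared in advance.

  • All deployment operations are performed on node k8s01 (10.0.5.43/24).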

Preparation

This tutorial uses a three-node cluster:

  • k8s01: 10.0.5.43/24 (node where deployment operations are run)

  • k8s02: 10.0.5.74/24

  • k8s03: 10.0.5.29/24

Create a namespace

kubectl create ns monitor

Verify:

kubectl get ns | grep monitor

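Pull the cAdvisor image

The official image is hosted on Google's container registry, so this tutorial uses an image that can be pulled from mainland China: cadvisordocker/cadvisor:v0.37.0.

docker pull cadvisordocker/cadvisor:v0.37.0

If the nodes use containerd as the container runtime (the cAdvisor manifest below mounts /var/lib/containerd), pull the image with crictl pull cadvisordocker/cadvisor:v0.37.0 instead, because images pulled by docker are not visible to containerd.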

Create a working directory

Note: This tutorial uses many configuration files, so keep them together in a dedicated directory.

mkdir -p /opt/cadvisor_prome_gra
cd /opt/cadvisor_prome_gra

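Workflow

  1. Deploy cAdvisor (DaemonSet) to collect container resource metrics.

  2. Deploy Node Exporter (DaemonSet + Service) to collect node metrics.

  3. Deploy Prometheus (ConfigMap + Deployment + Service) to aggregate and store metrics.

  4. Configure RBAC permissions (ServiceAccount, ClusterRole, ClusterRoleBinding).

  5. Deploy kube-state-metrics to collect cluster state metrics.

  6. Deploy Grafana, connect it to the Prometheus data source, and import dashboards.

  7. Create a test Deployment to verify monitoring.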

Procedure

Step 1: Deploy cAdvisor

  1. Create the DaemonSet configuration file case1-daemonset-deploy-cadvisor.yaml:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: cadvisor
  namespace: monitor
spec:
  selector:
    matchLabels:
      app: cAdvisor
  template:
    metadata:
      labels:
        app: cAdvisor
    spec:
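      tolerations:
        - effect: NoSchedule
          key: node-role.kubernetes.io/master
        # Newer Kubernetes versions taint control-plane nodes with this key instead
        - effect: NoSchedule
          key: node-role.kubernetes.io/control-plane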
      hostNetwork: true
      restartPolicy: Always
      containers:
        - name: cadvisor
          image: cadvisordocker/cadvisor:v0.37.0
          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: 8080
          volumeMounts:
            - name: root
              mountPath: /rootfs
            - name: run
              mountPath: /var/run
            - name: sys
              mountPath: /sys
            - name: docker
              mountPath: /var/lib/containerd
      volumes:
        - name: root
          hostPath:
            path: /
        - name: run
          hostPath:
            path: /var/run
        - name: sys
          hostPath:
            path: /sys
        - name: docker
          hostPath:
            path: /var/lib/containerd
  2. Apply the configuration and check the Pods:

kubectl apply -f case1-daemonset-deploy-cadvisor.yaml
kubectl get pod -n monitor -o wide
  3. Verify the result. The cluster has 3 nodes, so 3 Pods are expected, all in the Running state:

kubectl get pod -n monitor
# Expected output
NAME            READY   STATUS    RESTARTS   AGE
cadvisor-79g2l  1/1     Running   0          15m
cadvisor-q2qdz  1/1     Running   0          15m
cadvisor-sdbww  1/1     Running   0          15m
  4. Test cAdvisor data collection: open http://<node IP>:8080 in a browser.

Note: The page can take a while to load the first time.
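The Pod check in item 3 can also be done programmatically, which helps when the node count grows. The following is a minimal sketch, assuming you save the output of kubectl get pod -n monitor -l app=cAdvisor -o json to a file; the function names are hypothetical, and "Running" here means the Pod phase only (it does not check container readiness).

```python
import json
from collections import Counter


def check_daemonset_pods(pod_list, expected_nodes):
    """Return a list of problems found in a `kubectl get pod -o json` result.

    pod_list: the parsed JSON object (a v1 List with an "items" array).
    expected_nodes: node names that should each run exactly one Running Pod.
    """
    problems = []
    running_per_node = Counter()
    for pod in pod_list.get("items", []):
        name = pod["metadata"]["name"]
        # Pending Pods may not have been assigned to a node yet
        node = pod.get("spec", {}).get("nodeName", "<unscheduled>")
        phase = pod.get("status", {}).get("phase", "Unknown")
        if phase == "Running":
            running_per_node[node] += 1
        else:
            problems.append(f"{name} on {node} is {phase}")
    for node in expected_nodes:
        count = running_per_node[node]
        if count != 1:
            problems.append(f"{node} has {count} Running Pods, expected 1")
    return problems


def load_pod_list(path):
    """Read a file produced by `kubectl get pod ... -o json`."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

For example, after kubectl get pod -n monitor -l app=cAdvisor -o json > pods.json, calling check_daemonset_pods(load_pod_list("pods.json"), ["k8s01", "k8s02", "k8s03"]) returns an empty list when every node runs one Running cAdvisor Pod.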

Step 2: Deploy Node Exporter

  1. Create the DaemonSet and Service configuration file case2-daemonset-deploy-node-exporter.yaml:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: monitor
  labels:
    k8s-app: node-exporter
spec:
  selector:
    matchLabels:
      k8s-app: node-exporter
  template:
    metadata:
      labels:
        k8s-app: node-exporter
    spec:
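      tolerations:
        - effect: NoSchedule
          key: node-role.kubernetes.io/master
        # Newer Kubernetes versions taint control-plane nodes with this key instead
        - effect: NoSchedule
          key: node-role.kubernetes.io/control-plane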
      containers:
        - image: prom/node-exporter:v1.3.1
          imagePullPolicy: IfNotPresent
          name: prometheus-node-exporter
          ports:
            - containerPort: 9100
              hostPort: 9100
              protocol: TCP
              name: metrics
          volumeMounts:
            - mountPath: /host/proc
              name: proc
            - mountPath: /host/sys
              name: sys
            - mountPath: /host
              name: rootfs
          args:
            - --path.procfs=/host/proc
            - --path.sysfs=/host/sys
            - --path.rootfs=/host
      volumes:
        - name: proc
          hostPath:
            path: /proc
        - name: sys
          hostPath:
            path: /sys
        - name: rootfs
          hostPath:
            path: /
      hostNetwork: true
      hostPID: true
---
apiVersion: v1
kind: Service
metadata:
  annotations:
    prometheus.io/scrape: "true"
  labels:
    k8s-app: node-exporter
  name: node-exporter
  namespace: monitor
spec:
  type: NodePort
  ports:
    - name: http
      port: 9100
      nodePort: 30000
      protocol: TCP
  selector:
    k8s-app: node-exporter
  2. Apply the configuration and verify data collection:

kubectl apply -f case2-daemonset-deploy-node-exporter.yaml
kubectl get pod -n monitor
  3. Open http://<node IP>:9100 in a browser and click Metrics to view the metric data.
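The Metrics page on port 9100 uses the Prometheus text exposition format: one sample per line as name{label="value",...} value, plus # HELP and # TYPE comment lines. The sketch below parses that format into (name, labels, value) tuples. It is a simplified illustration: it ignores timestamps, does not unescape label values, and the function names are hypothetical.

```python
import re
import urllib.request

# metric name, optional {labels}, then the sample value (timestamp ignored)
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_exposition(text):
    """Parse Prometheus text-format samples into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, raw_value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(raw_value)))
    return samples


def fetch_metrics(url):
    """Download a /metrics page, e.g. http://<node IP>:9100/metrics."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8")
```

For example, parse_exposition(fetch_metrics("http://10.0.5.43:9100/metrics")) returns every sample exposed by Node Exporter on k8s01, such as node_load1.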

Step 3: Deploy Prometheus

  1. Create the ConfigMap case3-1-prometheus-cfg.yaml, which defines the scrape jobs, and apply it:

kind: ConfigMap
apiVersion: v1
metadata:
  labels:
    app: prometheus
  name: prometheus-config
  namespace: monitor
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      scrape_timeout: 10s
      evaluation_interval: 1m
    scrape_configs:
      - job_name: 'kubernetes-node'
        kubernetes_sd_configs:
          - role: node
        relabel_configs:
          - source_labels: [__address__]
            regex: '(.*):10250'
            replacement: '${1}:9100'
            target_label: __address__
            action: replace
          - action: labelmap
            regex: __meta_kubernetes_node_label_(.+)
      - job_name: 'kubernetes-node-cadvisor'
        kubernetes_sd_configs:
          - role: node
        scheme: https
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        relabel_configs:
          - action: labelmap
            regex: __meta_kubernetes_node_label_(.+)
          - target_label: __address__
            replacement: kubernetes.default.svc:443
          - source_labels: [__meta_kubernetes_node_name]
            regex: (.+)
            target_label: __metrics_path__
            replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
      - job_name: 'kubernetes-apiserver'
        kubernetes_sd_configs:
          - role: endpoints
        scheme: https
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        relabel_configs:
          - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
            action: keep
            regex: default;kubernetes;https
      - job_name: 'kubernetes-service-endpoints'
        kubernetes_sd_configs:
          - role: endpoints
        relabel_configs:
          - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
            action: keep
            regex: true
          - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
            action: replace
            target_label: __scheme__
            regex: (https?)
          - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
            action: replace
            target_label: __metrics_path__
            regex: (.+)
          - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
            action: replace
            target_label: __address__
            regex: ([^:]+)(?::\d+)?;(\d+)
            replacement: $1:$2
          - action: labelmap
            regex: __meta_kubernetes_service_label_(.+)
          - source_labels: [__meta_kubernetes_namespace]
            action: replace
            target_label: kubernetes_namespace
          - source_labels: [__meta_kubernetes_service_name]
            action: replace
            target_label: kubernetes_service_name
kubectl apply -f case3-1-prometheus-cfg.yaml
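  2. Prepare the data directory on node k8s01 (the Deployment pins the Prometheus Pod to this node with nodeName). UID/GID 65534 is the nobody user that the prom/prometheus image runs as:

mkdir -p /data/prometheusdata
chmod 777 /data/prometheusdata/
chown -R 65534:65534 /data/prometheusdata/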
  3. Create the Deployment case3-2-prometheus-deployment.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus-server
  namespace: monitor
  labels:
    app: prometheus
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
      component: server
  template:
    metadata:
      labels:
        app: prometheus
        component: server
      annotations:
        prometheus.io/scrape: 'false'
    spec:
      nodeName: k8s01
      serviceAccountName: monitor
      containers:
        - name: prometheus
          image: prom/prometheus:v2.31.2
          imagePullPolicy: IfNotPresent
          command:
            - prometheus
            - --config.file=/etc/prometheus/prometheus.yml
            - --storage.tsdb.path=/prometheus
            - --storage.tsdb.retention.time=720h
          ports:
            - containerPort: 9090
              protocol: TCP
          volumeMounts:
            - mountPath: /etc/prometheus/prometheus.yml
              name: prometheus-config
              subPath: prometheus.yml
            - mountPath: /prometheus/
              name: prometheus-storage-volume
      volumes:
        - name: prometheus-config
          configMap:
            name: prometheus-config
            items:
              - key: prometheus.yml
                path: prometheus.yml
                mode: 0644
        - name: prometheus-storage-volume
          hostPath:
            path: /data/prometheusdata
            type: Directory
  4. Create the ServiceAccount and ClusterRoleBinding, then apply the Deployment:

kubectl create serviceaccount monitor -n monitor
kubectl create clusterrolebinding monitor-clusterrolebinding -n monitor --clusterrole=cluster-admin --serviceaccount=monitor:monitor
kubectl apply -f case3-2-prometheus-deployment.yaml
  5. Create the Service configuration file case3-3-prometheus-svc.yaml and apply it:

apiVersion: v1
kind: Service
metadata:
  name: prometheus
  namespace: monitor
  labels:
    app: prometheus
spec:
  type: NodePort
  ports:
    - port: 9090
      targetPort: 9090
      nodePort: 30090
      protocol: TCP
  selector:
    app: prometheus
    component: server
kubectl apply -f case3-3-prometheus-svc.yaml
kubectl get svc -n monitor
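The relabel_configs in case3-1-prometheus-cfg.yaml do the main routing work. The kubernetes-node job rewrites each node's discovered kubelet address (port 10250) to the Node Exporter port 9100, and the kubernetes-node-cadvisor job sends its requests to the API server, which proxies them to the kubelet's built-in cAdvisor endpoint (so this job does not scrape the cAdvisor DaemonSet on port 8080). The sketch below reproduces the replace action in Python to make these rewrites visible. It is a simplified model, assuming Prometheus's documented semantics (source label values joined with ;, the regex anchored at both ends, $1 or ${1} references in replacement); the function names and the 10.244.x.x Pod address in the usage note are hypothetical.

```python
import re


def _to_python_template(replacement):
    """Convert Prometheus-style $1 / ${1} references into Python's \\g<1>."""
    return re.sub(
        r"\$\{(\w+)\}|\$(\w+)",
        lambda m: "\\g<" + (m.group(1) or m.group(2)) + ">",
        replacement,
    )


def relabel_replace(labels, target_label, replacement="$1",
                    source_labels=(), regex="(.*)", separator=";"):
    """Apply one `action: replace` rule and return a new label dict."""
    value = separator.join(labels.get(name, "") for name in source_labels)
    match = re.fullmatch(regex, value)  # Prometheus anchors the regex
    result = dict(labels)
    if match is not None:
        result[target_label] = match.expand(_to_python_template(replacement))
    return result
```

For example, relabel_replace({"__address__": "10.0.5.74:10250"}, "__address__", replacement="${1}:9100", source_labels=["__address__"], regex="(.*):10250") sets __address__ to 10.0.5.74:9100. The same helper reproduces the kubernetes-service-endpoints port rule: an address 10.244.1.7:8080 with a prometheus.io/port annotation of 9102 becomes 10.244.1.7:9102.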

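Step 4: Configure RBAC permissions

Create the RBAC configuration file case4-prom-rbac.yaml. It contains a ServiceAccount, a Secret that holds a long-lived token for that ServiceAccount, a ClusterRole, and a ClusterRoleBinding. Note that the Deployment in Step 3 runs as the monitor ServiceAccount, which is bound to cluster-admin; to run Prometheus with the narrower permissions defined here, change serviceAccountName to prometheus in case3-2-prometheus-deployment.yaml.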

apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
  namespace: monitor
---
apiVersion: v1
kind: Secret
metadata:
  name: monitor-token
  namespace: monitor
  annotations:
    kubernetes.io/service-account.name: "prometheus"
type: kubernetes.io/service-account-token
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus
rules:
  - apiGroups:
      - ""
    resources:
      - nodes
      - services
      - endpoints
      - pods
      - nodes/proxy
    verbs:
      - get
      - list
      - watch
  - apiGroups:
      - "extensions"
      - "networking.k8s.io"
    resources:
      - ingresses
    verbs:
      - get
      - list
      - watch
  - apiGroups:
      - ""
    resources:
      - configmaps
      - nodes/metrics
    verbs:
      - get
  - nonResourceURLs:
      - /metrics
    verbs:
      - get
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
  - kind: ServiceAccount
    name: prometheus
    namespace: monitor
kubectl apply -f case4-prom-rbac.yaml
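A ClusterRole grants a request only if some rule lists the request's verb together with its API group and resource, or, for nonResourceURLs, its URL path. The sketch below encodes the core-group and nonResourceURL rules of the prometheus ClusterRole above as Python data and checks a few requests against them. It is a simplified model that handles exact matches and the * wildcard only, and the class and function names are hypothetical. On a live cluster, kubectl auth can-i list nodes --as=system:serviceaccount:monitor:prometheus asks the API server the same question.

```python
from dataclasses import dataclass, field


@dataclass
class PolicyRule:
    verbs: list
    api_groups: list = field(default_factory=list)
    resources: list = field(default_factory=list)
    non_resource_urls: list = field(default_factory=list)


def _has(values, wanted):
    return "*" in values or wanted in values


def allows(rules, verb, api_group=None, resource=None, url=None):
    """Return True if any rule permits the request.

    Pass api_group/resource for API objects, or url for non-resource paths.
    """
    for rule in rules:
        if not _has(rule.verbs, verb):
            continue
        if url is not None:
            if _has(rule.non_resource_urls, url):
                return True
        elif _has(rule.api_groups, api_group) and _has(rule.resources, resource):
            return True
    return False


# Core-group and nonResourceURL rules from the `prometheus` ClusterRole
PROMETHEUS_RULES = [
    PolicyRule(verbs=["get", "list", "watch"], api_groups=[""],
               resources=["nodes", "services", "endpoints", "pods", "nodes/proxy"]),
    PolicyRule(verbs=["get"], api_groups=[""], resources=["configmaps", "nodes/metrics"]),
    PolicyRule(verbs=["get"], non_resource_urls=["/metrics"]),
]
```

For example, allows(PROMETHEUS_RULES, "get", "", "nodes/proxy") is True, which is what the kubernetes-node-cadvisor job needs to reach /api/v1/nodes/<node>/proxy/metrics/cadvisor, while listing configmaps or reading secrets is denied.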

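Step 5: Deploy kube-state-metrics

Create the combined configuration file case5-kube-state-metrics-deploy.yaml, which contains a Deployment, a ServiceAccount, a ClusterRole, a ClusterRoleBinding, and a Service. The namespaced objects are created in kube-system.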

apiVersion: apps/v1
kind: Deployment
metadata:
  name: kube-state-metrics
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: kube-state-metrics
  template:
    metadata:
      labels:
        app: kube-state-metrics
    spec:
      serviceAccountName: kube-state-metrics
      containers:
        - name: kube-state-metrics
          image: registry.cn-beijing.ksyuncs.com/zbl/kube-state-metrics:v2.6.0
          ports:
            - containerPort: 8080
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: kube-state-metrics
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kube-state-metrics
rules:
  - apiGroups: [""]
    resources: ["nodes", "pods", "services", "resourcequotas", "replicationcontrollers", "limitranges", "persistentvolumeclaims", "persistentvolumes", "namespaces", "endpoints"]
    verbs: ["list", "watch"]
  - apiGroups: ["apps"]
    resources: ["daemonsets", "deployments", "replicasets", "statefulsets"]
    verbs: ["list", "watch"]
  - apiGroups: ["batch"]
    resources: ["cronjobs", "jobs"]
    verbs: ["list", "watch"]
  - apiGroups: ["autoscaling"]
    resources: ["horizontalpodautoscalers"]
    verbs: ["list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: kube-state-metrics
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kube-state-metrics
subjects:
  - kind: ServiceAccount
    name: kube-state-metrics
    namespace: kube-system
---
apiVersion: v1
kind: Service
metadata:
  annotations:
    prometheus.io/scrape: 'true'
  name: kube-state-metrics
  namespace: kube-system
  labels:
    app: kube-state-metrics
spec:
  type: NodePort
  ports:
    - name: kube-state-metrics
      port: 8080
      targetPort: 8080
      nodePort: 31666
      protocol: TCP
  selector:
    app: kube-state-metrics
kubectl apply -f case5-kube-state-metrics-deploy.yaml

Verify the deployment:

kubectl get sa -n kube-system | grep kube-state-metrics
kubectl get clusterrole -n kube-system | grep kube-state-metrics
kubectl get clusterrolebinding -n kube-system | grep kube-state-metrics
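kube-state-metrics needs no dedicated scrape job: its Service carries the annotation prometheus.io/scrape: 'true', and Prometheus exposes that annotation as the label __meta_kubernetes_service_annotation_prometheus_io_scrape (characters not allowed in label names become underscores), so the keep rule of the kubernetes-service-endpoints job retains its endpoints. The sketch below models the keep action with Python's re.fullmatch, assuming the same anchored-regex semantics as above; the function name and the target dictionaries in the usage note are hypothetical.

```python
import re

SCRAPE_LABEL = "__meta_kubernetes_service_annotation_prometheus_io_scrape"


def relabel_keep(targets, source_labels, regex, separator=";"):
    """Apply one `action: keep` rule: drop targets whose joined value does not match."""
    kept = []
    for labels in targets:
        value = separator.join(labels.get(name, "") for name in source_labels)
        if re.fullmatch(regex, value):  # Prometheus anchors the regex
            kept.append(labels)
    return kept
```

Run against discovered endpoints, relabel_keep(targets, [SCRAPE_LABEL], "true") keeps kube-state-metrics and also the node-exporter Service from Step 2, which carries the same annotation, while unannotated Services are dropped. The match is case-sensitive, so an annotation value of True would not be kept.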

Step 6: Deploy Grafana and configure the data source and dashboards

  1. Deploy Grafana: create grafana-enterprise.yaml and apply it:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: grafana-enterprise
  namespace: monitor
spec:
  replicas: 1
  selector:
    matchLabels:
      app: grafana-enterprise
  template:
    metadata:
      labels:
        app: grafana-enterprise
    spec:
      containers:
        - image: grafana/grafana
          imagePullPolicy: Always
          securityContext:
            allowPrivilegeEscalation: false
            runAsUser: 0
          name: grafana
          ports:
            - containerPort: 3000
              protocol: TCP
          volumeMounts:
            - mountPath: "/var/lib/grafana"
              name: data
          resources:
            requests:
              cpu: 100m
              memory: 100Mi
            limits:
              cpu: 500m
              memory: 2500Mi
      volumes:
        - name: data
          emptyDir: {}
---
apiVersion: v1
kind: Service
metadata:
  name: grafana
  namespace: monitor
spec:
  type: NodePort
  ports:
    - port: 80
      targetPort: 3000
      nodePort: 31000
  selector:
    app: grafana-enterprise
kubectl apply -f grafana-enterprise.yaml
  2. Access Grafana at http://<node IP>:31000. The default username and password are both admin; Grafana asks you to set a new password on first login.

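  3. Add the Prometheus data source: go to Configuration > Data Sources > Add data source (Connections > Data sources in newer Grafana versions), select Prometheus, name it prometheus, and enter the Prometheus Service address.

Note: When the URL uses a node's internal IP or EIP, use the NodePort 30090 (for example, http://10.0.5.43:30090). The in-cluster Service address uses port 9090.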

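  4. Import the monitoring dashboards:

    • Click + > Import.

    • Enter a dashboard ID, one at a time: Node Exporter (11074), Kubernetes Cluster (6417), Kubernetes Pods (6336).

    • For each dashboard, select the Prometheus data source created in the previous step.

    • Click Import to finish.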
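The data source from item 3 can also be created through Grafana's HTTP API (POST /api/datasources) instead of the UI. This is handy because the Deployment stores /var/lib/grafana in an emptyDir, so settings are lost whenever the Pod is recreated. The sketch below builds that request with the Python standard library. It assumes basic-auth credentials admin/admin (substitute the password you set) and the in-cluster Prometheus URL recommended in the troubleshooting appendix; the helper names are hypothetical and the payload covers only the basic fields.

```python
import base64
import json
import urllib.request

PROMETHEUS_URL = "http://prometheus.monitor.svc.cluster.local:9090"


def datasource_payload(name="prometheus", url=PROMETHEUS_URL):
    """JSON body for Grafana's POST /api/datasources."""
    return {
        "name": name,
        "type": "prometheus",
        "url": url,
        "access": "proxy",  # Grafana's backend queries Prometheus server-side
        "isDefault": True,
    }


def build_request(grafana_url, user, password, payload):
    """Build (but do not send) the HTTP request that creates the data source."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        grafana_url.rstrip("/") + "/api/datasources",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "Authorization": f"Basic {token}"},
        method="POST",
    )
```

To send it, call urllib.request.urlopen(build_request("http://10.0.5.43:31000", "admin", "admin", datasource_payload())). The "proxy" access mode matters here: the in-cluster DNS name resolves for the Grafana Pod, but not for a browser outside the cluster.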

Step 7: Test the monitoring setup

  1. Create the test Deployment nginx01.yaml and apply it:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx01
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nginx01
  template:
    metadata:
      labels:
        app: nginx01
    spec:
      containers:
        - name: nginx
          image: nginx:1.7.9
kubectl apply -f nginx01.yaml
  2. Check the Deployment status:

kubectl get deployments.apps
# Expected output
NAME      READY   UP-TO-DATE   AVAILABLE   AGE
nginx01   2/2     2            2           55s
  3. In the Grafana dashboards, view the container resource metrics of nginx01 and confirm that the data is displayed.
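Besides the dashboards, the test Deployment can be checked directly against Prometheus. kube-state-metrics exports kube_deployment_status_replicas_available, and Prometheus's instant-query endpoint (GET /api/v1/query) returns the matching series as a JSON vector whose values are strings. The sketch below builds the query URL and reads the result; it assumes the NodePort address http://10.0.5.43:30090 from Step 3 and the default namespace for nginx01, and the helper names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

PROMETHEUS = "http://10.0.5.43:30090"  # NodePort from case3-3-prometheus-svc.yaml


def query_url(base, promql):
    """URL for Prometheus's instant-query API."""
    return base.rstrip("/") + "/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def vector_values(response):
    """Map each series' labels (as a sorted tuple) to its float value."""
    if response.get("status") != "success":
        raise RuntimeError(response.get("error", "query failed"))
    result = {}
    for series in response["data"]["result"]:
        _timestamp, value = series["value"]  # [unix time, "value as string"]
        result[tuple(sorted(series["metric"].items()))] = float(value)
    return result


def available_replicas(deployment, namespace="default", base=PROMETHEUS):
    """Query Prometheus and return the available replica count, or None."""
    promql = (f'kube_deployment_status_replicas_available'
              f'{{namespace="{namespace}",deployment="{deployment}"}}')
    with urllib.request.urlopen(query_url(base, promql), timeout=10) as resp:
        values = vector_values(json.load(resp))
    return next(iter(values.values()), None)
```

With the Deployment from item 1 running, available_replicas("nginx01") should report 2.0, matching the READY column of the kubectl output; None means Prometheus has no matching series yet.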

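Appendix: Troubleshooting

  • cAdvisor Pods fail to start: check the Pod logs with kubectl logs -n monitor -l app=cAdvisor (label values are case-sensitive). A common cause is kernel version incompatibility; try an older image version.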

  • A Prometheus target is DOWN: check the Prometheus configuration (kubectl describe configmap -n monitor prometheus-config), then the Service endpoints (kubectl get endpoints -n monitor), and confirm that the Pod labels match the Service selector.

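  • Grafana dashboards show no data:

    • Confirm that the data source URL is correct (http://prometheus.monitor.svc.cluster.local:9090 is recommended).

    • Wait about 5 minutes and refresh the page so that Prometheus has had time to collect data.

    • Set the time range to Last 5 minutes.

    • Check that the dashboard variables (instance/node) match your environment.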

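  • Image pull failures: in mainland China environments, the following alternative images can be used. These tags differ from the versions used in the manifests above.

| Component | Alternative image |
| --- | --- |
| cAdvisor | docker.io/google/cadvisor:v0.47.2 |
| Node Exporter | quay.io/prometheus/node-exporter:v1.7.0 |
| Prometheus | docker.io/prom/prometheus:v2.52.0 |
| Grafana | docker.io/grafana/grafana:10.4.0 |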
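When a target is DOWN, Prometheus's GET /api/v1/targets endpoint lists every active target with its scrape pool (job), health, and last error, which is often quicker than scanning the Targets page. The sketch below groups unhealthy targets by job. It assumes the NodePort address from Step 3 and the response shape documented for the Prometheus HTTP API; the function names are hypothetical.

```python
import json
import urllib.request
from collections import defaultdict


def unhealthy_targets(response):
    """Group active targets whose health is not "up" by scrape pool (job)."""
    problems = defaultdict(list)
    for target in response["data"]["activeTargets"]:
        if target.get("health") != "up":
            problems[target.get("scrapePool", "?")].append(
                (target.get("scrapeUrl", "?"), target.get("lastError", "")))
    return dict(problems)


def fetch_targets(base="http://10.0.5.43:30090"):
    """Download the target list from Prometheus's HTTP API."""
    with urllib.request.urlopen(base.rstrip("/") + "/api/v1/targets", timeout=10) as resp:
        return json.load(resp)
```

Calling unhealthy_targets(fetch_targets()) returns an empty dict when every target is up; otherwise the lastError text (for example, a refused connection on port 9100) usually points to the failing component.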
