
net-exporter: Best Practices for the Container Network Observability Tool

Last updated: 2026-01-28 14:48:26

In containerized cluster environments, troubleshooting network problems spans many dimensions and complex paths. Traditional approaches are time-consuming and labor-intensive, and key lines of investigation are easily missed. To improve the efficiency of container network troubleshooting and pinpoint the root cause of network anomalies, the container network observability tool net-exporter was developed and released, providing full-link observability for cluster networking.

Core Architecture and Deployment Model

net-exporter is deployed as a DaemonSet, which ensures the component runs on every node in the cluster and provides network data coverage across all nodes.
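As a quick sanity check after installation, the minimal sketch below lists the net-exporter DaemonSet with the official Kubernetes Python client and compares ready Pods against scheduled nodes. The assumption that the DaemonSet name contains "net-exporter" is ours; the actual name and namespace may differ in your cluster.

# Minimal sketch: confirm the net-exporter DaemonSet covers every node.
# Assumptions: the `kubernetes` Python client is installed, a kubeconfig is
# available, and the DaemonSet name contains "net-exporter" (placeholder).
from kubernetes import client, config

config.load_kube_config()  # use config.load_incluster_config() inside a Pod

apps = client.AppsV1Api()
for ds in apps.list_daemon_set_for_all_namespaces().items:
    if "net-exporter" in ds.metadata.name:
        desired = ds.status.desired_number_scheduled
        ready = ds.status.number_ready
        print(f"{ds.metadata.namespace}/{ds.metadata.name}: {ready}/{desired} nodes ready")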

Data Collection Methods

The component collects full-dimension network observability data at both the host and Pod level from multiple sources, covering the entire network protocol stack:

  • The /proc file system

  • eBPF dynamic tracing

  • System call interfaces (e.g., netlink, conntrack)

  • Linux network commands (e.g., ifconfig, iptables)

  • dmesg kernel logs

  • The Kubernetes API

Data Output Modes

Metrics collected through the sources above are exposed uniformly on the /metrics endpoint and are scraped by the monitoring system in pull mode.

Event data supports two output modes: standard output (stdout) and pushing to a Grafana Loki logging system, meeting log storage and analysis needs in different scenarios.
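To illustrate the pull model, the sketch below fetches the /metrics endpoint of a single node and prints the conntrack series. The port 9102 is only a placeholder assumption, since this document does not state which port the component listens on; substitute the port actually exposed in your cluster.

# Minimal sketch: pull /metrics from one node and print the conntrack series.
# The node IP is taken from the examples in this document; the port is a
# placeholder assumption, not something this document specifies.
import urllib.request

NODE_IP = "192.168.88.112"
PORT = 9102  # placeholder

with urllib.request.urlopen(f"http://{NODE_IP}:{PORT}/metrics", timeout=5) as resp:
    for line in resp.read().decode().splitlines():
        if line.startswith("netexporter_conntrack_"):
            print(line)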

Core Observability Metrics

net-exporter covers multiple core observation points, including conntrack, qdisc, netdev, and the TCP protocol stack, and provides a rich set of quantitative metrics that accurately reflect the state of the network. The metrics for each observation point are listed below in the form metric name (display name): description.

conntrack

  • netexporter_conntrack_found (Conntrack Found times): Number of successful conntrack entry lookups. Example:
    netexporter_conntrack_found{k8s_node="192.168.88.112"} 10
    Examples for the other metrics of this observation point are similar and omitted.

  • netexporter_conntrack_invalid (Conntrack Invalid times): Number of times a conntrack entry could not be created for various reasons, but the packet was not dropped.

  • netexporter_conntrack_ignore (Conntrack Ignore times): Number of packets skipped because a conntrack entry already exists or the protocol does not need connection tracking.

  • netexporter_conntrack_insert (Conntrack Insert times): Number of entries inserted into the table. Currently not counted; always 0.

  • netexporter_conntrack_insertfailed (Conntrack Insert failed times): Number of failed table insertions, for example when NAT picked an identical source address and port, or SNAT had no port available.

  • netexporter_conntrack_drop (Conntrack Drop times): Number of packets dropped because a conntrack entry could not be created.

  • netexporter_conntrack_earlydrop (Conntrack Early drop times): The conntrack table was full, so an existing entry without bidirectional traffic was evicted; this may make existing connections unstable.

  • netexporter_conntrack_error (Conntrack Error times): Number of ICMP(v6) packets received that do not match any existing connection.

  • netexporter_conntrack_searchrestart (Conntrack Search restart times): Number of retries performed because a conntrack lookup failed.

  • netexporter_conntrack_entries (Conntrack Entries): Number of conntrack entries that currently exist.

  • netexporter_conntrack_maxentries (Conntrack Max Entries): Maximum number of conntrack entries supported.

qdisc

  • netexporter_qdisc_bytes (TC Qdisc bytes): Total bytes sent at the TC qdisc layer by all devices on the node whose names start with cali or eth. Example:
    netexporter_qdisc_bytes{k8s_node="192.168.88.112"} 100
    Examples for the other metrics of this observation point are similar and omitted.

  • netexporter_qdisc_packets (TC Qdisc Packets): Total packets sent at the TC qdisc layer by all devices on the node whose names start with cali or eth.

  • netexporter_qdisc_drops (TC Qdisc Drops): Total packets dropped at the TC qdisc layer by all devices on the node whose names start with cali or eth.

  • netexporter_qdisc_overlimits (TC Qdisc Overlimits): Total packets exceeding the configured limits at the TC qdisc layer across all devices whose names start with cali or eth.

  • netexporter_qdisc_backlog (TC Qdisc Backlog): Total bytes still queued at the TC qdisc layer across all devices whose names start with cali or eth.

  • netexporter_qdisc_qlen (TC Qdisc Qlen): Total length of the TC queues of all devices on the node whose names start with cali or eth.

netdev

  • netexporter_netdev_rxbytes (Network Device RX bytes): Total bytes received by the network device. Example:
    netexporter_netdev_rxbytes{k8s_namespace="default",k8s_node="192.168.88.112",k8s_pod="nginx-1",if_name="eth0"} 3000
    Examples for the other metrics of this observation point are similar and omitted.

  • netexporter_netdev_rxerrors (Network Device RX errors(times)): Number of receive errors on the network device.

  • netexporter_netdev_rxpackets (Network Device RX packets): Total packets received by the network device.

  • netexporter_netdev_rxdropped (Network Device RX dropped(times)): Number of packets dropped due to receive errors on the network device.

  • netexporter_netdev_txbytes (Network Device TX bytes): Total bytes sent by the network device.

  • netexporter_netdev_txerrors (Network Device TX errors(times)): Number of transmit errors on the network device.

  • netexporter_netdev_rxfifos (Ring Buffer RX overruns/fifo(times)): Ring buffer RX queue overrun count.

  • netexporter_netdev_txfifos (Ring Buffer TX overruns/fifo(times)): Ring buffer TX queue overrun count.

  • netexporter_netdev_rxcrcerrors (Network Device RX CRC errors(times)): Number of CRC checksum errors.

  • netexporter_netdev_txpackets (Network Device TX packets): Total packets sent by the network device.

  • netexporter_netdev_txdropped (Network Device TX dropped(times)): Number of packets dropped due to transmit errors on the network device.

tcp

  • netexporter_tcp_currestab (TCP Current Conns established): Number of currently established (active) TCP connections. Example:
    netexporter_tcp_currestab{k8s_namespace="default",k8s_node="192.168.88.112",k8s_pod="nginx-1"} 30
    Examples for the other metrics of this observation point are similar and omitted.

  • netexporter_tcp_attemptfails (TCP Attempt fails(times)): Total number of TCP connection attempts that ultimately failed.

  • netexporter_tcp_estabresets (TCP Established Conns resets(times)): Number of TCP connections closed abnormally.

  • netexporter_tcp_inerrs (TCP Received errors): Total number of erroneous packets received at the TCP layer.

  • netexporter_tcp_insegs (TCP Received segs): Total number of valid segments received at the TCP layer.

  • netexporter_tcp_outrsts (TCP Send Resets): Number of reset (RST) segments sent by TCP.

  • netexporter_tcp_outsegs (TCP Send segs): Total number of valid segments sent at the TCP layer.

  • netexporter_tcp_activeopens (TCP Active Conns): Number of times TCP successfully sent an initial SYN; SYN retransmissions are excluded, but failed connection attempts also increase this counter.

  • netexporter_tcp_passiveopens (TCP Passive Conns): Cumulative count of handshakes completed with a sock successfully allocated; this can usually be read as the number of successfully established inbound connections.

  • netexporter_tcp_retransseg (TCP Retransmit segs): Total number of retransmitted TCP segments, with TSO segmentation already accounted for.

tcpext

  • netexporter_tcpext_listenoverflows (TCP Listen Overflows(times)): Incremented when the accept queue of a socket in the LISTEN state overflows while accepting a connection. Example:
    netexporter_tcpext_listenoverflows{k8s_namespace="default",k8s_node="192.168.88.112",k8s_pod="nginx-1"} 0
    Examples for the other metrics of this observation point are similar and omitted.

  • netexporter_tcpext_listendrops (TCP Listen Drops(times)): Incremented when a socket in the LISTEN state fails to create a sock in the SYN_RECV state.

  • netexporter_tcpext_tcpsynretrans (TCP SYN Retransmit(times)): Number of retransmitted SYN packets.

  • netexporter_tcpext_tcpfastretrans (TCP Fast Retransmit(times)): Counted for every retransmission performed while the congestion-control (CA) state is not Loss.

  • netexporter_tcpext_tcpretransfail (TCP Retransmit Fail(times)): Counted when a retransmission returns an error other than EBUSY, meaning the retransmission could not complete normally.

  • netexporter_tcpext_tcptimeouts (TCP Timeouts(times)): Triggered when the congestion-control state has not entered recovery/loss/disorder; also counted when a SYN is retransmitted because it received no reply.

  • netexporter_tcpext_tcpabortonclose (TCP Conn Abort on close(times)): Counted when a TCP connection is closed for reasons outside the state machine while unread data remains, causing a reset to be sent.

  • netexporter_tcpext_tcpabortonmemory (TCP Conn Abort on memory(times)): Number of times a reset was sent to terminate a connection because tcp_check_oom detected memory exhaustion while allocating tw_sock/tcp_sock structures.

  • netexporter_tcpext_tcpabortontimeout (TCP Conn Abort on timeout(times)): Incremented when a reset is sent because keepalive, window probe, or retransmission attempts exceeded their limits.

  • netexporter_tcpext_tcpabortonlinger (TCP Conn Abort on linger timeout(times)): With TCP's Linger2 option enabled, the number of resets sent when quickly reclaiming connections in the FIN_WAIT2 state.

  • netexporter_tcpext_tcpabortondata (TCP Conn Abort on data(times)): Number of resets sent when connections are quickly reclaimed via reset because the Linger/Linger2 options are enabled.

  • netexporter_tcpext_tcpabortfailed (TCP Conn Abort on failed(times)): Number of times an attempt to abort a TCP connection failed to send the RST packet.

  • netexporter_tcpext_tcpackskippedsynrecv (TCP ACK Skipped in SYN_RECV(times)): Number of times a sock in the SYN_RECV state did not reply with an ACK.

  • netexporter_tcpext_tcpackskippedpaws (TCP ACK Skipped due to PAWS(times)): Number of times the PAWS mechanism triggered a correction, but out-of-window (OOW) rate limiting prevented the ACK from being sent.

  • netexporter_tcpext_tcpackskippedseq (TCP ACK Skipped due to Seq(times)): Number of times an out-of-window sequence number triggered a correction, but OOW rate limiting prevented the ACK from being sent.

  • netexporter_tcpext_tcpackskippedfinwait2 (TCP ACK Skipped in FIN_WAIT_2(times)): In the FIN_WAIT_2 state, the number of ACKs for out-of-window packets that were skipped because of rate limiting.

  • netexporter_tcpext_tcpackskippedtimewait (TCP ACK Skipped in TIME_WAIT(times)): In the TIME_WAIT state, the number of ACKs for out-of-window packets that were skipped because of rate limiting.

  • netexporter_tcpext_tcpackskippedchallenge (TCP ACK Skipped due to challenges(times)): Number of times a challenge ACK (usually used to confirm reset packets) needed to be sent but was suppressed by OOW rate limiting.

  • netexporter_tcpext_tcprcvqdrop (TCP RX Packets dropped(queue overflow)): Counted when the TCP receive queue backs up and memory cannot be allocated for incoming data.

  • netexporter_tcpext_pawsestab (TCP Conns Using PAWS): Number of TCP connections established using the PAWS mechanism.

  • netexporter_tcpext_tcpwinprobe (TCP Window Probe times): Number of times the sender actively probed the receiver's window state.

  • netexporter_tcpext_tcpkeepalive (TCP Keepalive Packets sent): Number of keepalive probe packets sent by the TCP keep-alive mechanism.

  • netexporter_tcpext_tcpmtupfail (TCP MTU Probe fails(times)): Number of failed TCP path MTU probes.

  • netexporter_tcpext_tcpmtupsuccess (TCP MTU Probe success(times)): Number of successful TCP path MTU probes.

  • netexporter_tcpext_tcpzerowindowdrop (TCP Packets Dropped(Zero Window)): Number of packets dropped by the receiver because of a zero window.

  • netexporter_tcpext_tcpbacklogdrop (TCP Packets Dropped(backlog queue)): Number of packets dropped because the TCP socket backlog queue overflowed.

  • netexporter_tcpext_pfmemallocdrop (TCP Packets Dropped(PF_MEMALLOC fail)): Number of packets dropped because the TCP layer could not allocate buffers under memory pressure.

  • netexporter_tcpext_tcpwqueuetoobig (TCP Send Queue Too Big(times)): Number of times packets were dropped or could not be sent because the TCP send queue exceeded its capacity limit.

  • netexporter_tcpext_embryonicrsts (TCP Half-open Conns reset): Number of TCP connections terminated by an RST received before the three-way handshake completed.

  • netexporter_tcpext_tcpmemorypressures (TCP Memory Pressure times): Number of times the kernel protocol stack entered the memory-pressure state.

  • netexporter_tcpext_tcpmemorypressureschrono (TCP Memory Pressure events): Cumulative time the TCP layer has spent in the memory-pressure state.

sock

  • netexporter_sock_inuse (TCP Sockets In Using): Number of active TCP sockets (including listening sockets and established connections). Example:
    netexporter_sock_inuse{k8s_namespace="default",k8s_node="192.168.88.112",k8s_pod="nginx-1"} 18
    Examples for the other metrics of this observation point are similar and omitted.

  • netexporter_sock_orphan (TCP Sockets orphaned): Number of orphaned TCP connections (not associated with any process), usually caused by abnormal disconnects.

  • netexporter_sock_tw (TCP Sockets In TIME_WAIT): Number of sockets in the TIME_WAIT state, used to ensure connections are closed reliably.

  • netexporter_sock_alloc (TCP Sockets allocated): Total number of allocated TCP sockets (including sockets in use and sockets awaiting reclamation).

  • netexporter_sock_mem (TCP Sockets Memory Pages): Total memory used by TCP sockets (in memory pages, typically 4 KB per page).

  • netexporter_sock_inuse_udp (UDP Sockets In Using): Number of currently active UDP sockets.

  • netexporter_sock_mem_udp (UDP Sockets Memory Pages): Memory used by UDP sockets.

softnetstat

  • netexporter_softnet_processed (Packets in CPU Backlog): Number of packets placed into the CPU backlog from the NIC and processed, summed over all CPUs for a single Pod. Example:
    netexporter_softnet_processed{k8s_namespace="kube-system",k8s_node="192.168.88.112",k8s_pod="vpc-cni-scheduler-5d5d56bccc-55nrz"} 2.6343316e+07

  • netexporter_softnet_dropped (Packets Dropped by CPU Backlog): Number of packets that failed to be placed into the CPU backlog from the NIC and were dropped, summed over all CPUs for a single Pod. Example:
    netexporter_softnet_dropped{k8s_namespace="kube-system",k8s_node="192.168.88.112",k8s_pod="vpc-cni-scheduler-5d5d56bccc-55nrz"} 0

tcpsummary

  • netexporter_tcpsummary_tcpestablishedconn (Tcpsummary Established Conns): Total number of TCP connections currently in the established state. Example:
    netexporter_tcpsummary_tcpestablishedconn{namespace="kube-system",node="192.168.88.112",pod="vpc-cni-scheduler-5d5d56bccc-55nrz"} 2
    Examples for the other metrics of this observation point are similar and omitted.

  • netexporter_tcpsummary_tcptimewaitconn (Tcpsummary TIME_WAIT Conns): Number of TCP connections currently in the TIME_WAIT state.

  • netexporter_tcpsummary_tcptxqueue (Tcpsummary TX Queue Bytes): Total bytes of data currently in the send queues of TCP connections in the ESTABLISHED state.

  • netexporter_tcpsummary_tcprxqueue (Tcpsummary RX Queue Bytes): Total bytes of data currently in the receive queues of TCP connections in the ESTABLISHED state.

packetloss

  • netexporter_packetloss_netfilter (Packet Loss times(Netfilter)): Packets dropped at the Netfilter layer. Example:
    netexporter_packetloss_netfilter{k8s_node="192.168.88.112"} 1

  • netexporter_packetloss_otherhost (Packet Loss times(OtherHost)): Packets dropped because the destination MAC address does not match the MAC of the receiving NIC. Example:
    netexporter_packetloss_otherhost{k8s_node="192.168.88.112"} 1

  • netexporter_packetloss_nosocket (Packet Loss times(No Socket)): Packets dropped because no matching socket was found. Example:
    netexporter_packetloss_nosocket{k8s_node="192.168.88.112"} 1

  • netexporter_packetloss_iprpfilter (Packet Loss times(IP rpfilter)): Packets dropped because rp_filter validation failed. Example:
    netexporter_packetloss_iprpfilter{k8s_node="192.168.88.112"} 1

  • netexporter_packetloss_nomem (Packet Loss times(OOM)): Packets dropped due to insufficient memory. Example:
    netexporter_packetloss_nomem{k8s_node="192.168.88.112"} 1

  • netexporter_packetloss_zerowindow (Packet Loss times(Zero Window)): Packets dropped because the TCP window was zero. Example:
    netexporter_packetloss_zerowindow{k8s_node="192.168.88.112"} 1

  • netexporter_packetloss_total (Packet Loss times(Total)): Total number of dropped packets. Example:
    netexporter_packetloss_total{k8s_node="192.168.88.112"} 100

arp

  • netexporter_arp_maxentries (ARP Max Entries): Maximum number of ARP entries supported.

  • netexporter_arp_entries (ARP Entries): Number of ARP entries that currently exist.

  • netexporter_arp_unresolveddiscards (ARP Unresolved Discards): Related to the kernel queue length for unresolved address requests; if concurrent ARP requests exceed the queue length, ARP requests are dropped.

PodIP

  • netexporter_podip_maxentries (PodIP Max Entries): Maximum number of Pod IPs supported on the node.

  • netexporter_podip_entries (PodIP Used Number): Number of Pod IPs already in use on the node.

  • netexporter_podip_redisuals (PodIP Residuals Number): Number of leftover (residual) Pod IPs on the node.

  • netexporter_podip_notinpodcidrs (PodIP NotIn PodCIDRs Number): Number of Pods whose IP is not within the node's PodCIDR range.

accept queue

  • netexporter_sockbacklog_recvq (Socket RecvQ): Number of TCP connections that have completed the three-way handshake and are waiting for the server to call accept(); if this stays non-zero, the accept (full connection) queue is full.
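As an example of turning these metrics into an actionable signal, the sketch below queries a Prometheus server that already scrapes net-exporter and reports per-node conntrack table utilization, using netexporter_conntrack_entries and netexporter_conntrack_maxentries from the table above. The Prometheus address is a placeholder assumption.

# Minimal sketch: per-node conntrack table utilization via the Prometheus
# HTTP API. PROM_URL is a placeholder; the metric names come from the
# table above.
import json, urllib.parse, urllib.request

PROM_URL = "http://prometheus.example:9090"  # placeholder
QUERY = "netexporter_conntrack_entries / netexporter_conntrack_maxentries"

url = f"{PROM_URL}/api/v1/query?query={urllib.parse.quote(QUERY)}"
with urllib.request.urlopen(url, timeout=5) as resp:
    data = json.load(resp)

for series in data["data"]["result"]:
    node = series["metric"].get("k8s_node", "unknown")
    usage = float(series["value"][1])
    print(f"{node}: conntrack table usage {usage:.1%}")

Utilization approaching 100% correlates with increases in the drop and earlydrop counters described above.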

Event Metrics

Event metrics focus on key events related to cluster networking, such as configuration anomalies and abnormal states. They capture, in real time, anomalies across kernel parameters, Kubernetes configuration, network components, routing, node configuration, MTU, and other dimensions, and emit them as structured logs so that operators can quickly discover and locate configuration and runtime issues. The events for each observation point are listed below as the event type followed by an example event log.

KernelParam

  • RPFilterKernelParam:
    {"timestamp":1741163909810866660,"type":"RPFilterKernelParam","msg":"net.ipv4.conf.calicb6db6f9b07.rp_filter:current_value=1,expect_value=0"}

  • TCPSyncookiesKernelParam:
    {"timestamp":1741163909810866660,"type":"TCPSyncookiesKernelParam","msg":"net.ipv4.tcp_syncookies:current_value=0,expect_value=1"}

  • LiberalKernelParam:
    {"timestamp":1741163909810866660,"type":"LiberalKernelParam","msg":"net.netfilter.nf_conntrack_tcp_be_liberal:current_value=0,expect_value=1"}

  • TCPSackKernelParam:
    {"timestamp":1741163909810866660,"type":"TCPSackKernelParam","msg":"net.ipv4.tcp_sack:current_value=0,expect_value=1"}

  • ProxyARPKernelParam:
    {"timestamp":1741163909810866660,"type":"ProxyARPKernelParam","msg":"net.ipv4.conf.eth0.proxy_arp:current_value=1,expect_value=0"}

k8sconfig

  • NetworkPolicy:
    {"timestamp":1741163909810866660,"type":"NetworkPolicy","msg":"network policy is configured,please use it carefully."}

  • SVCTrafficPolicy:
    {"timestamp":1741163909810866660,"type":"LocalTrafficPolicy","msg":"externalTrafficPolicy of svc(ng-svc/default,...) is Local, internalTrafficPolicy of svc (ng-svc/default,...) is Local, please check it."}

  • IPMasqAgentConf:
    {"timestamp":1741163909810866660,"type":"IPMasqAgentConf","msg":"NonMasqueradeCIDRs does not contain vpc cidr, it will reduce pod network performance."}

  • CIDRConflict:
    {"timestamp":1741163909810866660,"type":"CIDRConflict","msg":"cidr conflicts:vpc cidr 172.31.0.0/16, svc cidr 10.254.0.0/16, pod cidr 192.168.0.0/16, docker cidr 172.31.0.0/16."}

KubeProxy

  • KubeProxyAbnormal:
    {"timestamp":1741163909810866660,"type":"KubeProxy","msg":"kube-proxy looks abnormal, please check it."}

IPVS

  • IPVSRulesAbnormal:
    {"timestamp":1741163909810866660,"type":"IPVS","msg":"svc(kube-system/nginx-11222,...) ipvs rules are abnormal, please check it."}

route

  • VpcRoute:
    {"timestamp":1741163909810866660,"type":"VpcRoute","msg":"vpc route does not exist for this node."}
    {"timestamp":1741163909810866660,"type":"VpcRoute","msg":"vpc route creates failed because of conflict."}

  • PodRoute:
    {"timestamp":1741163909810866660,"type":"PodRoute","msg":"lack routes of pods(ng/default,...)."}

  • DefaultRoute:
    {"timestamp":1741163909810866660,"type":"DefaultRoute","msg":"default route of pod don't exist(ng-1/default,...)"}
    {"timestamp":1741163909810866660,"type":"DefaultRoute","msg":"default route of node don't exist(nodeIp)."}

  • PolicyRoute:
    {"timestamp":1741163909810866660,"type":"PolicyRoute","msg":"rule of vpc-cni pod vpc-cni-ng-0/default is not created."}

nodeconfig

  • KernelLog:
    {"timestamp":1741163909810866660,"type":"KernelLog","msg":"['neighbour neighbour: arp_cache: neighbor table overflow!']"}

  • IPtablesAbnormal:
    {"timestamp":1741163909810866660,"type":"IPtablesAbnormal","msg":"FORWARD drop default."}

  • IRQBalance:
    {"timestamp":1741163909810866660,"type":"IRQBalance","msg":"system irqbalance service is running, please stop it and use ksc-queue."}

mtu

  • MTUAbnormal:
    {"timestamp":1741163909810866660,"type":"MTUAbnormal","msg":"eth0 mtu is 900 which is abnormal, please check."}

Installation and Deployment Guide

Component-based Installation

The tool is managed as an add-on component, so deployment is simple and efficient: open the corresponding component management page and click Install to complete the deployment.

Integrating with Prometheus

The observability tool can be integrated with either a self-managed Prometheus or a KCE managed Prometheus instance for unified metric collection and analysis. The following uses KCE managed Prometheus as an example to describe the integration procedure:

1. Log in to the KCE console and open the management page of the target managed Prometheus instance;

2. On the new collection configuration tab, fill in the parameters (such as the scrape targets and metrics path) by following the corresponding configuration example;

3. Submit the configuration and wait for it to be delivered and take effect; the integration is then complete.
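Once the collection configuration has taken effect, a quick way to confirm the integration is to check that net-exporter series are visible in Prometheus. The sketch below counts all series whose names start with netexporter_ through the Prometheus HTTP API; the Prometheus endpoint is a placeholder assumption and should be replaced with your managed or self-built instance's query address.

# Minimal sketch: verify that net-exporter metrics are being collected.
# PROM_URL is a placeholder for the Prometheus query endpoint.
import json, urllib.parse, urllib.request

PROM_URL = "http://prometheus.example:9090"  # placeholder
QUERY = 'count({__name__=~"netexporter_.+"})'

url = f"{PROM_URL}/api/v1/query?query={urllib.parse.quote(QUERY)}"
with urllib.request.urlopen(url, timeout=5) as resp:
    result = json.load(resp)["data"]["result"]

print("net-exporter series found:", result[0]["value"][1] if result else 0)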
