Last updated: 2026-01-28 14:48:26
In containerized cluster environments, troubleshooting network issues spans many dimensions and complex data paths. Traditional approaches are time-consuming and labor-intensive, and key lines of investigation are easily missed. To improve the efficiency of container network troubleshooting and pinpoint the root cause of network anomalies, the container network observability tool net-exporter has been developed and released, providing full-link observability for cluster networking.
net-exporter is deployed as a DaemonSet, which ensures the component runs on every node in the cluster and provides network data coverage across all nodes.
The component collects host-level and Pod-level network observability data across the entire network protocol stack through multiple mechanisms, including:
- The /proc filesystem
- eBPF dynamic tracing
- System interfaces (e.g., netlink, conntrack)
- Linux networking commands (e.g., ifconfig, iptables)
- dmesg kernel logs
- The Kubernetes API
Metrics collected through these mechanisms are exposed via a unified /metrics endpoint and scraped in pull mode by the monitoring system. Event data supports two output modes: standard output (stdout) and pushing to the Grafana Loki logging system, covering log storage and analysis needs in different scenarios.
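As a quick way to see the pull model in action, the sketch below fetches one scrape from a net-exporter instance and keeps only its own samples. This is a minimal illustration rather than part of the product: the local address and port (9102) are assumptions, so substitute the address actually exposed by the Pod or Service in your cluster.

```python
# Minimal sketch: pull one scrape from a net-exporter instance and keep only
# the netexporter_* sample lines.
# NOTE: the address and port below are assumptions for illustration; check the
# actual Pod/Service definition in your cluster.
from urllib.request import urlopen

METRICS_URL = "http://127.0.0.1:9102/metrics"  # hypothetical endpoint

def fetch_netexporter_samples(url: str = METRICS_URL) -> list[str]:
    """Return only the netexporter_* sample lines from a Prometheus-format scrape."""
    with urlopen(url, timeout=5) as resp:
        text = resp.read().decode("utf-8")
    return [line for line in text.splitlines() if line.startswith("netexporter_")]

if __name__ == "__main__":
    for sample in fetch_netexporter_samples()[:20]:
        print(sample)
```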
net-exporter covers core observation points such as conntrack, qdisc, netdev, and the TCP protocol stack, providing a rich set of quantitative metrics that accurately reflect the state of the network. The metrics for each observation point are listed below:
| Observation point | Metric name | Display name | Description |
|---|---|---|---|
| conntrack | netexporter_conntrack_found | Conntrack Found times | Number of successful conntrack entry lookups. Example: netexporter_conntrack_found{k8s_node="192.168.88.112"} 10. Examples for the other metrics under this observation point follow the same format and are omitted. |
| conntrack | netexporter_conntrack_invalid | Conntrack Invalid times | Number of times a conntrack entry could not be established during creation for various reasons, but the packet was not dropped. |
| conntrack | netexporter_conntrack_ignore | Conntrack Ignore times | Number of times conntrack processing was skipped because the entry already existed or the protocol does not require conntrack. |
| conntrack | netexporter_conntrack_insert | Conntrack Insert times | Number of entries inserted into the table. Currently not counted and always 0. |
| conntrack | netexporter_conntrack_insertfailed | Conntrack Insert failed times | Number of failed insertions, for example when NAT selected an already-used source address and port, or SNAT had no port available. |
| conntrack | netexporter_conntrack_drop | Conntrack Drop times | Number of packets dropped because a conntrack entry could not be established during creation. |
| conntrack | netexporter_conntrack_earlydrop | Conntrack Early drop times | The conntrack table was full and an existing entry for a connection without bidirectional traffic was evicted, which may make existing connections unstable. |
| conntrack | netexporter_conntrack_error | Conntrack Error times | Number of ICMP(v6) packets received that do not match any existing connection. |
| conntrack | netexporter_conntrack_searchrestart | Conntrack Search restart times | Number of retries caused by lookup failures during conntrack lookups. |
| conntrack | netexporter_conntrack_entries | Conntrack Entries | Number of entries currently in the table. |
| conntrack | netexporter_conntrack_maxentries | Conntrack Max Entries | Maximum number of entries supported. |
| qdisc | netexporter_qdisc_bytes | TC Qdisc bytes | Total bytes sent at the TC qdisc layer by all devices on the node whose names start with cali or eth. Example: netexporter_qdisc_bytes{k8s_node="192.168.88.112"} 100. Examples for the other metrics under this observation point are omitted. |
| qdisc | netexporter_qdisc_packets | TC Qdisc Packets | Total packets sent at the TC qdisc layer by all devices on the node whose names start with cali or eth. |
| qdisc | netexporter_qdisc_drops | TC Qdisc Drops | Total packets dropped at the TC qdisc layer by all devices on the node whose names start with cali or eth. |
| qdisc | netexporter_qdisc_overlimits | TC Qdisc Overlimits | Total packets that exceeded the configured limits at the TC qdisc layer on all devices whose names start with cali or eth. |
| qdisc | netexporter_qdisc_backlog | TC Qdisc Backlog | Total bytes still queued at the TC qdisc layer on all devices whose names start with cali or eth. |
| qdisc | netexporter_qdisc_qlen | TC Qdisc Qlen | Total length of the TC queues on all devices whose names start with cali or eth. |
| netdev | netexporter_netdev_rxbytes | Network Device RX bytes | Total bytes received by the network device. Example: netexporter_netdev_rxbytes{k8s_namespace="default",k8s_node="192.168.88.112",k8s_pod="nginx-1",if_name="eth0"} 3000. Examples for the other metrics under this observation point are omitted. |
| netdev | netexporter_netdev_rxerrors | Network Device RX errors(times) | Number of receive errors on the network device. |
| netdev | netexporter_netdev_rxpackets | Network Device RX packets | Total packets received by the network device. |
| netdev | netexporter_netdev_rxdropped | Network Device RX dropped(times) | Number of packets dropped by the device due to receive errors. |
| netdev | netexporter_netdev_txbytes | Network Device TX bytes | Total bytes sent by the network device. |
| netdev | netexporter_netdev_txerrors | Network Device TX errors(times) | Number of transmit errors on the network device. |
| netdev | netexporter_netdev_rxfifos | Ring Buffer RX overruns/fifo(times) | Number of RX ring buffer overruns. |
| netdev | netexporter_netdev_txfifos | Ring Buffer TX overruns/fifo(times) | Number of TX ring buffer overruns. |
| netdev | netexporter_netdev_rxcrcerrors | Network Device RX CRC errors(times) | Number of CRC checksum errors. |
| netdev | netexporter_netdev_txpackets | Network Device TX packets | Total packets sent by the network device. |
| netdev | netexporter_netdev_txdropped | Network Device TX dropped(times) | Number of packets dropped by the device due to transmit errors. |
| tcp | netexporter_tcp_currestab | TCP Current Conns established | Number of currently active (established) TCP connections. Example: netexporter_tcp_currestab{k8s_namespace="default",k8s_node="192.168.88.112",k8s_pod="nginx-1"} 30. Examples for the other metrics under this observation point are omitted. |
| tcp | netexporter_tcp_attemptfails | TCP Attempt fails(times) | Total number of TCP connection attempts that ultimately failed. |
| tcp | netexporter_tcp_estabresets | TCP Established Conns resets(times) | Number of times TCP connections were closed abnormally. |
| tcp | netexporter_tcp_inerrs | TCP Received errors | Total number of erroneous packets received at the TCP layer. |
| tcp | netexporter_tcp_insegs | TCP Received segs | Total number of valid segments received at the TCP layer. |
| tcp | netexporter_tcp_outrsts | TCP Send Resets | Number of RST packets sent by TCP. |
| tcp | netexporter_tcp_outsegs | TCP Send segs | Total number of valid segments sent at the TCP layer. |
| tcp | netexporter_tcp_activeopens | TCP Active Conns | Number of times TCP successfully sent the initial SYN of a handshake, excluding SYN retransmissions; failed connection attempts also increase this counter. |
| tcp | netexporter_tcp_passiveopens | TCP Passive Conns | Cumulative number of handshakes completed with a sock successfully allocated, which can usually be read as the number of successfully established inbound connections. |
| tcp | netexporter_tcp_retransseg | TCP Retransmit segs | Total number of retransmitted TCP segments; TSO segmentation is already accounted for in this count. |
| tcpext | netexporter_tcpext_listenoverflows | TCP Listen Overflows(times) | Incremented when a sock in LISTEN state cannot accept a connection because the accept (fully established) queue overflows. Example: netexporter_tcpext_listenoverflows{k8s_namespace="default",k8s_node="192.168.88.112",k8s_pod="nginx-1"} 0. Examples for the other metrics under this observation point are omitted. |
| tcpext | netexporter_tcpext_listendrops | TCP Listen Drops(times) | Incremented when a sock in LISTEN state fails to create a sock in SYN_RECV state. |
| tcpext | netexporter_tcpext_tcpsynretrans | TCP SYN Retransmit(times) | Number of retransmitted SYN packets. |
| tcpext | netexporter_tcpext_tcpfastretrans | TCP Fast Retransmit(times) | Counted for every retransmission performed while the congestion-avoidance (CA) state is not Loss. |
| tcpext | netexporter_tcpext_tcpretransfail | TCP Retransmit Fail(times) | Incremented when a retransmission returns an error other than EBUSY, meaning the retransmission could not complete normally. |
| tcpext | netexporter_tcpext_tcptimeouts | TCP Timeouts(times) | Triggered while the CA state has not entered Recovery/Loss/Disorder; retransmitting a SYN packet that received no reply is also counted. |
| tcpext | netexporter_tcpext_tcpabortonclose | TCP Conn Abort on close(times) | Incremented when a TCP connection is closed for reasons outside the state machine while unread data remains, causing a RST packet to be sent. |
| tcpext | netexporter_tcpext_tcpabortonmemory | TCP Conn Abort on memory(times) | Number of times a RST was sent to terminate a connection because tcp_check_oom reported memory exhaustion while allocating tw_sock/tcp_sock structures. |
| tcpext | netexporter_tcpext_tcpabortontimeout | TCP Conn Abort on timeout (times) | Incremented when a RST is sent because keepalive, window probe, or retransmission attempts exceeded their limits. |
| tcpext | netexporter_tcpext_tcpabortonlinger | TCP Conn Abort on linger timeout(times) | Number of RSTs sent when the TCP Linger2 option is enabled and connections in FIN_WAIT2 are reclaimed early. |
| tcpext | netexporter_tcpext_tcpabortondata | TCP Conn Abort on data(times) | Number of RSTs sent when connections are reclaimed early via RST because the Linger/Linger2 options are enabled. |
| tcpext | netexporter_tcpext_tcpabortfailed | TCP Conn Abort on failed(times) | Number of times terminating a TCP connection was attempted but sending the RST packet failed. |
| tcpext | netexporter_tcpext_tcpackskippedsynrecv | TCP ACK Skipped in SYN_RECV(times) | Number of times a sock in SYN_RECV state did not reply with an ACK. |
| tcpext | netexporter_tcpext_tcpackskippedpaws | TCP ACK Skipped due to PAWS(times) | Number of times the PAWS mechanism triggered a correction but out-of-window (OOW) rate limiting suppressed the ACK. |
| tcpext | netexporter_tcpext_tcpackskippedseq | TCP ACK Skipped due to Seq(times) | Number of times a correction was triggered because the sequence number was outside the window, but OOW rate limiting suppressed the ACK. |
| tcpext | netexporter_tcpext_tcpackskippedfinwait2 | TCP ACK Skipped in FIN_WAIT_2(times) | Number of ACKs for OOW packets in FIN_WAIT_2 state that were skipped because of rate limiting. |
| tcpext | netexporter_tcpext_tcpackskippedtimewait | TCP ACK Skipped in TIME_WAIT(times) | Number of ACKs for OOW packets in TIME_WAIT state that were skipped because of rate limiting. |
| tcpext | netexporter_tcpext_tcpackskippedchallenge | TCP ACK Skipped due to challenges(times) | Number of times a challenge ACK (typically used to confirm RST packets) was suppressed by OOW rate limiting. |
| tcpext | netexporter_tcpext_tcprcvqdrop | TCP RX Packets dropped(queue overflow) | Incremented when the TCP receive queue backs up and memory cannot be allocated for it. |
| tcpext | netexporter_tcpext_pawsestab | TCP Conns Using PAWS | Number of TCP connections established using the PAWS mechanism. |
| tcpext | netexporter_tcpext_tcpwinprobe | TCP Windows Probe times | Number of times the sender actively probed the receiver's window state. |
| tcpext | netexporter_tcpext_tcpkeepalive | TCP Keepalive Packets sent | Number of keepalive probe packets sent by the TCP keep-alive mechanism. |
| tcpext | netexporter_tcpext_tcpmtupfail | TCP MTU Probe fails(times) | Number of failed TCP path MTU probes. |
| tcpext | netexporter_tcpext_tcpmtupsuccess | TCP MTU Probe success(times) | Number of successful TCP path MTU probes. |
| tcpext | netexporter_tcpext_tcpzerowindowdrop | TCP Packets Dropped(Zero Window) | Number of packets dropped by the receiver because the window is zero. |
| tcpext | netexporter_tcpext_tcpbacklogdrop | TCP Packets Dropped(backlog queue) | Number of packets dropped because the per-socket backlog queue overflowed. |
| tcpext | netexporter_tcpext_pfmemallocdrop | TCP Packets Dropped(PF_MEMALLOC fail) | Number of packets dropped because the TCP layer could not allocate buffers under memory pressure. |
| tcpext | netexporter_tcpext_tcpwqueuetoobig | TCP Send Queue Too Big(times) | Number of times packets were dropped or could not be sent because the TCP send queue exceeded its capacity. |
| tcpext | netexporter_tcpext_embryonicrsts | TCP Half-open Conns reset | Number of times a TCP connection was terminated by a RST packet before the three-way handshake completed. |
| tcpext | netexporter_tcpext_tcpmemorypressures | TCP Memory Pressure times | Number of times the kernel protocol stack entered the memory-shortage state. |
| tcpext | netexporter_tcpext_tcpmemorypressureschrono | TCP Memory Pressure events | Cumulative time the TCP layer has spent in the memory pressure state. |
| sock | netexporter_sock_inuse | TCP Sockets In Using | Number of active TCP sockets (including listening sockets and established connections). Example: netexporter_sock_inuse{k8s_namespace="default",k8s_node="192.168.88.112",k8s_pod="nginx-1"} 18. Examples for the other metrics under this observation point are omitted. |
| sock | netexporter_sock_orphan | TCP Sockets orphaned | Number of orphaned TCP connections (not associated with any process), usually caused by abnormal disconnection. |
| sock | netexporter_sock_tw | TCP Sockets In TIME_WAIT | Number of sockets in TIME_WAIT state, used to ensure connections are closed reliably. |
| sock | netexporter_sock_alloc | TCP Sockets allocated | Total number of allocated TCP sockets (including sockets in use and those pending reclamation). |
| sock | netexporter_sock_mem | TCP Sockets Memory Pages | Total memory used by TCP sockets (in memory pages, typically 4 KB per page). |
| sock | netexporter_sock_inuse_udp | UDP Sockets In Using | Number of currently active UDP sockets. |
| sock | netexporter_sock_mem_udp | UDP Sockets Memory Pages | Memory used by UDP sockets. |
| softnetstat | netexporter_softnet_processed | Packets in CPU Backlog | Number of packets placed into and processed from the per-CPU backlog from the NIC, summed over all CPUs for a single Pod. Example: netexporter_softnet_processed{k8s_namespace="kube-system",k8s_node="192.168.88.112",k8s_pod="vpc-cni-scheduler-5d5d56bccc-55nrz"} 2.6343316e+07 |
| softnetstat | netexporter_softnet_dropped | Packets Dropped by CPU Backlog | Number of packets from the NIC that failed to be placed into the per-CPU backlog and were dropped, summed over all CPUs for a single Pod. Example: netexporter_softnet_dropped{k8s_namespace="kube-system",k8s_node="192.168.88.112",k8s_pod="vpc-cni-scheduler-5d5d56bccc-55nrz"} 0 |
| tcpsummary | netexporter_tcpsummary_tcpestablishedconn | Tcpsummary Established Conns | Total number of TCP connections currently in the established state. Example: netexporter_tcpsummary_tcpestablishedconn{namespace="kube-system",node="192.168.88.112",pod="vpc-cni-scheduler-5d5d56bccc-55nrz"} 2. Examples for the other metrics under this observation point are omitted. |
| tcpsummary | netexporter_tcpsummary_tcptimewaitconn | Tcpsummary TIME_WAIT Conns | Number of TCP connections currently in TIME_WAIT state. |
| tcpsummary | netexporter_tcpsummary_tcptxqueue | Tcpsummary TX Queue Bytes | Total bytes of data currently held in the send queues of TCP connections in ESTABLISHED state. |
| tcpsummary | netexporter_tcpsummary_tcprxqueue | Tcpsummary RX Queue Bytes | Total bytes of data currently held in the receive queues of TCP connections in ESTABLISHED state. |
| packetloss | netexporter_packetloss_netfilter | Packet Loss times(Netfilter) | Packets dropped at the Netfilter layer. Example: netexporter_packetloss_netfilter{k8s_node="192.168.88.112"} 1 |
| packetloss | netexporter_packetloss_otherhost | Packet Loss times(OtherHost) | Packets dropped because the destination MAC does not match the MAC of the receiving NIC. Example: netexporter_packetloss_otherhost{k8s_node="192.168.88.112"} 1 |
| packetloss | netexporter_packetloss_nosocket | Packet Loss times(No Socket) | Packets dropped because no matching socket was found. Example: netexporter_packetloss_nosocket{k8s_node="192.168.88.112"} 1 |
| packetloss | netexporter_packetloss_iprpfilter | Packet Loss times(IP rpfilter) | Packets dropped because rp_filter validation failed. Example: netexporter_packetloss_iprpfilter{k8s_node="192.168.88.112"} 1 |
| packetloss | netexporter_packetloss_nomem | Packet Loss times(OOM) | Packets dropped because of insufficient memory. Example: netexporter_packetloss_nomem{k8s_node="192.168.88.112"} 1 |
| packetloss | netexporter_packetloss_zerowindow | Packet Loss times(Zero Window) | Packets dropped because the TCP window is zero. Example: netexporter_packetloss_zerowindow{k8s_node="192.168.88.112"} 1 |
| packetloss | netexporter_packetloss_total | Packet Loss times(Total) | Total number of dropped packets. Example: netexporter_packetloss_total{k8s_node="192.168.88.112"} 100 |
| arp | netexporter_arp_maxentries | ARP Max Entries | Maximum number of ARP entries supported. |
| arp | netexporter_arp_entries | ARP Entries | Number of ARP entries that already exist. |
| arp | netexporter_arp_unresolveddiscards | ARP Unresolved Discards | Relates to the kernel queue length for unresolved address requests; if too many concurrent ARP requests exceed the queue length, ARP requests are dropped. |
| PodIP | netexporter_podip_maxentries | PodIP Max Entries | Maximum number of pod IPs supported on the node. |
| PodIP | netexporter_podip_entries | PodIP Used Number | Number of pod IPs already in use on the node. |
| PodIP | netexporter_podip_redisuals | PodIP Residuals Number | Number of residual (leftover) pod IPs on the node. |
| PodIP | netexporter_podip_notinpodcidrs | PodIP NotIn PodCIDRs Number | Number of pods whose IPs are not within the node's PodCIDR range. |
| accept queue | netexporter_sockbacklog_recvq | Socket RecvQ | Number of TCP connections that have completed the three-way handshake and are waiting for the server to call accept(); a persistently non-zero value indicates that the accept (fully established) queue is full. |
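The raw counters above are most useful when combined. As one illustration, the sketch below derives a conntrack-table utilization ratio from netexporter_conntrack_entries and netexporter_conntrack_maxentries. It is a minimal example only: the parser ignores labels and optional timestamps, and in practice the same ratio is more naturally computed in PromQL once the metrics are scraped.

```python
# Minimal sketch: derive a conntrack-table utilization ratio from the metrics
# documented above. The parsing is deliberately simple (skips HELP/TYPE lines,
# ignores label matching and optional timestamps); a real deployment would
# query netexporter_conntrack_entries / netexporter_conntrack_maxentries in PromQL.
def parse_samples(exposition_text: str) -> dict[str, float]:
    """Map 'metric{labels}' -> value for every sample line."""
    samples = {}
    for line in exposition_text.splitlines():
        if not line or line.startswith("#"):
            continue
        name, _, value = line.rpartition(" ")
        try:
            samples[name] = float(value)
        except ValueError:
            continue
    return samples

def conntrack_usage(samples: dict[str, float]) -> float:
    entries = sum(v for k, v in samples.items()
                  if k.startswith("netexporter_conntrack_entries"))
    max_entries = sum(v for k, v in samples.items()
                      if k.startswith("netexporter_conntrack_maxentries"))
    return entries / max_entries if max_entries else 0.0

demo = """netexporter_conntrack_entries{k8s_node="192.168.88.112"} 8000
netexporter_conntrack_maxentries{k8s_node="192.168.88.112"} 262144"""
print(f"conntrack usage: {conntrack_usage(parse_samples(demo)):.2%}")
```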
Event observation focuses on key events such as configuration and state anomalies related to cluster networking. It captures anomalies across kernel parameters, Kubernetes configuration, network components, routes, node configuration, MTU, and other dimensions in real time, and emits them as structured logs to help operators quickly detect and locate configuration and runtime issues. The observation points and their events are listed below:
Observation point | Event type | Event log example |
|---|---|---|
KernelParam | RPFilterKernelParam | {"timestamp":1741163909810866660,"type":"RPFilterKernelParam","msg":"net.ipv4.conf.calicb6db6f9b07.rp_filter:current_value=1,expect_value=0"} |
KernelParam | TCPSyncookiesKernelParam | {"timestamp":1741163909810866660,"type":"TCPSyncookiesKernelParam","msg":"net.ipv4.tcp_syncookies:current_value=0,expect_value=1"} |
KernelParam | LiberalKernelParam | {"timestamp":1741163909810866660,"type":"LiberalKernelParam","msg":"net.netfilter.nf_conntrack_tcp_be_liberal:current_value=0,expect_value=1"} |
KernelParam | TCPSackKernelParam | {"timestamp":1741163909810866660,"type":"TCPSackKernelParam","msg":"net.ipv4.tcp_sack:current_value=0,expect_value=1"} |
KernelParam | ProxyARPKernelParam | {"timestamp":1741163909810866660,"type":"ProxyARPKernelParam","msg":"net.ipv4.conf.eth0.proxy_arp:current_value=1,expect_value=0"} |
k8sconfig | NetworkPolicy | {"timestamp":1741163909810866660,"type":"NetworkPolicy","msg":"network policy is configured,please use it carefully."} |
k8sconfig | SVCTrafficPolicy | {"timestamp":1741163909810866660,"type":"LocalTrafficPolicy","msg":"externalTrafficPolicy of svc(ng-svc/default,...) is Local, internalTrafficPolicy of svc (ng-svc/default,...) is Local, please check it."} |
k8sconfig | IPMasqAgentConf | {"timestamp":1741163909810866660,"type":"IPMasqAgentConf","msg":"NonMasqueradeCIDRs does not contain vpc cidr, it will reduce pod network performance."} |
k8sconfig | CIDRConflict | {"timestamp":1741163909810866660,"type":"CIDRConflict","msg":"cidr conflicts:vpc cidr 172.31.0.0/16, svc cidr 10.254.0.0/16, pod cidr 192.168.0.0/16, docker cidr 172.31.0.0/16."} |
KubeProxy | KubeProxyAbnormal | {"timestamp":1741163909810866660,"type":"KubeProxy","msg":"kube-proxy looks abnormal, please check it."} |
IPVS | IPVSRulesAbnormal | {"timestamp":1741163909810866660,"type":"IPVS","msg":"svc(kube-system/nginx-11222,...) ipvs rules are abnormal, please check it."} |
route | VpcRoute | {"timestamp":1741163909810866660,"type":"VpcRoute","msg":"vpc route does not exist for this node."} {"timestamp":1741163909810866660,"type":"VpcRoute","msg":"vpc route creates failed because of conflict."} |
route | PodRoute | {"timestamp":1741163909810866660,"type":"PodRoute","msg":"lack routes of pods(ng/default,...)."} |
route | DefaultRoute | {"timestamp":1741163909810866660,"type":"DefaultRoute","msg":"default route of pod don't exist(ng-1/default,...)"} {"timestamp":1741163909810866660,"type":"DefaultRoute","msg":"default route of node don't exist(nodeIp)."} |
route | PolicyRoute | {"timestamp":1741163909810866660,"type":"PolicyRoute","msg":"rule of vpc-cni pod vpc-cni-ng-0/default is not created."} |
nodeconfig | KernelLog | {"timestamp":1741163909810866660,"type":"KernelLog","msg":"['neighbour neighbour: arp_cache: neighbor table overflow!']"} |
nodeconfig | IPtablesAbnormal | {"timestamp":1741163909810866660,"type":"IPtablesAbnormal","msg":"FORWARD drop default."} |
nodeconfig | IRQBalance | {"timestamp":1741163909810866660,"type":"IRQBalance","msg":"system irqbalance service is running, please stop it and use ksc-queue."} |
mtu | MTUAbnormal | {"timestamp":1741163909810866660,"type":"MTUAbnormal","msg":"eth0 mtu is 900 which is abnormal, please check."} |
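Whether the events are read from stdout or queried back from Loki, each record is a single JSON object per line, so they are straightforward to post-process. The sketch below is a minimal illustration (the input file name net-exporter-events.log is hypothetical) that filters out the kernel-parameter anomalies shown above.

```python
# Minimal sketch: filter the structured event logs (one JSON object per line)
# and surface the kernel-parameter anomalies. The file name is hypothetical.
import json

def load_events(path: str) -> list[dict]:
    events = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                events.append(json.loads(line))
            except json.JSONDecodeError:
                continue  # skip non-JSON noise
    return events

def kernel_param_events(events: list[dict]) -> list[dict]:
    """Keep events whose type ends with 'KernelParam', e.g. RPFilterKernelParam."""
    return [e for e in events if e.get("type", "").endswith("KernelParam")]

if __name__ == "__main__":
    for evt in kernel_param_events(load_events("net-exporter-events.log")):
        print(evt["timestamp"], evt["type"], evt["msg"])
```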
The tool is managed as a cluster component, so deployment is simple: open the corresponding component management page and click Install to complete the deployment.
The observability tool can be integrated with a self-managed Prometheus or with KCE managed Prometheus for unified metric collection and analysis. The following uses KCE managed Prometheus as an example:
1. Log in to the KCE console and open the management page of the target managed Prometheus instance;
2. On the new scrape configuration tab, fill in the parameters (such as the scrape targets and metrics path) by following the corresponding configuration example;
3. Submit the configuration and wait for it to be delivered and take effect to complete the integration.
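Once the configuration has taken effect, a quick way to confirm that net-exporter series are arriving is an instant query against the Prometheus HTTP API, as in the sketch below. This is a minimal illustration only; the Prometheus address is a placeholder and should be replaced with the query endpoint of your self-managed or KCE managed Prometheus instance.

```python
# Minimal sketch: confirm from the Prometheus HTTP API that net-exporter series
# are present after the scrape configuration takes effect.
# NOTE: the Prometheus address below is an assumption; replace it with your
# actual query endpoint.
import json
from urllib.parse import urlencode
from urllib.request import urlopen

PROM_URL = "http://prometheus.example.com:9090"  # hypothetical endpoint

def instant_query(expr: str) -> list[dict]:
    """Run a PromQL instant query and return the result series."""
    qs = urlencode({"query": expr})
    with urlopen(f"{PROM_URL}/api/v1/query?{qs}", timeout=5) as resp:
        body = json.load(resp)
    return body.get("data", {}).get("result", [])

if __name__ == "__main__":
    # Any documented metric works as a smoke test, e.g. conntrack entries per node.
    for series in instant_query("netexporter_conntrack_entries"):
        print(series["metric"].get("k8s_node"), series["value"][1])
```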