Ambari Alert Information

Last updated: 2018-09-13 19:01:51

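Alert levels

Level | Meaning | Description
OK | OK | The cluster is running well
WARNING | Warning | A cluster metric has reached a threshold and needs attention
CRITICAL | Critical | The cluster may have a problem and needs to be handled
UNKNOWN | Unknown | The status is unknown
NONE | |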

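Alert types

Type | Purpose | Alert levels | Threshold configurable | Unit
PORT | Checks whether a port on a machine is available | OK, WARN, CRIT | |
METRIC | Monitors configuration properties related to a metric | OK, WARN, CRIT | | Variable
AGGREGATE | Collects the status of certain other alerts | OK, WARN, CRIT | | Percent
WEB | Checks whether a web UI (URL) address is available | OK, WARN, CRIT | |
SCRIPT | The check logic is executed by a custom Python script | OK, WARN, CRIT | |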
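
To make the PORT type more concrete, here is a minimal Python sketch of a port availability check. It is not Ambari's implementation. The function names and the `warn_seconds` and `crit_seconds` thresholds are hypothetical values chosen for illustration.

```python
import socket
import time


def classify_connect(connected, elapsed, warn_seconds, crit_seconds):
    """Map a connection attempt to an alert level (illustrative rules only)."""
    if not connected or elapsed >= crit_seconds:
        return "CRITICAL"
    if elapsed >= warn_seconds:
        return "WARNING"
    return "OK"


def check_port(host, port, warn_seconds=1.5, crit_seconds=5.0):
    """Try a TCP connection and classify the result.

    warn_seconds and crit_seconds are hypothetical thresholds, not Ambari defaults.
    """
    start = time.monotonic()
    try:
        # The timeout bounds how long the attempt may take.
        with socket.create_connection((host, port), timeout=crit_seconds):
            connected = True
    except OSError:
        # Refused, unreachable, or timed out.
        connected = False
    elapsed = time.monotonic() - start
    return classify_connect(connected, elapsed, warn_seconds, crit_seconds)
```

For example, `check_port("127.0.0.1", 8080)` would report CRITICAL if nothing is listening on that port.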

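Alert status

The alert status list is laid out as follows; each field is explained in the table.

[Screenshot: alert status list]

Field | Meaning | Notes
Service | The service; select a specific component within it to view its alert status |
Host | Host ID; shows the ID of the virtual machine that raised the alert |
Status | The status, one of the five alert levels listed above |
24-Hour | Alert duration |
Response | The response, that is, the alert details | Click to show the full alert content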

Ambari alert descriptions

Note: below, "warning" refers to the WARNING level and "critical" refers to the CRITICAL level.

HDFS

Alert Definition Name | Description | CRITICAL default
HDFS Storage Capacity Usage (Weekly) | This service-level alert is triggered if the increase in storage capacity usage deviation has grown beyond the specified threshold within a week period. | 20%
DataNode Unmounted Data Dir | This host-level alert is triggered if one of the data directories on a host was previously on a mount point and became unmounted. If the mount history file does not exist, then report an error if a host has one or more mounted data directories as well as one or more unmounted data directories on the root partition. This may indicate that a data directory is writing to the root partition, which is undesirable. | 2 minutes
JournalNode Web UI | This host-level alert is triggered if the JournalNode Web UI is unreachable. | Connection failed to {1} ({3})
DataNode Process | This host-level alert is triggered if the individual DataNode processes cannot be established to be up and listening on the network. | 5
DataNode Web UI | This host-level alert is triggered if the DataNode Web UI is unreachable. | Connection failed to {1} ({3})
DataNode Storage | This host-level alert is triggered if storage capacity is full on the DataNode. It checks the DataNode JMX Servlet for the Capacity and Remaining properties. The threshold values are in percent. | 80%
DataNode Heap Usage | This host-level alert is triggered if heap usage goes past thresholds on the DataNode. It checks the DataNode JMX Servlet for the MemHeapUsedM and MemHeapMaxM properties. The threshold values are in percent. | 90%
HDFS Pending Deletion Blocks | This service-level alert is triggered if the number of blocks pending deletion in HDFS exceeds the configured warning and critical thresholds. It checks the NameNode JMX Servlet for the PendingDeletionBlock property. | 100000
NameNode Client RPC Queue Latency (Daily) | This service-level alert is triggered if the deviation of RPC latency on client port has grown beyond the specified threshold within a day period. | 200%
DataNode Health Summary | This service-level alert is triggered if there are unhealthy DataNodes. | 1
HDFS Upgrade Finalized State | This service-level alert is triggered if HDFS is not in the finalized state. | 1
NameNode Client RPC Processing Latency (Daily) | This service-level alert is triggered if the deviation of RPC latency on client port has grown beyond the specified threshold within a day period. | 200%
NameNode Blocks Health | This service-level alert is triggered if there are unhealthy DataNodes. | 1
NameNode Web UI | This host-level alert is triggered if the NameNode Web UI is unreachable. | Connection failed to {1} ({3})
NameNode Heap Usage (Daily) | This service-level alert is triggered if the NameNode heap usage deviation has grown beyond the specified threshold within a day period. | 50%
NameNode Last Checkpoint | This service-level alert will trigger if the last time that the NameNode performed a checkpoint was too long ago. It will also trigger if the number of uncommitted transactions is beyond a certain threshold. | 200%
NameNode RPC Latency | This host-level alert is triggered if the NameNode RPC latency exceeds the configured critical threshold. Typically an increase in the RPC processing time increases the RPC queue length, causing the average queue wait time to increase for NameNode operations. The threshold values are in milliseconds. | 500
HDFS Storage Capacity Usage (Daily) | This service-level alert is triggered if the increase in storage capacity usage deviation has grown beyond the specified threshold within a day period. | 50%
NameNode Client RPC Queue Latency (Hourly) | This service-level alert is triggered if the deviation of RPC queue latency on client port has grown beyond the specified threshold within an hour period. | 200%
HDFS Capacity Utilization | This service-level alert is triggered if the HDFS capacity utilization exceeds the configured warning and critical thresholds. It checks the NameNode JMX Servlet for the CapacityUsed and CapacityRemaining properties. The threshold values are in percent. | 80%
NameNode Heap Usage (Weekly) | This service-level alert is triggered if the NameNode heap usage deviation has grown beyond the specified threshold within a week period. | 50%
NameNode Directory Status | This host-level alert is triggered if the NameNode NameDirStatuses metric (name=NameNodeInfo/NameDirStatuses) reports a failed directory. The threshold values are in the number of directories that are not healthy. | 1
NameNode Client RPC Latency (Hourly) | This service-level alert is triggered if the deviation of RPC latency on client port has grown beyond the specified threshold within an hour period. | 200%
NameNode Host CPU Utilization | This host-level alert is triggered if CPU utilization of the NameNode exceeds certain warning and critical thresholds. It checks the NameNode JMX Servlet for the SystemCPULoad property. The threshold values are in percent. | 250%
ZooKeeper Failover Controller Process | This host-level alert is triggered if the ZooKeeper Failover Controller process cannot be confirmed to be up and listening on the network. | 6
NameNode High Availability Health | This service-level alert is triggered if either the Active NameNode or the Standby NameNode is not running. | 1
Percent DataNodes Available | This alert is triggered if the number of down DataNodes in the cluster is greater than the configured critical threshold. It aggregates the results of DataNode process checks. | 30%
Percent DataNodes With Available Space | This service-level alert is triggered if the storage on a certain percentage of DataNodes exceeds either the warning or critical threshold values. | 30%
Percent JournalNodes Available | This alert is triggered if the number of down JournalNodes in the cluster is greater than the configured critical threshold. It aggregates the results of JournalNode process checks. | 50%
NameNode Service RPC Processing Latency (Hourly) | This service-level alert is triggered if the deviation of RPC latency on datanode port has grown beyond the specified threshold within an hour period. | 200%
NameNode Service RPC Queue Latency (Hourly) | This service-level alert is triggered if the deviation of RPC queue latency on datanode port has grown beyond the specified threshold within an hour period. | 200%
NameNode Service RPC Queue Latency (Daily) | This service-level alert is triggered if the deviation of RPC latency on datanode port has grown beyond the specified threshold within a day period. | 200%
Secondary NameNode Process | This host-level alert is triggered if the Secondary NameNode process cannot be confirmed to be up and listening on the network. | Connection failed to {1} ({3})
NFS Gateway Process | This host-level alert is triggered if the NFS Gateway process cannot be confirmed to be up and listening on the network. | 5
NameNode Service RPC Processing Latency (Daily) | This service-level alert is triggered if the deviation of RPC latency on datanode port has grown beyond the specified threshold within a day period. | 200%
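
Several HDFS alerts above compare a percentage against warning and critical thresholds, for example DataNode Storage, which reads the Capacity and Remaining properties. The sketch below shows that arithmetic in Python. It is not Ambari's code. The 80% critical value comes from the table, while the function names and the 75% warning value are assumptions made for illustration.

```python
def percent_used(capacity, remaining):
    """Return the used share of capacity as a percentage."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    return 100.0 * (capacity - remaining) / capacity


def classify_percent(value, warning=75.0, critical=80.0):
    """Map a percentage to an alert level.

    critical=80.0 matches the DataNode Storage default in the table;
    warning=75.0 is a hypothetical value.
    """
    if value >= critical:
        return "CRITICAL"
    if value >= warning:
        return "WARNING"
    return "OK"


# A DataNode with 1000 units of capacity and 150 remaining is 85% full.
state = classify_percent(percent_used(1000, 150))
```

The same shape applies to DataNode Heap Usage, with MemHeapUsedM divided by MemHeapMaxM and a 90% critical threshold.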

YARN

Alert Definition Name | Description | CRITICAL default
NodeManager Web UI | This host-level alert is triggered if the NodeManager Web UI is unreachable. | Connection failed to {1} ({3})
NodeManager Health | This host-level alert checks the node health property available from the NodeManager component. | 1
ResourceManager Web UI | This host-level alert is triggered if the ResourceManager Web UI is unreachable. | Connection failed to {1} ({3})
ResourceManager CPU Utilization | This host-level alert is triggered if CPU utilization of the ResourceManager exceeds certain warning and critical thresholds. It checks the ResourceManager JMX Servlet for the SystemCPULoad property. The threshold values are in percent. | 250%
ResourceManager RPC Latency | This host-level alert is triggered if the ResourceManager operations RPC latency exceeds the configured critical threshold. Typically an increase in the RPC processing time increases the RPC queue length, causing the average queue wait time to increase for ResourceManager operations. The threshold values are in milliseconds. | 5000
NodeManager Health Summary | This service-level alert is triggered if there are unhealthy NodeManagers. | 1
Percent NodeManagers Available | This alert is triggered if the number of down NodeManagers in the cluster is greater than the configured critical threshold. It aggregates the results of NodeManager process checks. | 30%
App Timeline Web UI | This host-level alert is triggered if the App Timeline Server Web UI is unreachable. | Connection failed to {1} ({3})
Failed Apps Check | This service-level alert is triggered if the number of failed YARN apps is beyond the specified threshold within a given time span. | 2
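
Percent NodeManagers Available is an aggregate alert: it rolls up the per-host process checks into one percentage. The sketch below shows one way such a roll-up could work. It is an illustration only. The 30% critical value comes from the table, while the function name, the 10% warning value, and the input format are assumptions.

```python
def aggregate_state(host_states, warning_percent=10.0, critical_percent=30.0):
    """Aggregate per-host process alert states into one level.

    host_states maps a host name to the state of its process check.
    critical_percent=30.0 matches the table; warning_percent is hypothetical.
    """
    if not host_states:
        return "UNKNOWN", 0.0
    # A host counts as down when its process check is CRITICAL.
    down = sum(1 for state in host_states.values() if state == "CRITICAL")
    percent_down = 100.0 * down / len(host_states)
    if percent_down > critical_percent:
        return "CRITICAL", percent_down
    if percent_down > warning_percent:
        return "WARNING", percent_down
    return "OK", percent_down


# Ten hosts, four of them down: 40% is above the 30% critical threshold.
hosts = {f"node{i}": ("CRITICAL" if i < 4 else "OK") for i in range(10)}
level, percent = aggregate_state(hosts)
```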

MapReduce2

Alert Definition Name | Description | CRITICAL default
History Server Web UI | This host-level alert is triggered if the History Server Web UI is unreachable. | Connection failed to {1} ({3})
History Server RPC Latency | This host-level alert is triggered if the History Server operations RPC latency exceeds the configured critical threshold. Typically an increase in the RPC processing time increases the RPC queue length, causing the average queue wait time to increase for operations. The threshold values are in milliseconds. | 5000
History Server CPU Utilization | This host-level alert is triggered if the percent of CPU utilization on the History Server exceeds the configured critical threshold. The threshold values are in percent. | 250%
History Server Process | This host-level alert is triggered if the History Server process cannot be established to be up and listening on the network. | 5
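
Some CRITICAL defaults in these tables are message templates such as `Connection failed to {1} ({3})` rather than numbers. The placeholders look like positional format fields. Assuming they are filled in the style of Python's `str.format`, the sketch below shows how fields `{1}` and `{3}` would be substituted. The argument order and the sample values are hypothetical and are not taken from Ambari.

```python
def render_alert_text(template, *args):
    """Fill positional placeholders such as {1} and {3} in a template."""
    return template.format(*args)


# Hypothetical arguments: status code, URL, response time, error text.
args = (0, "http://host.example:19888", 0.0, "timed out")
text = render_alert_text("Connection failed to {1} ({3})", *args)
# text == "Connection failed to http://host.example:19888 (timed out)"
```

Fields that the template does not reference, here positions 0 and 2, are simply ignored.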

Hive

Alert Definition Name | Description | CRITICAL default
WebHCat Server Status | This host-level alert is triggered if the templeton server status is not healthy. | 5
HiveServer2 Interactive Process | This host-level alert is triggered if the HiveServerInteractive cannot be determined to be up and responding to client requests. | 60
Hive MetaStore Process | This host-level alert is triggered if the Hive Metastore process cannot be determined to be up and listening on the network. | 60
LLAP Application | This alert is triggered if the LLAP Application cannot be determined to be up and responding to requests. | 120
HiveServer2 Process | This host-level alert is triggered if the HiveServer cannot be determined to be up and responding to client requests. | 60

HBase

Alert Definition Name | Description | CRITICAL default
HBase RegionServer Process | This host-level alert is triggered if the RegionServer processes cannot be confirmed to be up and listening on the network for the configured critical threshold, given in seconds. | 5
HBase Master Process | This alert is triggered if the HBase master processes cannot be confirmed to be up and listening on the network for the configured critical threshold, given in seconds. | 5
Percent RegionServers Available | This service-level alert is triggered if the configured percentage of RegionServer processes cannot be determined to be up and listening on the network for the configured warning and critical thresholds. It aggregates the results of RegionServer process down checks. | 30%
HBase Master CPU Utilization | This host-level alert is triggered if CPU utilization of the HBase Master exceeds certain warning and critical thresholds. It checks the HBase Master JMX Servlet for the SystemCPULoad property. The threshold values are in percent. | 250%

Oozie

Alert Definition Name | Description | CRITICAL default
Oozie Server Web UI | This host-level alert is triggered if the Oozie server Web UI is unreachable. | Connection failed to {1} ({3})
Oozie Server Status | This host-level alert is triggered if the Oozie server cannot be determined to be up and responding to client requests. | 1

ZooKeeper

Alert Definition Name | Description | CRITICAL default
ZooKeeper Server Process | This host-level alert is triggered if the ZooKeeper server process cannot be determined to be up and listening on the network. | 5
Percent ZooKeeper Servers Available | This alert is triggered if the number of down ZooKeeper servers in the cluster is greater than the configured critical threshold. It aggregates the results of ZooKeeper process checks. | 70%

Storm

Alert Definition Name | Description | CRITICAL default
Supervisor Process | | 5
Percent Supervisors Available | | 30%
Storm Web UI | | Connection failed to {1} ({3})
Nimbus Process | | 5
DRPC Server Process | | 5

Kafka

Alert Definition Name | Description | CRITICAL default
Kafka Broker Process | This host-level alert is triggered if the Kafka Broker cannot be determined to be up. | 5

Spark2

Alert Definition Name | Description | CRITICAL default
Spark2 Livy Server | This host-level alert is triggered if the Livy2 Server cannot be determined to be up. | 60
Spark2 History Server | This host-level alert is triggered if the Spark2 History Server cannot be determined to be up. | 5
Spark2 Thrift Server | This host-level alert is triggered if the Spark2 Thrift Server cannot be determined to be up. | 60

ElasticSearch

Alert Definition Name | Description | CRITICAL default
ElasticSearch Process Check | This host-level alert is triggered if the ElasticSearch Master cannot be determined to be up. | 5

Hue

Alert Definition Name | Description | CRITICAL default
Hue Web UI | This host-level alert is triggered if the Hue Web UI is unreachable. | 5

Ambari

Alert Definition Name | Description | CRITICAL default
Host Disk Usage | This host-level alert is triggered if the amount of disk space used goes above specific thresholds. The default threshold values are 50% for WARNING and 80% for CRITICAL. | 80%
Ambari Agent Distro/conf Select Versions | This host-level alert is triggered if the distro selector such as hdp-select cannot calculate versions available on this host. This may indicate that the /usr/$stack/ directory has links/dirs that do not belong inside of it. | 5
Host Disk Usage For Dir '/' | This host-level alert is triggered if the amount of disk space used goes above specific thresholds. The default threshold values are 80% for WARNING and 90% for CRITICAL. | 5.0E9 bytes
Host Disk Usage For Dir '/mnt' | This host-level alert is triggered if the amount of disk space used goes above specific thresholds. The default threshold values are 80% for WARNING and 90% for CRITICAL. | 5.0E9 bytes
Ambari Agent Heartbeat | This alert is triggered if the server has lost contact with an agent. | 2
Ambari Server Alerts | This alert is triggered if the server detects that there are alerts which have not run in a timely manner. | 2
Ambari Server Performance | This alert is triggered if the Ambari Server detects that there is a potential performance problem with Ambari. This type of issue can arise for many reasons, but is typically attributed to slow database queries and host resource exhaustion. | 5000
Component Version | This alert is triggered if the server detects that there is a problem with the expected and reported version of a component. The alert is suppressed automatically during an upgrade. | 5

Ambari Metrics

Alert Definition Name | Description | CRITICAL default
Metrics Monitor Status | This alert indicates the status of the Metrics Monitor process as determined by the monitor status script. | 1
Metrics Collector Process | This alert is triggered if the Metrics Collector cannot be confirmed to be up and listening on the configured port for a number of seconds equal to the threshold. | 5
Metrics Collector - HBase CPU Utilization | This host-level alert is triggered if CPU utilization of the Metrics Collector's HBase Master exceeds certain warning and critical thresholds. It checks the HBase Master JMX Servlet for the SystemCPULoad property. The threshold values are in percent. | 250%
Metrics Collector - Auto-Restart Status | This alert is triggered if the Metrics Collector has been restarted automatically too frequently in the last hour. By default, a WARNING alert is triggered if it restarted twice in one hour and a CRITICAL alert is triggered if it restarted 4 or more times in one hour. | Metrics Collector has been auto-started {1} times{0}.
Percent Metrics Monitors Available | This alert is triggered if a percentage of Metrics Monitor processes are not up and listening on the network for the configured warning and critical thresholds. | 30%
Metrics Collector - HBase Master Process | This alert is triggered if the Metrics Collector's HBase master processes cannot be confirmed to be up and listening on the network for the configured critical threshold, given in seconds. | 5
Grafana Web UI | This host-level alert is triggered if the Grafana Web UI is unreachable. | 5
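
The Auto-Restart Status rule above (WARNING at two restarts in an hour, CRITICAL at four or more) can be sketched in a few lines of Python. This is not the Ambari script itself. The function names are hypothetical, and treating three restarts as WARNING is an assumption, since the description only names the values two and four.

```python
from datetime import datetime, timedelta


def restarts_in_last_hour(restart_times, now):
    """Count restart timestamps that fall within the hour before now."""
    cutoff = now - timedelta(hours=1)
    return sum(1 for t in restart_times if cutoff < t <= now)


def auto_restart_state(restart_count):
    """Map a restart count to a level: 4 or more CRITICAL, 2 or 3 WARNING."""
    if restart_count >= 4:
        return "CRITICAL"
    if restart_count >= 2:
        return "WARNING"
    return "OK"


now = datetime(2018, 9, 13, 19, 0)
restarts = [
    datetime(2018, 9, 13, 17, 30),  # older than one hour, not counted
    datetime(2018, 9, 13, 18, 10),
    datetime(2018, 9, 13, 18, 40),
]
state = auto_restart_state(restarts_in_last_hour(restarts, now))
```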
