[feature] Add Apache Hbase RegionServer monitoring #1833

Merged: 6 commits, Apr 25, 2024
96 changes: 96 additions & 0 deletions home/docs/help/hbase_regionserver.md
@@ -0,0 +1,96 @@
---
id: hbase_regionserver
title: Monitoring HBase RegionServer
sidebar_label: HBase RegionServer Monitoring
keywords: [Open-source monitoring system, Open-source database monitoring, RegionServer monitoring]
---
> Collect and monitor common performance metrics for HBase RegionServer.

**Protocol:** HTTP

## Pre-Monitoring Operations

Review the `hbase-site.xml` file to obtain the value of the `hbase.regionserver.info.port` configuration item, which is used for monitoring.
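
As a minimal sketch of this lookup, the snippet below reads the property from the contents of an `hbase-site.xml` file. The helper name `read_info_port` and the sample XML are illustrative assumptions; the fallback of 16030 is the default port stated in the parameter table of this document.

```python
import xml.etree.ElementTree as ET

# Default port as stated in this document's parameter table.
DEFAULT_INFO_PORT = 16030


def read_info_port(xml_text: str, default: int = DEFAULT_INFO_PORT) -> int:
    """Return hbase.regionserver.info.port from hbase-site.xml content.

    Falls back to `default` when the property is absent or empty.
    (Hypothetical helper, not part of HBase or HertzBeat.)
    """
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name", "").strip() == "hbase.regionserver.info.port":
            value = prop.findtext("value", "").strip()
            if value:
                return int(value)
    return default


# Made-up sample configuration for illustration only.
sample = """<configuration>
  <property>
    <name>hbase.regionserver.info.port</name>
    <value>16031</value>
  </property>
</configuration>"""

print(read_info_port(sample))
```

In practice the XML text would come from the `hbase-site.xml` file on the RegionServer host; the file's location depends on the installation.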

## Configuration Parameters


| Parameter Name | Parameter Description |
| ------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------- |
| Target Host | The IPv4 address, IPv6 address, or domain name of the monitored host. Note ⚠️ Do not include the protocol header (e.g., https://, http://). |
| Port | The port number of the HBase RegionServer, i.e., the value of the `hbase.regionserver.info.port` parameter. The default is 16030. |
| Task Name | A unique name that identifies this monitoring task. |
| Query Timeout | The timeout for the query request, in milliseconds. The default is 3000 ms. |
| Collection Interval | The interval between periodic data collections, in seconds. The minimum interval is 30 seconds. |
| Probe Before Adding | Whether to probe the target for availability before adding the monitor. The monitor is added only if the probe succeeds. |
| Description Note | Optional notes that identify and describe this monitoring task. |
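
To illustrate how the Target Host and Port values combine, here is a hedged sketch that builds a request URL and rejects a host that carries a protocol header. The function name `build_jmx_url` is hypothetical, and the `/jmx` path is an assumption based on the JSON metrics endpoint that the HBase info server exposes; this document does not state which path the collector requests.

```python
def build_jmx_url(host: str, port: int = 16030) -> str:
    """Build an HTTP URL for a RegionServer info port.

    Hypothetical helper: the "/jmx" path is an assumption, not taken
    from this document.
    """
    host = host.strip()
    if "://" in host:
        # The parameter table says the host must not include a protocol header.
        raise ValueError("Target Host must not include a protocol header")
    if not 0 < port < 65536:
        raise ValueError("Port must be between 1 and 65535")
    # Bare IPv6 literals need brackets inside a URL.
    if ":" in host and not host.startswith("["):
        host = f"[{host}]"
    return f"http://{host}:{port}/jmx"


print(build_jmx_url("10.0.0.5"))
print(build_jmx_url("::1", 16031))
```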

### Collection Metrics

> All metric names are taken directly from the official HBase fields, so some names do not follow a consistent naming convention.

#### Metric Set: server


| Metric Name | Unit | Metric Description |
| --------------------------------- | ----- | ------------------------------------------------------------------------- |
| regionCount | None | Number of Regions |
| readRequestCount | None | Number of read requests since cluster restart |
| writeRequestCount | None | Number of write requests since cluster restart |
| averageRegionSize | MB | Average size of a Region |
| totalRequestCount | None | Total number of requests |
| ScanTime_num_ops | None | Total number of Scan requests |
| Append_num_ops | None | Total number of Append requests |
| Increment_num_ops | None | Total number of Increment requests |
| Get_num_ops | None | Total number of Get requests |
| Delete_num_ops | None | Total number of Delete requests |
| Put_num_ops | None | Total number of Put requests |
| ScanTime_mean | None | Average time of a Scan request |
| ScanTime_min | None | Minimum time of a Scan request |
| ScanTime_max | None | Maximum time of a Scan request |
| ScanSize_mean | bytes | Average size of a Scan request |
| ScanSize_min | bytes | Minimum size of a Scan request |
| ScanSize_max | bytes | Maximum size of a Scan request |
| slowPutCount | None | Number of slow Put operations |
| slowGetCount | None | Number of slow Get operations |
| slowAppendCount | None | Number of slow Append operations |
| slowIncrementCount | None | Number of slow Increment operations |
| slowDeleteCount | None | Number of slow Delete operations |
| blockCacheSize | None | Size of memory used by block cache |
| blockCacheCount | None | Number of blocks in Block Cache |
| blockCacheExpressHitPercent | None | Block cache hit ratio |
| memStoreSize | None | Size of Memstore |
| FlushTime_num_ops | None | Number of MemStore flushes to disk performed by the RegionServer |
| flushQueueLength | None | Length of Region Flush queue |
| flushedCellsSize | None | Size flushed to disk |
| storeFileCount | None | Number of Storefiles |
| storeCount | None | Number of Stores |
| storeFileSize | None | Size of Storefiles |
| compactionQueueLength | None | Length of Compaction queue |
| percentFilesLocal | None | Percentage of HFiles stored on the local HDFS DataNode |
| percentFilesLocalSecondaryRegions | None | Percentage of HFiles for secondary region replicas stored on the local HDFS DataNode |
| hlogFileCount | None | Number of WAL files |
| hlogFileSize | None | Size of WAL files |
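
The sketch below shows one way to pull a few of these fields out of a JMX-style JSON payload. It assumes the payload has a top-level `beans` array and that the server metrics live in a bean named `Hadoop:service=HBase,name=RegionServer,sub=Server`; the helper `pick_metrics` and the sample values are made up for illustration and are not taken from this pull request.

```python
import json
from typing import Any, Dict, Iterable

# Assumed bean name for the "server" metric set in the JMX JSON output.
SERVER_BEAN = "Hadoop:service=HBase,name=RegionServer,sub=Server"


def pick_metrics(payload: Dict[str, Any], bean_name: str,
                 fields: Iterable[str]) -> Dict[str, Any]:
    """Return the requested fields from the first bean whose name matches.

    Missing fields map to None; an unknown bean yields an empty dict.
    """
    for bean in payload.get("beans", []):
        if bean.get("name") == bean_name:
            return {field: bean.get(field) for field in fields}
    return {}


# Made-up sample payload; real responses contain many more beans and fields.
sample = json.loads("""
{"beans": [
  {"name": "Hadoop:service=HBase,name=RegionServer,sub=Server",
   "regionCount": 12, "Get_num_ops": 1500, "slowGetCount": 3}
]}
""")

server = pick_metrics(sample, SERVER_BEAN,
                      ["regionCount", "Get_num_ops", "slowGetCount"])
print(server)
```

Keeping the parsing separate from the HTTP request makes the extraction easy to check against a saved response.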

#### Metric Set: IPC


| Metric Name | Unit | Metric Description |
| ------------------------- | ---- | -------------------------------------- |
| numActiveHandler | None | Number of active RPC handlers |
| NotServingRegionException | None | Number of NotServingRegionException errors thrown |
| RegionMovedException | None | Number of RegionMovedException errors thrown |
| RegionTooBusyException | None | Number of RegionTooBusyException errors thrown |

#### Metric Set: JVM


| Metric Name | Unit | Metric Description |
| -------------------- | ---- | --------------------------------- |
| MemNonHeapUsedM | MB | Non-heap memory currently used by the JVM |
| MemNonHeapCommittedM | MB | Non-heap memory committed to the JVM |
| MemHeapUsedM | MB | Heap memory currently used by the JVM |
| MemHeapCommittedM | MB | Heap memory committed to the JVM |
| MemHeapMaxM | MB | Maximum heap memory available to the JVM |
| MemMaxM | MB | Maximum memory available to the JVM |
| GcCount | None | Total number of garbage collections |
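
As a small worked example of using these fields, the following sketch derives a heap usage percentage from `MemHeapUsedM` and `MemHeapMaxM`, both of which Hadoop's JVM metrics report in megabytes (the `M` suffix). The function name and sample values are hypothetical.

```python
from typing import Mapping, Optional


def heap_usage_percent(jvm: Mapping[str, float]) -> Optional[float]:
    """Return heap usage as a percentage of the maximum heap.

    Returns None when either value is missing or the maximum is not positive.
    (Hypothetical helper for illustration.)
    """
    used = jvm.get("MemHeapUsedM")
    limit = jvm.get("MemHeapMaxM")
    if used is None or limit is None or limit <= 0:
        return None
    return round(100.0 * used / limit, 1)


# Made-up values for illustration only.
sample_jvm = {"MemHeapUsedM": 512.0, "MemHeapMaxM": 2048.0}
print(heap_usage_percent(sample_jvm))  # 25.0
```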
@@ -1,7 +1,7 @@
---
id: hbase_master
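title: Monitoring Hbase Master
sidebar_label: HbaseMaster Monitoring
sidebar_label: Apache Hbase Master
keywords: [Open-source monitoring system, Open-source database monitoring, HbaseMaster monitoring]
---
> Collect and monitor common performance metrics for Hbase Master.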
@@ -0,0 +1,97 @@
---
id: hbase_regionserver
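title: Monitoring Hbase RegionServer
sidebar_label: Apache Hbase RegionServer
keywords: [Open-source monitoring system, Open-source database monitoring, RegionServer monitoring]
---
> Collect and monitor common performance metrics for Hbase RegionServer.

**Protocol:** HTTP

## Pre-Monitoring Operations

Review the `hbase-site.xml` file to obtain the value of the `hbase.regionserver.info.port` configuration item, which is used for monitoring.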

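## Configuration Parameters


| Parameter Name | Parameter Description |
| ------------------- | ------------------------------------------------------------------- |
| Target Host | The IPv4 address, IPv6 address, or domain name of the monitored host. Note ⚠️ Do not include the protocol header (e.g., https://, http://). |
| Port | The port number of the HBase RegionServer, i.e., the value of the `hbase.regionserver.info.port` parameter. The default is 16030. |
| Task Name | A name that identifies this monitoring task. The name must be unique. |
| Query Timeout | The timeout for the query request, in milliseconds. The default is 3000 ms. |
| Collection Interval | The interval between periodic data collections, in seconds. The minimum interval is 30 seconds. |
| Probe Before Adding | Whether to probe the target for availability before adding the monitor. The add or modify operation proceeds only if the probe succeeds. |
| Description Note | Optional notes that identify and describe this monitoring task. |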

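### Collection Metrics

> All metric names are taken directly from the official HBase fields, so some names do not follow a consistent naming convention.

#### Metric Set: server


| Metric Name | Unit | Metric Description |
| --------------------------------- | ----- | ---------------------------------------- |
| regionCount | None | Number of Regions |
| readRequestCount | None | Number of read requests since cluster restart |
| writeRequestCount | None | Number of write requests since cluster restart |
| averageRegionSize | MB | Average size of a Region |
| totalRequestCount | None | Total number of requests |
| ScanTime_num_ops | None | Total number of Scan requests |
| Append_num_ops | None | Number of Append requests |
| Increment_num_ops | None | Number of Increment requests |
| Get_num_ops | None | Number of Get requests |
| Delete_num_ops | None | Number of Delete requests |
| Put_num_ops | None | Number of Put requests |
| ScanTime_mean | None | Average time of a Scan request |
| ScanTime_min | None | Minimum time of a Scan request |
| ScanTime_max | None | Maximum time of a Scan request |
| ScanSize_mean | bytes | Average size of a Scan request |
| ScanSize_min | None | Minimum size of a Scan request |
| ScanSize_max | None | Maximum size of a Scan request |
| slowPutCount | None | Number of slow Put operations |
| slowGetCount | None | Number of slow Get operations |
| slowAppendCount | None | Number of slow Append operations |
| slowIncrementCount | None | Number of slow Increment operations |
| slowDeleteCount | None | Number of slow Delete operations |
| blockCacheSize | None | Size of memory used by the block cache |
| blockCacheCount | None | Number of blocks in the Block Cache |
| blockCacheExpressHitPercent | None | Read cache hit ratio |
| memStoreSize | None | Size of the Memstore |
| FlushTime_num_ops | None | Number of MemStore flushes to disk performed by the RegionServer |
| flushQueueLength | None | Length of the Region Flush queue |
| flushedCellsSize | None | Size flushed to disk |
| storeFileCount | None | Number of Storefiles |
| storeCount | None | Number of Stores |
| storeFileSize | None | Size of Storefiles |
| compactionQueueLength | None | Length of the Compaction queue |
| percentFilesLocal | None | Percentage of a Region's HFiles stored on the local HDFS DataNode |
| percentFilesLocalSecondaryRegions | None | Percentage of Region replica HFiles stored on the local HDFS DataNode |
| hlogFileCount | None | Number of WAL files |
| hlogFileSize | None | Size of WAL files |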

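#### Metric Set: IPC


| Metric Name | Unit | Metric Description |
| ------------------------- | ---- | ------------------- |
| numActiveHandler | None | Number of active RPC handlers |
| NotServingRegionException | None | Number of NotServingRegionException errors thrown |
| RegionMovedException | None | Number of RegionMovedException errors thrown |
| RegionTooBusyException | None | Number of RegionTooBusyException errors thrown |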

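#### Metric Set: JVM


| Metric Name | Unit | Metric Description |
| -------------------- | ---- | ------------------------ |
| MemNonHeapUsedM | MB | Non-heap memory currently used by the JVM |
| MemNonHeapCommittedM | MB | Non-heap memory committed to the JVM |
| MemHeapUsedM | MB | Heap memory currently used by the JVM |
| MemHeapCommittedM | MB | Heap memory committed to the JVM |
| MemHeapMaxM | MB | Maximum heap memory available to the JVM |
| MemMaxM | MB | Maximum memory available to the JVM |
| GcCount | None | Total number of garbage collections |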

2 changes: 2 additions & 0 deletions home/sidebars.json
@@ -210,6 +210,8 @@
"help/doris_be",
"help/doris_fe",
"help/hadoop",
"help/hbase_master",
"help/hbase_regionserver",
"help/iotdb",
"help/hive",
"help/airflow",