The Anatomy of HBase Region Splits

asked Aug 19, 2017 in Hadoop by admin (4,410 points)
Summary
Best practices related to determining regions and region servers in a cluster
Applies To
 
  • CDH 5
  • HBase
Instructions

Region Server Deployment

RegionServers can be run on any number of servers, but each one should be run on a server that is also running a DataNode. If a RegionServer is not colocated with a DataNode, HBase locality will be poor and performance will degrade, because the data must be pulled over the network.

Number of Regions Per Region Server

Factors to Consider

There are many factors to consider when determining the number of regions per RegionServer:
  • Cloudera recommends anywhere from 100 to 200 regions per RegionServer (up to 1 million blocks per DataNode in HDFS)
  • The total number of cores and spindles
  • The results of performance testing the system, which is recommended
  • HBase can typically handle anywhere from 2TB to 10TB of data per node, depending on the workload on the cluster and the type of MapReduce jobs running on the cluster
  • Keep the region size around 10 to 30GB (400 regions might be okay if your data set is around 12TB)
  • If the workload is write-intensive, consider skipping writes to the write-ahead log (WAL), which will otherwise be the bottleneck; only do this if you can tolerate some data loss when a node crashes
The number of regions needs to be designed based on IO capacity and expected store size. 400 regions may be too many if the regions are too small, or too few if the regions are large. Some go as low as setting the number of regions equal to the number of spindles on the server, so that each writer process can potentially write to a separate disk and increase throughput. In practice, that is not the most effective use of your server's spindles.
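
The sizing guidance above comes down to simple arithmetic, sketched below. The function names and the three-RegionServer cluster are assumptions made for illustration only; the 12TB data set, the 30GB region size, and the 100 to 200 regions per RegionServer range are taken from the list above.

```python
import math


def regions_needed(dataset_gb: float, region_size_gb: float) -> int:
    """Regions required to hold dataset_gb at the target region size."""
    if dataset_gb < 0 or region_size_gb <= 0:
        raise ValueError("dataset_gb must be >= 0 and region_size_gb must be > 0")
    return math.ceil(dataset_gb / region_size_gb)


def regions_per_server(total_regions: int, region_servers: int) -> int:
    """Regions each RegionServer holds if the regions are spread evenly."""
    if region_servers <= 0:
        raise ValueError("region_servers must be > 0")
    return math.ceil(total_regions / region_servers)


# 12TB data set with 30GB regions (figures from the list above).
total = regions_needed(12 * 1024, 30)

# Hypothetical cluster of three RegionServers (an assumption, not from the article).
per_server = regions_per_server(total, 3)

# Check the result against the recommended 100 to 200 regions per RegionServer.
print(total, per_server, 100 <= per_server <= 200)
```

With these inputs the data set needs roughly 400 regions, which is where the "400 regions for 12TB" figure comes from. Treat the result as a starting point for performance testing rather than a target.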

Region Slop

Also, consider lowering the region slop from 0.2. Region slop is the threshold, expressed as a fraction of the average region count, that the HBase balancer uses to keep regions balanced: the balancer will rebalance if any RegionServer holds more than average + (average * slop) regions. This allows the region count on a RegionServer to deviate somewhat from the average region count.
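
To see what lowering the slop does, the rule above can be sketched as a small calculation. This only illustrates the average + (average * slop) arithmetic as described here; it is not the balancer's actual implementation, and the server names and region counts are made up. In HBase the setting is exposed as the `hbase.regions.slop` property.

```python
def overloaded_servers(region_counts: dict, slop: float = 0.2) -> list:
    """Return the servers whose region count exceeds average + (average * slop)."""
    if not region_counts:
        return []
    average = sum(region_counts.values()) / len(region_counts)
    threshold = average + (average * slop)
    return sorted(name for name, count in region_counts.items() if count > threshold)


# Hypothetical region counts per RegionServer; the average is 110.
counts = {"rs1": 100, "rs2": 130, "rs3": 100}

# With slop 0.2 the threshold is about 132, so rs2 is tolerated.
print(overloaded_servers(counts, slop=0.2))

# With slop 0.1 the threshold drops to about 121, so rs2 would trigger a rebalance.
print(overloaded_servers(counts, slop=0.1))
```

A lower slop keeps region counts closer to the average at the cost of more frequent region moves.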

Conclusion

Split regions only if the region size is larger than the recommended value (around 10GB). If regions are being rebalanced and you don't see any hotspotting issues, you are probably in a good position with the number of regions per RegionServer. In that case you are better off not splitting, as many other factors could impact performance. Performance test your environment and tweak based on the results.
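
As a rough illustration of the size rule above, the sketch below picks out the regions that exceed the recommended size. The region names and sizes are hypothetical, and the 10GB default mirrors the "around 10GB" figure in the text; hotspotting and balance still have to be judged separately.

```python
def split_candidates(region_sizes_gb: dict, max_region_gb: float = 10.0) -> list:
    """Return the names of regions larger than the recommended size."""
    return sorted(
        name for name, size_gb in region_sizes_gb.items() if size_gb > max_region_gb
    )


# Hypothetical region sizes in GB.
sizes = {"region-a": 4.5, "region-b": 12.0, "region-c": 9.9}

# Only region-b is above the 10GB guideline.
print(split_candidates(sizes))
```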

Note: There is no definitive answer when it comes to performance, as it's not an exact science. Every environment is different, so you will have to see what works best for your cluster. These are just some of the factors to consider when deciding how many regions per RegionServer you want.
