HDFS and HBase Data Locality

asked Aug 20, 2017 in Hadoop by admin (4,410 points)
This article aims to answer the following questions about operating an HBase cluster over HDFS:
  • How can HBase data locality be maintained during a rolling restart of HBase?
  • What is the impact of running the HDFS Balancer on HBase data locality?

Data locality is temporarily lost in a few situations:

HBase balancer moves a region across RegionServers:
   Little can be done to avoid these moves, because the region balancer must keep regions well distributed across the cluster. Avoiding splits wherever possible helps minimize movement. A compaction of the regions after they have moved helps regain locality.

Stopping and starting a cluster/service:
   HBase tries to read the old locality information from the META table and attempts to assign each region to the same RegionServer as before. In most restart situations, this restores the earlier locality state soon after the restart.
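
   The idea of reassigning regions to their previous servers can be sketched as follows. This is a minimal illustration, not HBase's actual assignment code: the function name, the region names, and the host names are all hypothetical, and the previous assignments are assumed to have been read already.

```python
def retain_assignment(previous, online_servers):
    """Reassign each region to its previous server when that server is back online.

    previous: dict mapping region name -> server that hosted it before the restart.
    online_servers: set of servers available after the restart.
    Returns (plan, unassigned): regions that keep their old server, and regions
    whose old server is gone and must be placed by some other policy.
    """
    plan, unassigned = {}, []
    for region, server in previous.items():
        if server in online_servers:
            plan[region] = server        # locality is preserved for this region
        else:
            unassigned.append(region)    # old server missing; locality may be lost
    return plan, unassigned


# Hypothetical cluster: rs2 did not come back after the restart.
previous = {"r1": "rs1", "r2": "rs2", "r3": "rs3"}
plan, unassigned = retain_assignment(previous, {"rs1", "rs3", "rs4"})
print(plan)        # {'r1': 'rs1', 'r3': 'rs3'}
print(unassigned)  # ['r2']
```

   Regions that return to their previous server keep the locality they had; only the regions left unassigned in this sketch would need a compaction to regain it.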

HDFS Balancer is run:
   The HDFS Balancer moves arbitrary blocks between DataNodes to balance disk usage, and some of these blocks may belong to HBase table region files. After a large balancing operation, it is important to also run a major compaction on performance-bound tables so that they regain whatever percentage of data locality they have lost.
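
   To make the "percentage of data locality" concrete, here is a small sketch that treats locality as the fraction of a region's bytes that have a replica on the RegionServer's own host. The Block type, the host names, and the block sizes are hypothetical; the sketch also assumes the default HDFS behaviour of placing the first replica of a newly written block on the writer's local DataNode when possible, which is why rewriting the files during a major compaction brings the data back.

```python
from dataclasses import dataclass


@dataclass
class Block:
    size: int          # block length in bytes
    hosts: frozenset   # hostnames of the DataNodes holding a replica


def locality_index(blocks, regionserver_host):
    """Fraction of bytes that have a replica on the RegionServer's own host."""
    total = sum(b.size for b in blocks)
    if total == 0:
        return 1.0  # choice made by this sketch: nothing to read remotely
    local = sum(b.size for b in blocks if regionserver_host in b.hosts)
    return local / total


MB = 1024 * 1024

# Hypothetical region served by rs1: four 128 MB blocks, each with a local replica.
blocks = [Block(128 * MB, frozenset({"rs1", "dn2", "dn3"})) for _ in range(4)]
before = locality_index(blocks, "rs1")

# The Balancer moves the rs1 replica of one block to another DataNode.
blocks[0] = Block(128 * MB, frozenset({"dn4", "dn2", "dn3"}))
after_balancer = locality_index(blocks, "rs1")

# A major compaction rewrites the files from rs1, so the new blocks
# again get their first replica on rs1 (assumed default placement).
blocks = [Block(128 * MB, frozenset({"rs1", "dn5", "dn6"})) for _ in range(4)]
after_compaction = locality_index(blocks, "rs1")

print(before, after_balancer, after_compaction)  # 1.0 0.75 1.0
```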

   There is ongoing work upstream (not in CDH5 yet) around HDFS-6133, tracked in HBASE-15531, that plans to make HBase use the block-pinning feature of HDFS so that the Balancer does not move its blocks. Once complete, this will help avoid the effects caused by the HDFS Balancer. Even when the work in the upstream JIRA HBASE-15531 is completed, it is not expected to be backported to CDH5.

RegionServer dies:
   Little can be done to retain data locality if this occurs. The regions of the failed server move onto other servers, and the HBase region balancer factors in replica locality when making assignment decisions, so the chances are good that most of the locality is still maintained. A compaction helps here as well if locality must be restored immediately rather than left to happen eventually.

A table is disabled and re-enabled:
   Region placement information is retained, recomputed, and reused when a table is disabled and re-enabled. When the region balancer decides where to place the regions again, it factors in the HDFS locality of each region's files and chooses the best server based on data locality, among other factors.
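
   The locality part of that placement decision can be sketched as below. This is a simplified illustration that considers locality only, whereas the real balancer weighs it among other factors; the function names, server names, and block sizes are hypothetical.

```python
def local_bytes_per_server(blocks, candidates):
    """Map each candidate server to the bytes of the region it holds locally.

    blocks: list of (size_in_bytes, set_of_hosts_with_a_replica) tuples.
    """
    return {
        server: sum(size for size, hosts in blocks if server in hosts)
        for server in candidates
    }


def pick_most_local_server(blocks, candidates):
    """Return the candidate holding the most region bytes; ties go to the first listed."""
    weights = local_bytes_per_server(blocks, candidates)
    return max(candidates, key=lambda server: weights[server])


# Hypothetical region with three blocks (sizes in MB) and their replica hosts.
region_blocks = [
    (128, {"rs1", "rs2", "rs3"}),
    (128, {"rs2", "rs3", "rs4"}),
    (64, {"rs2", "rs4", "rs5"}),
]
candidates = ["rs1", "rs2", "rs3", "rs4"]

print(local_bytes_per_server(region_blocks, candidates))
# {'rs1': 128, 'rs2': 320, 'rs3': 256, 'rs4': 192}
print(pick_most_local_server(region_blocks, candidates))  # rs2
```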

Note: running a major compaction restores data locality.
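
As a rough illustration of acting on this, the sketch below picks the tables whose locality has dropped below a threshold and prints the corresponding HBase shell major_compact commands, least-local tables first. The table names, locality values, and the 0.9 threshold are hypothetical, and the locality figures are assumed to have been collected beforehand by some other means.

```python
def compaction_commands(table_locality, threshold=0.9):
    """Return HBase shell commands for tables below the locality threshold.

    table_locality: dict mapping table name -> locality fraction (0.0 to 1.0).
    Tables are ordered from least local to most local.
    """
    low = sorted(
        (locality, table)
        for table, locality in table_locality.items()
        if locality < threshold
    )
    return [f"major_compact '{table}'" for _, table in low]


# Hypothetical table names and observed locality fractions.
observed = {"orders": 0.62, "users": 0.97, "events": 0.81}
for line in compaction_commands(observed):
    print(line)
# major_compact 'orders'
# major_compact 'events'
```

The printed lines can be pasted into an HBase shell session. Because a major compaction rewrites every file in the table, it is usually worth running these outside peak hours.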
