HBase Replication can miss records due to premature EOF treatment of a WAL

asked Aug 30, 2017 in Hadoop by admin (4,410 points)
Summary
Symptoms

When a tool such as org.apache.hadoop.hbase.mapreduce.replication.VerifyReplication is used to verify replication between two HBase clusters, the record counts may not match. Record counts can differ for other reasons as well, so a mismatch alone does not confirm that the issue described in this KnowledgeBase article has been encountered.

Applies To

CDH5
HBase

Cause

The issue is tracked as HBASE-15984, "Handle premature EOF treatment of WALs in replication". In some situations, the replication code believes it has reached the EOF of a WAL before it has successfully parsed all of the bytes known to exist in a cleanly closed WAL file. The failure consistently stems from an InvalidProtobufException raised after some number of seeks during attempts to tail the in-progress RegionServer WAL.

As a fix, HBASE-15984 treats cleanly closed files differently from other execution paths. If an EOF is detected because of parsing or other errors while there are still unparsed bytes before the end-of-file trailer, the WAL is reset to the very beginning and a clean read-through is attempted. A single reset should be sufficient to work around the observed replication failure. However, the change places no limit on the number of retries, so a given WAL file may be retried indefinitely. On each attempt, a message similar to the one below is logged at the WARN level. If, after HBASE-15984 has been applied, this WARN is seen multiple times for the same WAL file, Cloudera Support should be engaged, because this indicates a problem with that WAL file.
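
The decision described above can be illustrated with a minimal sketch. This is a simplified model and not the HBase implementation: the class WalReadState, its fields, the two helper functions, and the trailer size used in the usage note are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class WalReadState:
    """Hypothetical snapshot of a reader that has just reported EOF."""
    position: int         # byte offset the reader had reached
    file_length: int      # length reported for the WAL file
    trailer_size: int     # bytes taken by the end-of-file trailer (assumed known)
    cleanly_closed: bool  # True if the WAL was closed cleanly


def should_restart_from_beginning(state: WalReadState) -> bool:
    """Return True if EOF was reported while bytes before the trailer remain unparsed."""
    if not state.cleanly_closed:
        # Only cleanly closed files get the special treatment in this model.
        return False
    unparsed = state.file_length - state.trailer_size - state.position
    return unparsed > 0


def next_position(state: WalReadState) -> int:
    """Reset to offset 0 for a clean read-through, otherwise keep the position."""
    return 0 if should_restart_from_beginning(state) else state.position
```

With the figures from the log message below (position 5406208, reported file length 5932585) and an assumed trailer of a few dozen bytes, should_restart_from_beginning returns True and the read restarts from offset 0.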

2017-02-28 07:13:06,194  WARN org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Processing end of WAL file 'hdfs://nameservice1/hbase/WALs/host1.company.org,60020,1487488335489/host1.company.org%2C60020%2C1487488335489.null0.1488287585184'. At position 5406208, which is too far away from reported file length 5932585. Restarting WAL reading (see HBASE-15983 for details). stats: Total replicated edits: 1213042, current progress: walGroup [host1.company.org%2C60020%2C1487488335489.null0]: currently replicating from: hdfs://nameservice1/hbase/WALs/host1.company.org,60020,1487488335489/host1.company.org%2C60020%2C1487488335489.null0.1488287585184 at position: 5406208
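
To check whether this WARN repeats for the same WAL file, the RegionServer log can be scanned for the message and the occurrences counted per WAL path. The following is a minimal sketch that assumes the message wording matches the sample above. The function names and the default threshold of 2 are illustrative assumptions, not Cloudera guidance.

```python
import re
from collections import Counter

# Matches the WARN shown above. \s+ tolerates line wraps inside the message.
RESTART_RE = re.compile(
    r"Processing end of WAL file\s+'(?P<wal>[^']+)'\.\s+"
    r"At position (?P<pos>\d+),\s+"
    r"which is too far away from reported file length (?P<length>\d+)"
)


def count_wal_restarts(log_text: str) -> Counter:
    """Count how many times each WAL path appears in a restart WARN."""
    return Counter(m.group("wal") for m in RESTART_RE.finditer(log_text))


def wals_needing_attention(log_text: str, threshold: int = 2) -> list:
    """Return, sorted, the WAL paths restarted at least `threshold` times."""
    counts = count_wal_restarts(log_text)
    return sorted(wal for wal, n in counts.items() if n >= threshold)
```

Passing the contents of a RegionServer log file to wals_needing_attention yields the WAL files that were restarted repeatedly, which are the ones worth raising with Cloudera Support.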

Instructions

Customers who suspect that they are encountering this issue are advised to contact Cloudera Support. As noted under Symptoms, record counts reported by VerifyReplication can differ for other reasons, so a mismatch alone does not confirm that the issue described in this KnowledgeBase article has been encountered.
