Jimmy Xiang, HBase Committer

Introduction

Recently, we changed how HBase assigns regions. This architectural change is referred to as ZK-less region assignment, i.e. assigning regions without involving ZooKeeper. The change allows us to achieve greater scale as well as faster startups and assignment. It is simpler, requires less code, and improves the speed at which assignments run, so we can do faster rolling restarts. The master is also re-architected to handle more regions. This feature will be on by default in HBase 2.0.0.

This post presents the architectural evolution and explains the background that brought on the change. Some performance test results follow. At the end, we briefly discuss some tradeoffs made.

Architectural Evolution

Apache HBase is a Google Bigtable-like NoSQL database on Apache Hadoop and HDFS. A typical HBase cluster (releases 0.90 up through 1.0) had one active Master, several backup Masters, and a list of RegionServers. RegionServers host regions that serve table data. The active Master is responsible for assigning regions to RegionServers (refer to the HBase architecture for more information).

In the blog post Apache HBase AssignmentManager Improvements, we talked about the AssignmentManager and some region assignment details. Most of the improvements made it into the HBase 0.96.0 release. With these improvements, and many others, 0.96 is the first release in which IntegrationTestBigLinkedList (a test that ensures no data loss at scale, similar to goraci) passes without data loss or inconsistencies while our chaos monkey fault injection framework performs destructive actions. The cluster and its data are able to survive while chaos monkey is randomly moving/splitting/merging regions, restarting master/region servers, and dynamically modifying table schemas.

Generally, HBase is considered to be much more stable since release 0.96/0.94. However, we had not changed how regions are assigned since ZooKeeper-mediated assignment was added in HBase 0.90 (before 0.90, region assignment did not use ZooKeeper at all).

Between releases 0.90 and 1.0, RegionServers used ZooKeeper to notify the Master about region assignment progress, and as a store for transient region transition states so that when the Master died, a new Master could pick up where the old Master had left off. Once a region is opened, the RegionServer updates the meta table (an HBase system table that stores information about regions, such as start-key, location, etc.) with the new region location. The following graph shows the multiple interactions performed during the ZooKeeper based region assignments.

This graph does not look that complicated. However, to ensure consistent region states under many race conditions, the code is quite complex. The following graph shows the interactions among entities involved in one region assignment.

Due to the complexity of the existing region assignment logic, discussions on how to simplify the management of Master-coordinated tasks started in HBASE-5487. This JIRA is not just about region assignment. In this issue, the community started to discuss problems with using ZooKeeper for region assignment. A popular idea was to put the meta table on the master (co-locating meta and master, see below for explanation), and assign regions without using ZooKeeper (ZK-less region assignment). With the meta table on the master, meta is always available to the master. The region states in Master memory act as a cache for the meta table. This makes it easy to ensure that meta and the in-memory region states in the Master are consistent at all times. All transient states are persisted in meta. Another benefit of not using ZooKeeper is that all region state transitions are reported to the Master synchronously. The state transition is confirmed by the Master right away. In this design, RegionServers cannot complete a region transition without first having approval from the Master. This also helps to maintain region state integrity.

These two suggestions were implemented in HBASE-10569 and HBASE-11059 in HBase 1.0. The new scheme is disabled by default in 1.0. In 2.0, ZooKeeper based region assignment is removed (HBASE-11611). Currently in the master branch, for ZK-less assignment to work, meta must be co-located with the Master, but the intent is that by 2.0 this will no longer be required: i.e. meta can be anywhere on the cluster.

Consistent Region States and the State Machine

In ZK-less (a.k.a RPC-based) region assignments, ZooKeeper is no longer used to pass assignment information between the Master and RegionServers. Instead, all assignment state information is persisted and managed by the Master. RegionServers now notify the Master by making RPC calls to the Master directly. When a region is opened, instead of having the RegionServer update the meta table itself, it just notifies the Master and the Master updates the meta table instead: i.e. one writer rather than many. The following graph shows the significantly simpler interactions during the RPC-based region assignments.
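
Conceptually, the reporting path looks like the sketch below. The interface and names here are illustrative assumptions, not the actual HBase RPC service (the real one, added in HBASE-11059, is a protobuf-based service), but the key property holds: a RegionServer treats a transition as complete only after the Master has validated and persisted it.

    // Illustrative sketch only -- not the actual HBase protobuf service.
    enum TransitionCode { OPENED, CLOSED, FAILED_OPEN, FAILED_CLOSE }

    interface MasterAssignmentService {
      /**
       * Called by a RegionServer over RPC to report a region state change.
       * The Master validates the transition against its in-memory state,
       * persists the result to the meta table (it is the only writer),
       * and only then acknowledges. The RegionServer cannot consider the
       * transition complete until this returns true.
       */
      boolean reportRegionTransition(String regionEncodedName,
                                     TransitionCode code,
                                     String serverName);
    }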

With ZooKeeper based region assignments, the region state information is maintained in three places: ZooKeeper, the meta table, and Master memory. ZooKeeper stores the region state transition information. The meta table stores the region location after a region is assigned. The Master tries to maintain the current/latest state of each region. Both the Master and RegionServers can update the information in ZooKeeper. Only RegionServers update the region assignment information in the meta table. The Master updates region states before sending region assignment requests to RegionServers. The Master also gets region state updates based on the updated information in ZooKeeper and the meta table.

With RPC based region assignments, the region state information is maintained in two places: the meta table and Master memory. Only the Master updates the meta table. The Master makes sure the information in memory is consistent with the meta table.

As discussed in the Master coordinated operations issue (HBASE-5487), there are some tradeoffs against using ZooKeeper for region assignment:

The Master does the region assignment based on the state in its memory. So the state in memory must be the latest. Otherwise, the Master could assign a region based on wrong information, which leads to problems such as region double assignment, regions stuck in transition, or regions not assigned at all.

In ZooKeeper based region assignments, the state information in the three places is not consistent all the time. We have to have complex logic to handle all kinds of race conditions to make sure we have reliable and consistent region assignment. The complexity of the logic and the interactions makes it hard to maintain, improve, or add new features. We do not even have a strict region state machine to follow in ZooKeeper based region assignments. With ZK-less region assignment, we have a strict region state machine now:

The state machine is documented in the HBase reference guide. You can find more information in patch HBASE-11961.
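
As an illustration, the core of such a state machine can be expressed as a table of legal transitions. The sketch below is abbreviated (the real state set also covers splitting and merging; see the reference guide for the full set), but it shows the idea: the Master rejects any reported transition that is not in the table.

    import java.util.EnumMap;
    import java.util.EnumSet;
    import java.util.Map;

    // Abbreviated sketch of a region state machine with a strict
    // transition table; illegal transitions are simply rejected.
    public class RegionStateMachineSketch {
      enum State { OFFLINE, OPENING, OPEN, CLOSING, CLOSED, FAILED_OPEN, FAILED_CLOSE }

      static final Map<State, EnumSet<State>> LEGAL = new EnumMap<>(State.class);
      static {
        LEGAL.put(State.OFFLINE,      EnumSet.of(State.OPENING));
        LEGAL.put(State.OPENING,      EnumSet.of(State.OPEN, State.FAILED_OPEN));
        LEGAL.put(State.OPEN,         EnumSet.of(State.CLOSING));
        LEGAL.put(State.CLOSING,      EnumSet.of(State.CLOSED, State.FAILED_CLOSE));
        LEGAL.put(State.CLOSED,       EnumSet.of(State.OPENING, State.OFFLINE));
        LEGAL.put(State.FAILED_OPEN,  EnumSet.of(State.OPENING, State.CLOSING));
        LEGAL.put(State.FAILED_CLOSE, EnumSet.of(State.CLOSING));
      }

      // The Master checks every reported transition against the table.
      static boolean canTransition(State from, State to) {
        EnumSet<State> allowed = LEGAL.get(from);
        return allowed != null && allowed.contains(to);
      }
    }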

Simpler Logic and Less Code

With ZK-less region assignment, we simplified the region assignment logic. There are fewer concurrency problems. We also need less code compared to ZooKeeper based region assignment, which makes the mechanism easier to understand and maintain. In release 2.0, ZooKeeper based region assignment is completely removed. In patch HBASE-11611, we deleted around 9,000 source lines of code (SLOC), roughly 2% of the code in the hbase-client and hbase-server modules.

Improvements

Scalability/performance

In patch HBASE-11059 (ZK-less region assignment), we did some performance testing to show the time it takes to assign regions when an HBase cluster starts up. The following graph is from HBASE-11059:

The blue line is for ZooKeeper based region assignment, and the green line for ZK-less. It shows that ZooKeeper based region assignment takes increasingly more time as the number of regions grows. For example, to assign 1 million regions, ZooKeeper based region assignment took about 4,100 seconds (68.3 minutes); ZK-less region assignment took about 320 seconds (5.3 minutes).

This shows that ZK-less region assignment can assign more regions in less time.

Server rolling restart

The following graph shows the time to rolling restart a cluster. In this test, we have a small cluster with one Master and 5 RegionServers. The load balancer is turned off. Here is what we do:

  1. At first, we create a table with an expected number of regions;

  2. Load it with 1 million rows. We make sure the cluster is balanced;

  3. Move regions off one RegionServer to other RegionServers evenly (a sketch of this step follows the list);

  4. Restart the RegionServer;

  5. Move the same set of regions back;

  6. Do the same for all RegionServers;

  7. Do the same for Master.

We measure the time from Step 3 to Step 7, i.e. the total time to rolling restart the whole cluster. The result shows that it takes less time with ZK-less region assignment to rolling restart an HBase cluster.
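
As a rough illustration of Step 3, regions can be drained off a server with the HBase 1.0 client Admin API. This is a minimal sketch, not the harness we actually used; taking the victim server name from the command line is an assumption about how it is invoked.

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HRegionInfo;
    import org.apache.hadoop.hbase.ServerName;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.util.Bytes;

    // Sketch: drain regions off one RegionServer, round-robin, before
    // restarting it (Step 3 of the procedure above).
    public class DrainRegionServer {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Assumption: args[0] is the victim's server name, e.g.
        // "host1.example.com,16020,1428575556709".
        ServerName victim = ServerName.valueOf(args[0]);
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
          List<ServerName> targets = new ArrayList<>();
          for (ServerName sn : admin.getClusterStatus().getServers()) {
            if (!sn.equals(victim)) {
              targets.add(sn);
            }
          }
          int i = 0;
          for (HRegionInfo region : admin.getOnlineRegions(victim)) {
            ServerName dest = targets.get(i++ % targets.size());
            // Each move is a close on the victim plus an open on the
            // target; with ZK-less assignment both transitions are
            // reported to the Master via RPC and persisted in meta.
            admin.move(region.getEncodedNameAsBytes(),
                       Bytes.toBytes(dest.getServerName()));
          }
        }
      }
    }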

Server recovery

Any node in a cluster could crash. We did some testing to find out how soon a region becomes available again after a node dies. In this testing, we have a cluster with 1 Master, 1 backup Master, 3 RegionServers, and 3000 user regions. We used ycsb to measure the maximum latency while a node is killed and the cluster recovers. Other measurements, such as average latency, are not used since it is uncertain whether ycsb tries to access the same regions in each test, or whether the same regions are impacted when a node is killed.

With the ZK-less assignment and meta co-located with Master, here is what we got:

Test (node killed)                  Maximum latency (s)
Master                              51.88
Region server                       48.33
Master + region server              68.57

With ZK based region assignment and meta not co-located with Master, here is what we got:

Test (node killed)                  Maximum latency (s)
Master                              1.93
Region server with no meta          49.41
Region server with meta             56.46
Master + region server with meta    86.55

In ZK-less region assignment, the Master hosts the meta region. So killing the Master in the new scheme has a much bigger impact than killing the Master in the old scheme.

Reliability

As we mentioned above, generally, release 0.96 is considered to be much more stable when compared to previous releases. By the same standard, ZK-less region assignment is stable too. IntegrationTestBigLinkedList with chaos monkey is always green for us.

Co-locating Meta and Master

For ZK-less region assignment, we currently need meta to be co-located with the Master. The Master needs meta to be available all the time. During region transitions, we always persist region states in meta. If meta is not available, no region transition can happen since we cannot update meta. Currently, in such a case we keep trying to update meta while holding the lock on the region states object (the lock ensures the states in memory are consistent). This leads to a deadlock. That is why meta is currently required to be co-located with the Master.
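
The problematic pattern can be sketched as follows; this is illustrative pseudologic, not actual HBase source:

    // Illustrative sketch of the deadlock described above, not actual
    // HBase code. The region-states lock is held across a blocking
    // retry loop that updates meta.
    public class MetaUpdateDeadlockSketch {
      private final Object regionStatesLock = new Object();
      private volatile boolean metaAvailable = false; // meta region offline

      void persistTransition(String regionName) throws InterruptedException {
        synchronized (regionStatesLock) { // keeps in-memory states consistent
          while (!metaAvailable) {        // keep retrying the meta update
            Thread.sleep(100);
            // Bringing meta back online would itself require processing a
            // region transition, which needs this same lock -- never
            // released here, so nothing can make progress: deadlock.
          }
        }
      }
    }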

Actually, it is good to have meta on the Master for reasons like performance and simplicity. If meta is on the Master, the Master can access meta through a short-circuit path, bypassing the network and RPC layers, which is good for performance. Whenever a Master starts up, it assigns meta to itself, so the Master can assume meta is always available, which makes things simpler.

With co-location, if a Master fails over, meta always needs to be recovered and re-assigned first. This is very different from before. It puts the Master in the critical path, since applications need to access meta to find the locations of user regions that are not already cached.

With millions of regions or fewer, one meta region is enough. Even with the co-location restriction, ZK-less region assignment is an improvement over current scalability. Further region scalability improvements will require splitting meta while still persisting region state transitions; this is future work.

Summary

RPC based region assignment currently requires the meta region to be assigned on the Master (there are discussions about eliminating this limitation). Compared to ZooKeeper based region assignment, RPC based region assignment is simpler and has better performance; however, it currently needs meta to be co-located with the Master.