by Stack, the 0.96.0 Release Manager

Here are some notes on our recent hbase-0.96.0 release (For the complete list of over 2k issues addressed in 0.96.0, see Apache JIRA Release Notes).

hbase-0.96.0 was more than a year in the making.  It was heralded by three developer releases -- 0.95.0, 0.95.1, and 0.95.2 -- and it went through six release candidates before we arrived at our final assembly, released on Friday, October 18th, 2013.

The big themes that drove this release, gleaned from a rough survey of users and our experience with HBase deploys, were:

  • Improved stability: A new suite of integration cluster tests (HBASE-6241, HBASE-6201), configurable by node count, data sizing, duration, and “chaos” quotient, turned up loads of bugs around assignment and data views when scanning and fetching.  These we fixed in hbase-0.96.0.  Table locks, added for cross-cluster alterations, and cross-row transaction support, enabled on our system tables by default, now let us give a wide berth to whole classes of problematic states.

  • Scaling: HBase is being deployed on larger clusters.  Practices that worked fine on clusters of hundreds of nodes -- keeping schema in the filesystem, or archiving WAL files one at a time once done replicating -- made for significant friction when we moved to the next scaling level up.

  • Mean Time To Recovery (MTTR): A sustained effort in HBase and in our substrate, HDFS, narrowed the amount of time data is offline after a node outage.

  • Operability: Many new tools were added to help operators of hbase clusters: from a radical redo of the metrics emissions, through a new UI, to exposed hooks for health scripts.  It is now possible to trace lagging calls down through the HBase stack (HBASE-9121 Update HTrace to 2.00 and add new example usage) to figure out where time is spent and, soon, through HDFS itself, with support for pretty visualizations in Twitter Zipkin (see HBase Tracing from a recent meetup).

  • Freedom to Evolve: We redid how we persist everywhere, whether in the filesystem or up in zookeeper, and also how we carry queries and data back and forth over RPC.  Where serialization was hand-crafted when we were on Hadoop Writables, we now use generated Google protobufs.  Standardizing serialization on protobufs, with well-defined schemas, will make it easier to evolve versions of the client and servers independently of each other, in a compatible manner, without having to take a cluster restart going forward.

  • Support for hadoop1 and hadoop2: hbase-0.96.0 will run on either.  We do not ship a universal binary.  Rather, you must pick your poison: hbase-0.96.0-hadoop1 or hbase-0.96.0-hadoop2 (differences in APIs between the two versions of Hadoop forced this delivery format).  hadoop2 is far superior to hadoop1, so we encourage you to move to it.  hadoop2 has improvements that make HBase run smoother and facilitate better performance -- e.g. secure short-circuit reads -- as well as fixes that help our MTTR story.

  • Minimal disturbance to the API: Downstream projects should just work.  The API has been cleaned up and divided into user vs developer APIs, and all of it has been annotated using Hadoop’s system for denoting APIs stable, evolving, or private (see the sketch just after this list).  That said, a load of work was invested in making sure existing APIs were retained.  Radical API changes present in the last developer release were undone in late release candidates on downstreamer feedback.
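
For instance, a minimal sketch of what those annotations look like on a class; the class is hypothetical, but the annotations are the Hadoop ones hbase-0.96.0 uses:

    import org.apache.hadoop.classification.InterfaceAudience;
    import org.apache.hadoop.classification.InterfaceStability;

    // Hypothetical class, shown only to illustrate the markings: Public means
    // part of the user API; Stable means it changes incompatibly only at
    // major versions.  Internals are marked Private instead.
    @InterfaceAudience.Public
    @InterfaceStability.Stable
    public class ExampleUserFacingClass {
      // ...
    }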

Below we dig in on a few of the themes and features shipped in 0.96.0.

Mean Time To Recovery

HBase guarantees a consistent view by having a single server at a time solely responsible for data. If this server crashes, data is ‘offline’ until another server assumes responsibility.  When we talk of improving Mean Time To Recovery in HBase, we mean narrowing the time during which data is offline after a node crash.  This offline period is made up of phases: a detection phase, a repair phase, reassignment, and finally, clients noticing that the data is available in its new location.  A fleet of fixes to shrink all of these distinct phases has gone into hbase-0.96.0.

In the detection phase, the default zookeeper session period has been shrunk, and a sample watcher script will intercede on server outage and delete the regionserver’s ephemeral node so the master notices the crashed server missing sooner (HBASE-5844 Delete the region servers znode after a regions server crash); a minimal sketch of such a deletion follows below.  The same goes for the master (HBASE-5926 Delete the master znode after a master crash).  At repair time, a running tally makes it so we replay fewer edits, cutting replay time (HBASE-6659 Port HBASE-6508 Filter out edits at log split time).  A new replay mechanism has also been added (HBASE-7006 Distributed log replay -- disabled by default) that speeds recovery by skipping having to persist intermediate files in HDFS.  The HBase system table now has its own dedicated WAL, so this critical table can come back before all others (see HBASE-7213 / HBASE-8631).  Assignment all around has been sped up by bulking up operations, removing synchronizations, and multi-threading so operations can run in parallel.
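
To make the watcher-script idea concrete, here is a minimal sketch of the znode deletion such a script performs.  It assumes the default znode layout (/hbase/rs/<servername>); the connect string is a placeholder, and production code would wait for the connection to establish before issuing the delete:

    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;

    public class DeleteRegionServerZnode {
      public static void main(String[] args) throws Exception {
        // Connect to the zookeeper ensemble (placeholder connect string).
        ZooKeeper zk = new ZooKeeper("zk1.example.com:2181", 30000, new Watcher() {
          @Override public void process(WatchedEvent event) { /* no-op */ }
        });
        // Removing the regionserver's ephemeral node tells the master the
        // server is gone, rather than waiting out the full session timeout.
        // The -1 version argument means "any version".
        zk.delete("/hbase/rs/" + args[0], -1);
        zk.close();
      }
    }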

In HDFS, a new notion of ‘staleness’ was introduced (HDFS-3703, HDFS-3712).  On recovery, the namenode will avoid including stale datanodes, saving us from having to first time out against dead nodes before we can make progress.  (Relatedly, HBase avoids writing a local replica when writing the WAL, instead writing all replicas out remote on the cluster; the replica that was on the dead datanode is of no use come recovery time.  See HBASE-6435 Reading WAL files after a recovery leads to time lost in HDFS timeouts when using dead datanodes.)  Other fixes, such as HDFS-4721 Speed up lease/block recovery when DN fails and a block goes into recovery, shorten the time involved in assuming ownership of the last, unclosed WAL on server crash.

And there is more to come inside the 0.96.x timeline: e.g. bringing regions online immediately for writes, retaining locality when regions come up on a new server because we write replicas using the ‘Favored Nodes’ feature, and so on.

Be sure to visit the Reference Guide for configurations that enable and tighten MTTR; for instance, ‘staleness’ detection needs to be enabled on the HDFS side.  The Reference Guide covers how, and a sketch of the keys involved follows below.
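
A minimal sketch of the stale-datanode keys introduced by HDFS-3703/HDFS-3712; in a real deploy these belong in hdfs-site.xml on the namenode, and are set in code here only for illustration:

    import org.apache.hadoop.conf.Configuration;

    public class StaleDatanodeConfig {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Skip stale datanodes when handing out block locations for reads.
        conf.setBoolean("dfs.namenode.avoid.read.stale.datanode", true);
        // Avoid stale datanodes when choosing targets for writes too.
        conf.setBoolean("dfs.namenode.avoid.write.stale.datanode", true);
        // Milliseconds without a heartbeat before a datanode is marked stale.
        conf.setLong("dfs.namenode.stale.datanode.interval", 30000L);
      }
    }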

HBASE-5305 Improve cross-version compatibility & upgradeability

Freedom to Evolve

Rather than continue to hand-write serializations, as is required when using Hadoop Writables -- our serialization means up through hbase-0.94.x -- in hbase-0.96.0 we moved the whole shebang over to protobufs.  Everywhere HBase persists, we now use protobuf serializations: when writing zookeeper znodes, when writing files in HDFS, and whenever we send data over the wire when RPC’ing.

Protobufs support evolving types, if you are careful, making it so we can amend interfaces in a compatible way going forward: a protobuf parser skips over field tags it does not recognize, so an old reader can consume a message to which new optional fields have been added.  This is a freedom we were sorely missing -- or, to be more precise, one that was painful to exercise -- when all serialization was by hand in Hadoop Writables.  This change breaks compatibility with previous versions.
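
The property we lean on can be shown in a few lines.  The sketch below uses protobuf’s DynamicMessage so it is self-contained; HBase’s actual messages are generated from .proto files rather than built at runtime like this, and the message and field names here are made up:

    import com.google.protobuf.DescriptorProtos.DescriptorProto;
    import com.google.protobuf.DescriptorProtos.FieldDescriptorProto;
    import com.google.protobuf.DescriptorProtos.FileDescriptorProto;
    import com.google.protobuf.Descriptors.Descriptor;
    import com.google.protobuf.Descriptors.FileDescriptor;
    import com.google.protobuf.DynamicMessage;

    public class EvolutionDemo {
      // Build a one-message schema in memory: one optional string field per name.
      static Descriptor schema(String... fieldNames) throws Exception {
        DescriptorProto.Builder msg = DescriptorProto.newBuilder().setName("ServerInfo");
        for (int i = 0; i < fieldNames.length; i++) {
          msg.addField(FieldDescriptorProto.newBuilder()
              .setName(fieldNames[i]).setNumber(i + 1)
              .setType(FieldDescriptorProto.Type.TYPE_STRING)
              .setLabel(FieldDescriptorProto.Label.LABEL_OPTIONAL));
        }
        FileDescriptorProto file = FileDescriptorProto.newBuilder()
            .setName(fieldNames.length + ".proto").addMessageType(msg).build();
        return FileDescriptor.buildFrom(file, new FileDescriptor[0])
            .getMessageTypes().get(0);
      }

      public static void main(String[] args) throws Exception {
        Descriptor v1 = schema("host");          // the old reader's view
        Descriptor v2 = schema("host", "rack");  // the writer added a field
        DynamicMessage written = DynamicMessage.newBuilder(v2)
            .setField(v2.findFieldByName("host"), "rs1.example.com")
            .setField(v2.findFieldByName("rack"), "r42")
            .build();
        // The v1 reader parses the v2 bytes fine; the unrecognized field is
        // simply carried as an unknown field rather than breaking the parse.
        DynamicMessage read = DynamicMessage.parseFrom(v1, written.toByteArray());
        System.out.println(read.getField(v1.findFieldByName("host")));
      }
    }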

Our RPC is also now described using protobuf Service definitions.  Generated stubs are hooked up to a derivative, stripped-down version of the Hadoop RPC transport.  Our RPC now has a specification.  See the Appendix in the Reference Guide.

HBASE-8015 Support for Namespaces

Our brothers and sisters over at Yahoo! contributed table namespaces, a means of grouping tables similar to MySQL’s notion of a database, so they can better manage their multi-tenant deploys.  To follow in short order will be quotas, resource allocation, and security, all by namespace.
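
By way of illustration, a minimal sketch of creating a namespace and a table inside it through the 0.96 client API; the "acme" namespace, "orders" table, and "d" column family are made-up names:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.NamespaceDescriptor;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.HBaseAdmin;

    public class NamespaceExample {
      public static void main(String[] args) throws Exception {
        HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
        // Create the namespace, then a table addressed as "acme:orders".
        admin.createNamespace(NamespaceDescriptor.create("acme").build());
        HTableDescriptor table =
            new HTableDescriptor(TableName.valueOf("acme", "orders"));
        table.addFamily(new HColumnDescriptor("d"));
        admin.createTable(table);
        admin.close();
      }
    }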

HBASE-4050 Rationalize metrics, metric2 framework implementation

New metrics have been added and the whole plethora given a radical edit: better categorization, naming, and typing; patterns were enforced so the myriad metrics are navigable and look pretty up in JMX.  Metrics have been moved onto the Hadoop 2 Metrics2 interfaces.  See Migration to the New Metrics Hotness – Metrics2 for detail.
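
As an illustration, a minimal sketch that lists the HBase metrics beans from inside a server process; it assumes the beans land in the "Hadoop" JMX domain under service=HBase, the usual metrics2 pattern:

    import java.lang.management.ManagementFactory;
    import javax.management.MBeanServer;
    import javax.management.ObjectName;

    public class ListHBaseMBeans {
      public static void main(String[] args) throws Exception {
        MBeanServer mbs = ManagementFactory.getPlatformMBeanServer();
        // Pattern query: print the object name of every matching metrics bean.
        for (ObjectName name
            : mbs.queryNames(new ObjectName("Hadoop:service=HBase,*"), null)) {
          System.out.println(name);
        }
      }
    }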

New Region Balancer

A new balancer, using an algorithm along the lines of simulated annealing or greedy hillclimbing, factors in not only region count -- the only attribute considered by the old balancer -- but also region read/write load and locality, among other attributes, when coming up with a balance decision.
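
Should you want to pin the balancer choice explicitly, a minimal sketch of the knob involved; hbase-site.xml is the usual home for it, and it is set in code here only for illustration:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    public class BalancerConfig {
      public static void main(String[] args) {
        Configuration conf = HBaseConfiguration.create();
        // Name the balancer implementation the master loads at startup.
        conf.set("hbase.master.loadbalancer.class",
            "org.apache.hadoop.hbase.master.balancer.StochasticLoadBalancer");
      }
    }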

Cell

In hbase-0.96.0, we began work on a long-term effort to move off of our base KeyValue type and instead use a Cell interface throughout the system.  The intent is to open up the way to trying different implementations of the base type: different encodings, compressions, and layouts of content, to better align with how the machine works.  The move, though not yet complete, has already yielded performance gains.  The Cell interface shows through in our hbase-0.96.0 API, with the KeyValue references deprecated in 0.96.0.  All further work should be internal-only and transparent to the user.
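
From the client side, the move looks like this minimal sketch: iterate a Result through the Cell interface rather than KeyValue (the dump method is our own invention):

    import org.apache.hadoop.hbase.Cell;
    import org.apache.hadoop.hbase.CellUtil;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class CellExample {
      // Walk a Result's cells via the Cell interface; rawCells() supersedes
      // the KeyValue-returning raw(), which is deprecated in 0.96.0.
      static void dump(Result result) {
        for (Cell cell : result.rawCells()) {
          System.out.println(Bytes.toString(CellUtil.cloneRow(cell)) + " => "
              + Bytes.toString(CellUtil.cloneValue(cell)));
        }
      }
    }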

Incompatible changes

You will need to restart your cluster to come up on hbase-0.96.0.  After deploying the binaries, run a checker script that will look for the existence of old-format HFiles no longer supported in hbase-0.96.0.  The script will warn of their presence and ask you to compact them away.  This can be done without disturbing current serving.  Once all have been purged, stop your cluster and run a small migration script.  The migration script will upgrade the content of zookeeper and rearrange the content of the filesystem to support the new table namespaces feature.  The migration should take a few minutes at most.  Restart.  See Upgrading from 0.94.x to 0.96.x for details.

Miscellaneous

From here on out, 0.96.x point releases with bug fixes only will start showing up on a roughly monthly basis after the model established in our hbase-0.94 line.  hbase-0.98.0, our next major version, is scheduled to follow in short order (months).  You will be able to do a rolling restart off 0.96.x and up onto 0.98.0.  Guaranteed.

A big thanks goes out to all who helped make hbase-0.96.0 possible.

This release is dedicated to Shaneal Manek, HBase contributor.

Download your hbase-0.96.0 here.