Investing In Big Data: Apache HBase

This is the fourth in a series of posts on "Why We Use Apache HBase", in which we let HBase users and developers borrow our blog so they can showcase their successful HBase use cases, talk about why they use HBase, and discuss what worked and what didn't.

Lars Hofhansl and Andrew Purtell are HBase Committers and PMC members. At Salesforce.com, Lars is a Vice President and Software Engineering Principal Architect, and Andrew is a Cloud Storage Architect.

An earlier version of the discussion in this post was published on the Salesforce + Open Source = ❤ blog.

- Andrew Purtell

Image credit: Earth at Night / CC BY 3.0

The world is good at making data. You can see it in every corner of commerce and industry: everything we interact with is getting smarter and producing a massive stream of readings, geolocations, images, and more. From medical devices to jet engines, this flood of data is transforming every part of the modern world. And it’s accelerating!

Salesforce’s products are all about helping our customers connect with their customers. So obviously, a big part of that is equipping them to interact with all of the data a customer’s interactions might generate, in whatever quantities it shows up. We do that in many ways, across all of our products.

In this post, we’d like to zoom in on one particular Open Source data system we use and contribute heavily to: Apache HBase. If you’re not familiar with HBase, this post will start with a few high level concepts about the system, and then go into how it fits in at Salesforce.

What IS HBase?

HBase is an open source distributed database. It’s designed to store record-oriented data across a scalable cluster of machines. We sometimes refer to it as a “sparse, consistent, distributed, multi-dimensional, persistent, sorted map”. This usually makes people say “Wat??”. So, to break that down a little bit:

distributed : rows are spread over many machines;
consistent : it’s strongly consistent (at a per-row level);
persistent : it stores data on disk with a log, so it sticks around;
sorted : rows are stored in sorted order, making seeks very fast;
sparse : if a row has no value in a column, it doesn’t use any space;
multi-dimensional : data is addressable in several dimensions: tables, rows, columns, versions, etc.

Think of it as a giant list of key / value pairs, spread over a large cluster, sorted for easy access.
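
To make that description concrete, here is a minimal single-process sketch of a sorted, sparse, multi-dimensional map. It is only a toy model of the idea, not HBase code: the class name SortedSparseMap, its methods, and the sample keys are all made up for illustration, and distribution and persistence are left out entirely.

```python
from bisect import bisect_left, insort


class SortedSparseMap:
    """Toy model (hypothetical): row key -> column -> {timestamp: value}."""

    def __init__(self):
        self._row_keys = []  # row keys kept in sorted order, for range scans
        self._rows = {}      # row key -> {column: {timestamp: value}}

    def put(self, row, column, timestamp, value):
        if row not in self._rows:
            insort(self._row_keys, row)  # insert while keeping keys sorted
            self._rows[row] = {}
        # Sparse: only columns that actually hold a value take up space.
        self._rows[row].setdefault(column, {})[timestamp] = value

    def get(self, row, column):
        """Return the newest version of a cell, or None if never written."""
        versions = self._rows.get(row, {}).get(column)
        if not versions:
            return None
        return versions[max(versions)]

    def scan(self, start_row, stop_row):
        """Yield (row, {column: newest value}) for start_row <= row < stop_row."""
        i = bisect_left(self._row_keys, start_row)
        while i < len(self._row_keys) and self._row_keys[i] < stop_row:
            row = self._row_keys[i]
            yield row, {c: v[max(v)] for c, v in self._rows[row].items()}
            i += 1


table = SortedSparseMap()
table.put("user#002", "info:name", 1, "Grace")
table.put("user#001", "info:name", 1, "Ada")
table.put("user#001", "info:name", 2, "Ada L.")  # a newer version of the cell
table.put("order#9", "info:total", 1, "42")

# "$" sorts immediately after "#", so this range covers every "user#" row.
user_rows = list(table.scan("user#", "user$"))
```

Because the row keys stay sorted, the scan finds its starting point with a binary search and then walks forward, which is the "sorted for easy access" property in miniature.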

People often refer to HBase as a “NoSQL” store, a term popularized in 2009 to refer to a big cohort of similar systems that were doing data storage without SQL (Structured Query Language). Contributors to the HBase project will tell you they have always disliked that term, though; a toaster is also NoSQL. It’s better to say HBase is architected differently than a typical relational database engine, and can scale out better for some use cases.

Google was among the first companies to move in this direction, because they were operating at the scale of the entire web. So, they long ago built their infrastructure on top of a new kind of system (BigTable, which is the direct intellectual ancestor of HBase). It grew from there, with dozens of Open Source variants on the theme emerging in the space of a few years: Cassandra, MongoDB, Riak, Redis, CouchDB, etc.

Why use a NoSQL store, if you already have a relational database?

Let’s be clear: relational databases are terrific. They’ve been the dominant form of data storage on Earth for nearly three decades, and that’s not by accident. In particular, they give you one really killer feature: the ability to decompose the physical storage of data into different conceptual buckets (entities, aka tables), with relationships to each other … and to modify the state of many related values atomically (transactions). This incurs a cost (for finding and re-assembling the decomposed data when you want to read it) but relational database query planners have gotten so shockingly good that this is actually a perfectly good trade-off in most cases.

Salesforce is deeply dependent on relational databases. A majority of our row-oriented data still lives there, and they are integral to the basic functionality of the system. We’ve been able to scale them to massive load, both via our unique data storage architecture, and also by sharding our entire infrastructure (more on that in The Crystal Shard, a post by Ian Varley on the Salesforce blog). They’re not going anywhere. On the contrary, we do constant and active research into new designs for scaling and running relational databases.

But it turns out there is a subset of use cases that have very different requirements from relational data: less emphasis on webs of relationships that require complex transactions for correctness, and more emphasis on large streams of data that accrue over time and need linear access characteristics. You certainly can store these in an RDBMS. But when you do, you’re essentially paying a penalty (performance and scale limitations) for features you don’t need; a simpler design can scale more powerfully.

It’s for those new use cases–and all the customer value they unlock–that we’ve added HBase to our toolkit.

Of all the NoSQL stores, why HBase?

This question comes up a lot: people say, “I heard XYZ-database is web scale, you should use that!”. The world of “polyglot persistence” has produced a boatload of choices, and they all have their merits. In fact, we use almost all of them somewhere in the Salesforce product suite: Cassandra, Redis, CouchDB, MongoDB, etc.

To choose HBase as a key area of investment for Salesforce Core, we went through a pretty intense “bake-off” process that included evaluations of several different systems, with experiments, spikes, POCs, etc. The decision ultimately came down to three big points for us:

HBase is a strongly consistent store. In the CAP Theorem, that means it’s a (CP) store, not an (AP) store. Eventual consistency is great when used for the right purpose, but it can tend to push challenges up to the application developer. We didn’t think we’d be able to absorb that extra complexity, for general use in a product with such a large surface area.
It’s a high quality project. It did well in our benchmarks and tests, and is well respected in the community. Facebook built their entire Messaging infrastructure on HBase (as well as many other things), and the Open Source community is active and friendly.
The Hadoop ecosystem already had an operational presence at Salesforce. We’ve been using Hadoop in the product for ages, and we already had a pretty good handle on how to deploy and operate it. HBase can use Hadoop’s distributed filesystem for persistence and offers first class integration with MapReduce (and, coming soon, Spark), so it is a way to level up existing Hadoop deployments with modest incremental effort.

Our experience was (and still is) that HBase wasn’t the easiest to get started with, but it was the most dependable at scale; we sometimes refer to it as the “industrial strength” scalable store, ready for use in demanding enterprise situations, and taking issues like data durability and security extremely seriously.

HBase (and its API) is also broadly used in the industry. HBase is an option on Amazon’s EMR, and is also available as part of Microsoft’s Azure offerings. Google Cloud includes a hosted BigTable service sporting the de-facto industry standard HBase client API. (It’s fair to say the HBase client API has widespread if not universal adoption for Hadoop and Cloud storage options, and will likely live on beyond the HBase engine proper.)

When does Salesforce use HBase?

There are three indicators we look at when deciding to run some functionality on HBase. We want things that are big, record-oriented, and transactionally independent.

By “big”, we mean things in the ballpark of hundreds of millions of rows per tenant, or more. HBase clusters in the wild have been known to grow to thousands of compute nodes, and hundreds of Petabytes, which is substantially bigger than we’d ever grow any single one of our relational database instances. (None of our individual clusters are that big yet, mind you.) For HBase, data size is not a challenge; it’s a desired feature!

By “record-oriented”, we mean that it “looks like” a database, not a file store. If your data can be modeled as big BLOBs, and you don’t need to independently read or write small bits of data in the middle of those big blobs, then you should use a file store.

Why does it matter if things are record-oriented? After all, HBase is built on top of HDFS, which is … a file store. The difference is that the essential function that HBase plays is specifically to let you treat big immutable files as if they were a mutable database. To do this magic, it uses something called a Log-Structured Merge Tree (LSM tree), which provides for both fast reads and fast writes. (If that seems impossible, read up on LSM trees; they’re legitimately impressive.)
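
For readers who want a feel for the trick, here is a deliberately tiny sketch of the log-structured merge idea: writes go to an in-memory table, which is periodically frozen into an immutable sorted run, and reads consult the newest data first. This is not how HBase is implemented. The class TinyLSM and the flush_threshold parameter are invented for illustration, and real systems add a write-ahead log, on-disk files, and indexes that this toy skips.

```python
from bisect import bisect_left


class TinyLSM:
    """Toy log-structured merge store (hypothetical, in-memory only)."""

    def __init__(self, flush_threshold=2):
        self.memtable = {}   # recent writes, mutable
        self.runs = []       # immutable sorted runs, oldest first
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        # A write only touches the in-memory table, which keeps it cheap.
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Freeze the memtable as a sorted run that is never modified again.
        if self.memtable:
            self.runs.append(tuple(sorted(self.memtable.items())))
            self.memtable = {}

    def get(self, key):
        # Newest data wins: memtable first, then runs from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.runs):
            i = bisect_left(run, (key,))  # binary search within a sorted run
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None

    def compact(self):
        # Merge every run into one; values from newer runs overwrite older ones.
        merged = {}
        for run in self.runs:
            merged.update(run)
        self.runs = [tuple(sorted(merged.items()))] if merged else []


store = TinyLSM(flush_threshold=2)
store.put("a", 1)
store.put("b", 2)    # memtable is full, so it is flushed as the first run
store.put("a", 10)   # "updates" a without touching the first run
store.put("c", 3)    # second flush
runs_before_compaction = len(store.runs)
store.compact()
```

Note that the update to "a" never modifies the first run; the old value is simply shadowed by newer data and finally dropped when the runs are merged.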

And, what do we mean by “transactionally independent”? Above, we described that a key feature of relational databases is their transactional consistency: you can modify records in many different tables and have those modifications either commit or rollback as a unit. HBase, as a separate data store, doesn’t participate in these transactions, so for any data that spans both stores, it requires application developers to reason about consistency on their own. This is doable, but it is tricky. So, we prefer to emphasize those cases where that reasoning is simple by design.

(For the hair-splitters out there, note that HBase does offer consistent “transactions”, but only on a single-row basis, not across different rows, objects, or (most importantly) different databases.)
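
As a rough illustration of what single-row atomicity buys you, the sketch below models a table in which each row has its own lock and a conditional write either applies completely or not at all. It is loosely inspired by the check-and-put style of operation, but the class RowAtomicTable, its method names, and the sample data are hypothetical and do not reflect the HBase client API or its internals.

```python
import threading


class RowAtomicTable:
    """Toy table (hypothetical) with atomic conditional writes per row."""

    def __init__(self):
        self._rows = {}                    # row key -> {column: value}
        self._locks = {}                   # row key -> lock for that row only
        self._guard = threading.Lock()     # protects the two dicts above

    def _lock_for(self, row):
        with self._guard:
            self._rows.setdefault(row, {})
            return self._locks.setdefault(row, threading.Lock())

    def check_and_put(self, row, column, expected, new_values):
        """Apply new_values to one row only if row[column] == expected."""
        with self._lock_for(row):
            if self._rows[row].get(column) != expected:
                return False
            self._rows[row].update(new_values)  # all columns change together
            return True

    def get(self, row):
        with self._lock_for(row):
            return dict(self._rows[row])


table = RowAtomicTable()
results = []


def claim(owner):
    # Each worker tries to claim the row; the check and the write are one step.
    won = table.check_and_put("acct#1", "owner", None,
                              {"owner": owner, "state": "claimed"})
    results.append((owner, won))


threads = [threading.Thread(target=claim, args=(f"worker-{i}",)) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

winners = [owner for owner, won in results if won]
```

Exactly one worker wins, and the row it wrote is internally consistent. Nothing in this sketch coordinates changes across two rows, let alone across two databases, which is the gap the paragraph above describes.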

One criterion you’ll notice we didn’t mention is that, because HBase is a “NoSQL” store, you wouldn’t use it for anything that calls for SQL. The reason we didn’t mention that is that it isn’t true! We built and open sourced a library called “Phoenix” (which later became Apache Phoenix) which brings a SQL access layer to HBase. So, as they say, “We put the SQL back in NoSQL”.

We treat HBase as a “System Of Record”, which means that we depend on it to be every bit as secure, durable and available as our other data stores. Getting it there required a lot of work: accounting for site switching, data movement, authenticated access between subsystems, and more. But we’re glad we did!

What features does Salesforce use HBase for?

To give you a concrete sense of some of what HBase is used for at Salesforce, we’ll mention a couple of use cases briefly. We’ll also talk about more in future posts.

The main place you may have seen HBase in action in Salesforce to date is what we call Salesforce Shield. Shield is a connected suite of product features that enable enterprise businesses to encrypt data and track user actions. This is a requirement in certain protected sectors of business, like Health and Government, where compliance laws dictate the level of retention and oversight of access to data.

One dimension of this feature is called Field Audit Trail (FAT). In Salesforce, there’s an option that allows you to track changes made to fields (either directly by users through a web UI or mobile device, or via other applications through the API). This historical data is composed of “before” and “after” values for every tracked field of an object. These stick around for a long time, as they’re a matter of historical record. That means that if you have data that changes very frequently, this data set can grow rapidly, and without any particular bound. So we use HBase as a destination for moving older sets of audit data over, so it’s still accessible via the API, but imposes no cost on the relational database optimizer. The same principle applies to other data that behaves the same way; we can archive this data into HBase as a cold store.
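
To show why a sorted store suits this kind of ever-growing history, here is a small sketch of one possible row key layout for audit records. It is purely hypothetical and is not the schema Salesforce uses: the function names, the "|" delimiter (which assumes identifiers never contain that character), the 13-digit timestamp bound, and the sample events are all assumptions made for the example.

```python
from bisect import bisect_left

MAX_TS = 10**13 - 1  # assumed upper bound for millisecond timestamps


def audit_row_key(tenant_id, record_id, field, ts_millis):
    # Reversing the timestamp makes newer changes sort before older ones.
    reversed_ts = MAX_TS - ts_millis
    return f"{tenant_id}|{record_id}|{field}|{reversed_ts:013d}"


def history_prefix(tenant_id, record_id, field):
    return f"{tenant_id}|{record_id}|{field}|"


def scan_prefix(sorted_items, prefix):
    """Yield values whose key starts with prefix, using one seek plus a walk."""
    keys = [key for key, _ in sorted_items]
    i = bisect_left(keys, prefix)
    while i < len(keys) and keys[i].startswith(prefix):
        yield sorted_items[i][1]
        i += 1


# (tenant, record, field, timestamp in ms, (before, after)); made-up sample data
events = [
    ("org1", "acct42", "Status", 1_700_000_000_000, ("New", "Open")),
    ("org1", "acct42", "Status", 1_700_000_500_000, ("Open", "Closed")),
    ("org1", "acct42", "Owner", 1_700_000_100_000, ("ada", "grace")),
    ("org2", "acct42", "Status", 1_700_000_200_000, ("New", "Open")),
]

# A sorted list of (key, value) pairs stands in for the sorted store.
store = sorted((audit_row_key(t, r, f, ts), change)
               for t, r, f, ts, change in events)

status_history = list(scan_prefix(store, history_prefix("org1", "acct42", "Status")))
```

With a layout like this, all the history for one field of one record in one tenant sits in a contiguous key range, newest first, so reading it stays a short sequential scan no matter how large the overall data set becomes.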

Another part of Shield that uses HBase is the Event Monitoring feature. The initial deployment was based on Hadoop, and culls a subset of application log lines, making them available to customers directly as downloadable files. New work is in progress to capture some event data into HBase, like Login Forensics, in real time, so it can be queried interactively. This could be useful for security and audit, limit monitoring, and general usage tracking. We’re also introducing an asynchronous way to query this data so you can work with the potentially large resulting data sets in sensible ways.

More generally, the Platform teams at Salesforce have been cooking up a general way for customers to use HBase. It’s called BigObjects. BigObjects behave in most (but not all) ways like standard relational-database-backed objects, and are ideally suited for use cases that require ingesting large amounts of read-only data from external systems: log files, POS data, event data, clickstreams, etc.

We’re adding more features that use HBase in every release, like improving mention relevancy in Chatter, tracking Apex unit test results, caching report data, and more. We’ll post follow-ups here and on the Salesforce Engineering Medium Channel about these as they evolve.

Contributing To HBase

So, we’ve got HBase clusters running in data centers all over the world, powering new features in the product, and running on commodity hardware that can scale linearly.

But perhaps more importantly, Salesforce has put a premium on not just being a user of HBase, but also being a contributor. The company employs multiple committers on the Apache project, including both of us (Lars is a committer and PMC member, and Andrew is an Apache VP and PMC Chair). We’ve written and committed hundreds of patches, including some major features. As committers, we’ve reviewed and committed thousands of patches, and served as release managers for the 0.94 release (Lars) and 0.98 release (Andrew).

We’ve also presented dozens of talks at conferences about HBase, including HBaseCon and OSCON.

We’re committed to doing our work in the community, not in a private fork except where temporarily needed for critical issues. That’s a key tenet of the team’s OSS philosophy — no forking! — that we’ll talk about more in an upcoming post on the Salesforce Engineering Medium Channel.

By the way, outside of the Core deployment of HBase that we’ve described here, there are actually a number of other places in the larger Salesforce universe where HBase is deployed as well, including Data.com, the Marketing Cloud, and Argus, an internal time-series monitoring service. More on these uses soon, too!

Conclusion

We’re happy to have been a contributing part of the HBase community these last few years, and we’re looking forward to even more good stuff to come.