By Andrew Purtell, HBase Committer and member of the Intel HBase Team


Apache HBase is “the Apache Hadoop database”, a horizontally scalable nonrelational datastore built on top of components offered by the Apache Hadoop ecosystem, notably Apache ZooKeeper and Apache Hadoop HDFS. Although HBase therefore offers first class Hadoop integration, and is often chosen for that reason, it has come into its own as a good choice for high scale data storage of record. HBase is often the kernel of big data infrastructure.

There are many advantages offered by HBase: it is free open source software, it offers linear and modular scalability, it has strictly consistent reads and writes, automatic and configurable sharding and automatic failover are core concerns, and it has a deliberate tolerance for operation on “commodity” hardware, which is a very significant benefit at high scale. As more organizations are faced with Big Data challenges, HBase is increasingly found in roles formerly occupied by traditional data management solutions, gaining new users with new perspectives and the requirements and challenges of new use cases. Some of those users have a strong interest in security. They may be a healthcare provider operating within a strict regulatory regime regarding the access and sharing of patient information. They might be a consumer web property in a jurisdiction with strong data privacy laws. They could be a military or government agency that must manage a strict separation between multiple levels of information classification.

Access Control Lists and Security Labels

For some time Apache HBase has offered a strong security model based on Kerberos authentication and access control lists (ACLs). When Yahoo first made a version of Hadoop capable of strong authentication available in 2009, my team and I at a former employer, a commercial computer security vendor, were very interested. We needed a scale out platform for distributed computation, but we also needed one that could provide assurance against access of information by unauthorized users or applications. Secure Hadoop was a good fit. We also found in HBase a datastore that could scale along with computation, and offered excellent Hadoop integration, except first we needed to add feature parity and integration with Secure Hadoop. Our work was contributed back to Apache as HBASE-2742 and ZOOKEEPER-938. We also contributed an ACL-based authorization engine as HBASE-3025, and the coprocessor framework upon which it was built as HBASE-2000 and others. As a result, in the Apache family of Big Data storage options, HBase was the first to offer strong authentication and access control. These features have been improved and evolved by the HBase community many times over since then.

An access control list, or ACL, is a list of permissions associated with an object. The ACL specifies which subjects (users or system processes) shall be granted access to those objects, as well as which operations are allowed. Each entry in an ACL describes a subject and the operation(s) the subject is entitled to perform. When a subject requests an operation, HBase first uses Hadoop’s strong authentication support, based on Kerberos, to verify the subject’s identity. HBase then finds the relevant ACLs, and checks the entries of those ACLs to decide whether or not the request is authorized. This is an access control model that provides a lot of assurance, and is a flexible way for describing security policy, but not one that addresses all possible needs. We can do more.

All values written to HBase are stored in what is known as a cell. (“Cell” is used interchangeably with “key-value” or “KeyValue”, mainly for legacy reasons.) Cells are identified by a multidimensional key: { row, column, qualifier, timestamp }. (“Column” is used interchangeably with “family” or “column family”.) The table is implicit in every key, even if not actually present, because all cells are written into columns, and every column belongs to a table. This forms a hierarchical relationship: 

table -> column family -> cell

Today, users can grant individual access permissions to subjects on tables and column families. Table permissions apply to all column families and all cells within those families. Column family permissions apply to all cells within the given family. The permissions will be familiar to any DBA: R (read), W (write), C (create), X (execute), A (admin). However, for various reasons, cell level permissions have not been supported until now.
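To make the table and column family scoping concrete, here is a minimal sketch in plain Java. It is an illustrative model only, not the AccessController's code: the class AclSketch, its methods, and the "user|table|family" key encoding are all hypothetical names invented for this example.

```java
import java.util.EnumSet;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Illustrative model only; none of these types are HBase classes.
final class AclSketch {
    enum Action { READ, WRITE, CREATE, EXEC, ADMIN }

    // Grants keyed by "user|table" or "user|table|family" (hypothetical encoding).
    private final Map<String, Set<Action>> grants = new HashMap<>();

    // Pass family == null to grant at table scope.
    void grant(String user, String table, String family, Action... actions) {
        String key = (family == null) ? user + "|" + table : user + "|" + table + "|" + family;
        Set<Action> set = grants.computeIfAbsent(key, k -> EnumSet.noneOf(Action.class));
        for (Action a : actions) {
            set.add(a);
        }
    }

    // A table grant covers every family in the table;
    // a family grant covers only cells in that family.
    boolean isAuthorized(String user, String table, String family, Action action) {
        Set<Action> tableGrants = grants.get(user + "|" + table);
        if (tableGrants != null && tableGrants.contains(action)) {
            return true;
        }
        Set<Action> familyGrants = grants.get(user + "|" + table + "|" + family);
        return familyGrants != null && familyGrants.contains(action);
    }
}
```

In this model a user granted READ on a table can read any family in it, while a user granted READ on one family is refused on every other family.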

Other high scale data storage options in the Apache family, notably Apache Accumulo, take a different approach. Accumulo has a data model almost identical to that of HBase, but implements a security mechanism called cell-level security. Every cell in an Accumulo store can have a label, stored effectively as part of the key, which is used to determine whether a value is visible to a given subject or not. The label is not an ACL, it is a different way of expressing security policy. An ACL says explicitly what subjects are authorized to do what. A label instead turns this on its head and describes the sensitivity of the information to a decision engine that then figures out if the subject is authorized to view data of that sensitivity based on (potentially, many) factors. This enables data of various security levels to be stored within the same row, and users of varying degrees of access to query the same table, while enforcing strict separation between multiple levels of information classification. HBase users might approximate this model using ACLs, but it would be labor intensive and error prone.

New HBase Cell Security Features

Happily our team here at Intel has been busy extending HBase with cell level security features. First, contributed as HBASE-8496, HBase can now store arbitrary metadata for a cell, called tags, along with the cell. Then, as of HBASE-7662, HBase can store into and apply ACLs from cell tags, extending the current HBase ACL model down to the cell. Then, as of HBASE-7663, HBase can store visibility expressions into tags, providing cell-level security capabilities similar to Apache Accumulo, with API and shell support that will be familiar to Accumulo users. Finally, we have also contributed transparent server side encryption, as HBASE-7544, for additional assurance against accidental leakage of data at rest. We are working with the HBase community to make these features available in the next major release of HBase, 0.98.

Let’s talk a bit more now about how HBase visibility labels and per-cell ACLs work.

HFile version 3

In order to take advantage of any cell level access control features, it will be necessary to store data in the new HFile version, 3. HFile version 3 is very similar to HFile version 2 except it also has support for persisting and retrieving cell tags, optional dictionary based compression of tag contents, and the HBASE-7544 encryption feature. Enabling HFile version 3 is as easy as adding a single configuration setting to all HBase site XML files, followed by a rolling restart. All existing HFile version 2 files will be read normally. New files will be written in the version 3 format. Although HFile version 3 will be marked as experimental throughout the HBase 0.98 release cycle, we have found it to be very stable under high stress conditions on our test clusters.

HBase Visibility Labels

We have introduced a new coprocessor, the VisibilityController, which can be used on its own or in conjunction with HBase’s AccessController (responsible for ACL handling). The VisibilityController determines, based on label metadata stored in the cell tag and associated with a given subject, if the user is authorized to view the cell. The maximal set of labels granted to a user is managed by new shell commands getauths, setauths, and clearauths, and stored in a new HBase system table. Accumulo users will find the new HBase shell commands familiar.

When storing or mutating a cell, the HBase user can now add visibility expressions, using a backwards compatible extension to the HBase API. (By backwards compatible, we mean older servers will simply ignore the new cell metadata rather than throwing an exception or failing.)

Mutation#setCellVisibility(new CellVisibility(String labelExpression));

The visibility expression can contain labels joined with the logical operators ‘&’, ‘|’ and ‘!’. Parentheses, ‘(’ and ‘)’, can be used to specify the order of precedence. For example, consider the label set { confidential, secret, topsecret, probationary }, where the first three are sensitivity classifications and the last describes if an employee is probationary or not. If a cell is stored with this visibility expression:

( secret | topsecret ) & !probationary

Then any user associated with the secret or topsecret label will be able to view the cell, as long as the user is not also associated with the probationary label. Furthermore, any user only associated with the confidential label, whether probationary or not, will not see the cell or even know of its existence. Accumulo users will find HBase visibility expressions familiar, although HBase offers a superset of Accumulo's boolean operators.
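The example above can be checked with a small, self-contained evaluator. This is a sketch in plain Java and not the VisibilityController's implementation: the class VisibilitySketch and its method names are hypothetical. It also assumes that ‘&’ binds tighter than ‘|’ when parentheses are omitted and that labels consist of letters and digits, neither of which is confirmed here.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Illustrative recursive-descent evaluator; not an HBase class.
final class VisibilitySketch {
    private final String expr;
    private final Set<String> labels;
    private int pos = 0;

    private VisibilitySketch(String expr, Set<String> labels) {
        this.expr = expr.replaceAll("\\s+", ""); // whitespace is ignored in this sketch
        this.labels = labels;
    }

    // Returns true if a user holding userLabels satisfies the expression.
    static boolean isVisible(String expression, String... userLabels) {
        VisibilitySketch p =
            new VisibilitySketch(expression, new HashSet<>(Arrays.asList(userLabels)));
        boolean result = p.parseOr();
        if (p.pos != p.expr.length()) {
            throw new IllegalArgumentException("Unexpected input at " + p.pos);
        }
        return result;
    }

    // or := and ('|' and)*
    private boolean parseOr() {
        boolean value = parseAnd();
        while (peek() == '|') {
            pos++;
            value |= parseAnd(); // non-short-circuit, so the right side is always parsed
        }
        return value;
    }

    // and := unary ('&' unary)*   (assumed precedence: '&' above '|')
    private boolean parseAnd() {
        boolean value = parseUnary();
        while (peek() == '&') {
            pos++;
            value &= parseUnary();
        }
        return value;
    }

    // unary := '!' unary | '(' or ')' | LABEL
    private boolean parseUnary() {
        if (peek() == '!') {
            pos++;
            return !parseUnary();
        }
        if (peek() == '(') {
            pos++;
            boolean value = parseOr();
            if (peek() != ')') {
                throw new IllegalArgumentException("Missing ')' at " + pos);
            }
            pos++;
            return value;
        }
        int start = pos;
        while (pos < expr.length() && Character.isLetterOrDigit(expr.charAt(pos))) {
            pos++;
        }
        if (start == pos) {
            throw new IllegalArgumentException("Expected label at " + pos);
        }
        return labels.contains(expr.substring(start, pos));
    }

    private char peek() {
        return pos < expr.length() ? expr.charAt(pos) : '\0';
    }
}
```

With this sketch, VisibilitySketch.isVisible("( secret | topsecret ) & !probationary", "secret") evaluates to true, while the same call with the labels "secret" and "probationary", or with only "confidential", evaluates to false.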

We build the user’s label set in the RPC context when a request is first received by the HBase RegionServer. How users are associated with labels is pluggable. The default plugin passes through labels specified in Authorizations added to the Get or Scan. This will also be familiar to Accumulo users. 

Get#setAuthorizations(new Authorizations(String,…));

Scan#setAuthorizations(new Authorizations(String,…));

Authorizations not in the maximal set of labels granted to the user are dropped. From this point, visibility expression processing is very fast, using set operations.
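The dropping step amounts to a set intersection. A minimal sketch, assuming labels are plain strings; the class and method names are hypothetical and this is not the plugin's actual code:

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Illustrative only; not an HBase class.
final class EffectiveLabelsSketch {
    // Keeps only the requested authorizations that are also in the
    // maximal set of labels granted to the user; the rest are dropped.
    static Set<String> effectiveLabels(Set<String> grantedToUser, List<String> requested) {
        Set<String> effective = new LinkedHashSet<>(requested);
        effective.retainAll(grantedToUser);
        return effective;
    }
}
```

For a user granted { confidential, secret } who requests { secret, topsecret }, the effective label set in this sketch is { secret }.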

In the future we envision additional plugins which may interrogate an external source when building the effective label set for a user, for example LDAP or Active Directory. Consider our earlier example. Perhaps the sensitivity classifications are attached when cells are stored into HBase, but the probationary label, determined by the user’s employment status, is provided by an external authorization service.

HBase Cell ACLs

We have extended the existing HBase ACL model to the cell level.

When storing or mutating a cell, the HBase user can now add ACLs, using a backwards compatible extension to the HBase API.

Mutation#setACL(String user, Permission perms);

As at the table or column family level, a subject is granted permissions to the cell. Any number of permissions for any number of users (or groups using @group notation) can be added.

From then on, access checks for operations on the cell include the permissions recorded in the cell’s ACL tag, using union-of-permission semantics.

First we check table or column family level permissions*. If they grant access, we can early out before going to blockcache or disk to check the cell for ACLs. Table designers and security architects can therefore optimize for the common case by granting users permissions at the table or column family level. However, if some cells do require more fine grained control, and neither the table nor the column family check succeeds, we will enumerate the cells covered by the operation. By “covered”, we mean that every location which would be visibly modified by the operation must have a cell ACL in place that grants access. We can stop at the first cell ACL that does not grant access.
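The default ordering described above can be sketched as follows. This is a simplified model in plain Java with hypothetical names: each covered cell is reduced to the set of users its ACL grants the requested permission to, and denying when no cell is covered is an assumption of the sketch, not something stated here.

```java
import java.util.List;
import java.util.Set;

// Illustrative only; not the AccessController's code.
final class CellAclCheckSketch {
    // grantedAtTableOrFamily: result of the table / column family check.
    // coveredCellAcls: for each covered cell, the users its ACL grants access to.
    static boolean authorize(String user,
                             boolean grantedAtTableOrFamily,
                             List<Set<String>> coveredCellAcls) {
        if (grantedAtTableOrFamily) {
            return true; // early out: no need to read any cell
        }
        if (coveredCellAcls.isEmpty()) {
            return false; // assumption: with nothing covered, nothing grants access
        }
        for (Set<String> acl : coveredCellAcls) {
            if (!acl.contains(user)) {
                return false; // stop at the first cell ACL that does not grant access
            }
        }
        return true; // every covered cell grants access
    }
}
```

The early return on the first branch is what makes broad table or column family grants the cheap path: the per-cell loop only runs for users who lack them.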

For a Put, Increment, or Append we check the permissions for the most recent visible version. “Visible” means not covered by a delete tombstone. We treat the ACLs in each Put as timestamped like any other HBase value. A new ACL in a new Put applies to that Put. It doesn't change the ACL of any previous Put. This allows simple evolution of security policy over time without requiring expensive updates. To change the ACL at a specific { row, column, qualifier, timestamp } coordinate, a new value with a new ACL must be stored to that location exactly.

For Increments and Appends, we do the same thing as with Puts, except we will propagate any ACLs on the previous value unless the operation carries a new ACL explicitly.

Finally, for a Delete, we require write permissions on all cells covered by the delete. Unlike in the case of other mutations we need to check all visible prior versions, because a major compaction could remove them. If the user doesn't have permission to overwrite any of the visible versions ("visible", again, is defined as not covered by a tombstone already) then we have to disallow the operation.
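The difference between the two checks, most recent visible version for a Put, Increment, or Append versus all visible versions for a Delete, can be sketched like this. Version is a hypothetical type invented for the example, versions are assumed to be supplied newest first, and refusing when no visible version exists is an assumption of the sketch.

```java
import java.util.List;
import java.util.Set;

// Illustrative model only; Version is not an HBase type.
final class CoverageSketch {
    static final class Version {
        final long timestamp;
        final boolean tombstoned;   // true if covered by a delete tombstone
        final Set<String> writers;  // users granted write by this version's ACL

        Version(long timestamp, boolean tombstoned, Set<String> writers) {
            this.timestamp = timestamp;
            this.tombstoned = tombstoned;
            this.writers = writers;
        }
    }

    // Put, Increment, Append: only the most recent visible version is checked.
    static boolean mayOverwrite(String user, List<Version> newestFirst) {
        for (Version v : newestFirst) {
            if (!v.tombstoned) {
                return v.writers.contains(user);
            }
        }
        return false; // assumption: no visible version, so no cell ACL grants access
    }

    // Delete: every visible version must grant write, since a major
    // compaction could remove all of them.
    static boolean mayDelete(String user, List<Version> newestFirst) {
        boolean sawVisible = false;
        for (Version v : newestFirst) {
            if (v.tombstoned) {
                continue; // already hidden by a tombstone, not checked
            }
            if (!v.writers.contains(user)) {
                return false;
            }
            sawVisible = true;
        }
        return sawVisible;
    }
}
```

A user who appears only in the newest version's ACL may overwrite the cell in this model but may not delete it if an older visible version does not grant them write.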

* - For the sake of coherent explanation, this overlooks an additional feature. Optionally, on a per-operation basis, how cell ACLs are factored into the authorization decision can be flipped around. Instead of first checking table or column family level permissions, we enumerate the set of ACLs covered by the operation first, and only if there are no grants there we check for table or column family permissions. This is useful for use cases where a user is not granted table or column family permissions on a table and instead the cell level ACLs provide exceptional access. The default is useful for use cases where the user is granted table or column family permissions and cell level ACLs might withhold authorization. The default is likely to perform better. Again, which strategy is used can be specified on a per-operation basis.


This blog post was republished from