by Anastasia Braginsky (HBase Committer), Eshcar Hillel (HBase Committer) and Edward Bortnikov (Contributor) of Yahoo! Research

Modern products powered by HBase place ever-increasing demands on its read and write performance. Ideally, HBase applications would like to enjoy the speed of in-memory databases without giving up the reliable persistent storage guarantees. We introduce a new algorithm in HBase 2.0, named Accordion, which takes a significant step towards this goal.

HBase partitions the data into regions controlled by a cluster of RegionServers. The internal (vertical) scalability of the RegionServer is crucial for end-user performance as well as for overall system utilization. Accordion improves RegionServer scalability via a better use of RAM. It accommodates more data in memory and writes to disk less frequently. This manifests in multiple desirable ways. First, HBase’s disk occupancy and write amplification are reduced. Second, more reads and writes get served from RAM, and fewer are stalled by disk I/O - in other words, HBase’s performance is increased. Traditionally, these different metrics were considered at odds, and tuned at each other’s expense. With Accordion, they all get improved simultaneously.

Accordion is inspired by the Log-Structured-Merge (LSM) tree design pattern that governs the HBase storage organization. An HBase region is stored as a sequence of searchable key-value maps. The topmost is a mutable in-memory store, called MemStore, which absorbs the recent write (put) operations. The rest are immutable HDFS files, called HFiles. Once a MemStore overflows, it is flushed to disk, creating a new HFile. HBase adopts multi-versioned concurrency control, that is, MemStore stores all data modifications as separate versions. Multiple versions of one key may therefore reside in the MemStore and the HFile tier. A read (get) operation, which retrieves the value by key, scans the HFile data via the BlockCache, seeking the latest version. To reduce the number of disk accesses, HFiles are merged in the background. This process, called compaction, removes the redundant cells and creates larger files.

LSM trees deliver superior write performance by transforming random application-level I/O to sequential disk I/O. However, their traditional design makes no attempt to compact the in-memory data. This stems from historical reasons: LSM trees were designed in an age when RAM was a scarce resource, therefore the MemStore capacity was small. With recent changes in the hardware landscape, the overall MemStore capacity managed by a RegionServer can be multiple gigabytes, leaving a lot of headroom for optimization.

Accordion reapplies the LSM principle to MemStore, in order to eliminate redundancies and other overhead while the data is still in RAM. Doing so decreases the frequency of flushes to HDFS, thereby reducing the write amplification and the overall disk footprint. With fewer flushes, write operations are stalled less frequently as the MemStore overflows, therefore write performance is improved. Less data on disk also implies less pressure on the block cache, higher hit rates, and eventually better read response times. Finally, having fewer disk writes also means having less compaction happening in the background, i.e., fewer cycles are stolen from productive (read and write) work. All in all, the effect of in-memory compaction can be envisioned as a catalyst that enables the system to move faster as a whole.

Accordion currently provides two levels of in-memory compaction - basic and eager. The former applies generic optimizations that are good for all data update patterns. The latter is most useful for applications with high data churn, like producer-consumer queues, shopping carts, shared counters, etc. All these use cases feature frequent updates of the same keys, which generate multiple redundant versions that the algorithm takes advantage of to provide more value. On the flip side, eager optimization may incur compute overhead (more memory copies and garbage collection), which may affect response times under intensive write loads. The overhead is high if the MemStore uses on-heap MemStore-Local Allocation Buffer (MSLAB) allocation; this configuration is not advised in conjunction with eager compaction. See more details about Accordion’s compaction algorithms in the next sections.

Future implementations may tune the optimal compaction policy automatically, based on the observed workload.

How To Use

The in-memory compaction level can be configured both globally and per column family. The supported levels are none (legacy implementation), basic, and eager.

By default, all tables apply basic in-memory compaction. This global configuration can be overridden in hbase-site.xml, as follows:

<property>
  <name>hbase.hregion.compacting.memstore.type</name>
  <value><none|basic|eager></value>
</property>

The level can also be configured in the HBase shell per column family, as follows:  

create '<tablename>', {NAME => '<cfname>', IN_MEMORY_COMPACTION => '<NONE|BASIC|EAGER>'}
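The same per-family setting can also be applied programmatically. The following is a minimal sketch assuming the HBase 2.x Java client API; the table name "usertable" and column family "cf" are placeholders for this example.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.MemoryCompactionPolicy;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.util.Bytes;

public class CreateTableWithInMemoryCompaction {
  public static void main(String[] args) throws Exception {
    try (Connection connection = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Admin admin = connection.getAdmin()) {
      // Placeholder table "usertable" with family "cf" using eager in-memory compaction.
      admin.createTable(
          TableDescriptorBuilder.newBuilder(TableName.valueOf("usertable"))
              .setColumnFamily(
                  ColumnFamilyDescriptorBuilder.newBuilder(Bytes.toBytes("cf"))
                      .setInMemoryCompaction(MemoryCompactionPolicy.EAGER)
                      .build())
              .build());
    }
  }
}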

Performance Gains, or Why You Should Care

We stress-tested HBase extensively via the popular Yahoo Cloud Serving Benchmark (YCSB). Our experiments used 100-200 GB datasets, and exercised a variety of representative workloads. The results demonstrate significant performance gains delivered by Accordion.

Heavy-tailed (Zipf) distribution. The first experiment exercises a workload in which the key popularities follow the Zipf distribution, which arises in most real-life scenarios. In this context, when 100% of the operations are writes, Accordion achieves up to 30% reduction of write amplification, 20% increase of write throughput, and 22% reduction of GC. When 50% of the operations are reads, the tail read latency is reduced by 12%.

Uniform distribution. The second experiment exercises a workload in which all keys are equally popular. In this context, under 100% writes, Accordion delivers up to 25% reduction of write amplification, 50% increase of write throughput, and 36% reduction of GC. The tail read latencies are not impacted (which is expected, due to complete lack of locality).

How Accordion Works

High Level Design. Accordion introduces CompactingMemStore - a MemStore implementation that applies compaction internally. In contrast to the default MemStore, which maintains all data in one monolithic data structure, Accordion manages it as a sequence of segments. The youngest segment, called active, is mutable; it absorbs the put operations. Upon overflow (by default, 32MB - 25% of the MemStore size bound), the active segment is moved to an in-memory pipeline, and becomes immutable. We call this an in-memory flush. Get operations scan through these segments and the HFiles (the latter are accessed via the block cache, as usual in HBase).
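For readers who prefer code, here is a deliberately simplified sketch of the flow just described - a mutable active segment, an in-memory flush on overflow, and gets that scan the active segment and the pipeline. The class, field, and threshold names are invented for this illustration; this is not the actual HBase implementation.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentSkipListMap;

// Toy compacting memstore: a mutable active segment plus an in-memory
// pipeline of immutable segments. Illustrative only.
class ToyCompactingMemStore {
  private static final long IN_MEMORY_FLUSH_THRESHOLD = 32L * 1024 * 1024; // ~32MB, mirroring the default bound

  private ConcurrentSkipListMap<String, String> active = new ConcurrentSkipListMap<>();
  private final List<ConcurrentSkipListMap<String, String>> pipeline = new ArrayList<>();
  private long activeSize = 0;

  synchronized void put(String key, String value) {
    active.put(key, value);
    activeSize += key.length() + value.length();
    if (activeSize >= IN_MEMORY_FLUSH_THRESHOLD) {
      inMemoryFlush();
    }
  }

  // "In-memory flush": the active segment becomes immutable and joins the pipeline;
  // a background merge of pipeline segments would be scheduled at this point.
  private void inMemoryFlush() {
    pipeline.add(0, active); // youngest segment first
    active = new ConcurrentSkipListMap<>();
    activeSize = 0;
  }

  // A get scans the active segment, then the pipeline (youngest to oldest),
  // before falling back to HFiles (omitted here).
  synchronized String get(String key) {
    String v = active.get(key);
    if (v != null) return v;
    for (ConcurrentSkipListMap<String, String> segment : pipeline) {
      v = segment.get(key);
      if (v != null) return v;
    }
    return null;
  }
}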

CompactingMemStore may merge multiple immutable segments in the background from time to time, creating larger and leaner segments. The pipeline is therefore “breathing” (expanding and contracting), similar to accordion bellows.

When the RegionServer decides to flush one or more MemStores to disk to free up memory, it considers CompactingMemStores after the other MemStores that have overflown. The rationale is to prolong the lifetime of MemStores that manage their memory efficiently, in order to reduce the overall I/O. When such a flush does happen, all pipeline segments are moved to a composite snapshot, merged, and streamed to a new HFile.

Figure 1 illustrates the structure of CompactingMemStore versus the traditional design.

Figure 1. CompactingMemStore vs DefaultMemStore

Segment Structure. Similar to the default MemStore, CompactingMemStore maintains an index on top of the cell storage, to allow fast search by key. Traditionally, this index was implemented as a Java skip list (ConcurrentSkipListMap) - a dynamic but wasteful data structure that manages a lot of small objects. CompactingMemStore uses a space-efficient flat layout for immutable segment indexes. This universal optimization helps all compaction policies reduce the RAM overhead, even when the data has little to no redundancy. Once a segment is added to the pipeline, the store serializes its index into a sorted array named CellArrayMap that is amenable to fast binary search.
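The following toy sketch captures the idea of a flat immutable index: a sorted array of cell references searched with binary search, instead of a skip list of many small node objects. The names are illustrative and stand in for the real Cell and CellArrayMap types.

import java.util.Arrays;
import java.util.Comparator;

// Illustrative flat segment index in the spirit of CellArrayMap.
class ToyFlatIndex {
  static final class CellRef {
    final String key;   // stands in for the row/family/qualifier/timestamp key
    final Object cell;  // reference to the cell data (heap object or MSLAB chunk)
    CellRef(String key, Object cell) { this.key = key; this.cell = cell; }
  }

  private final CellRef[] cells; // immutable, sorted by key

  ToyFlatIndex(CellRef[] sortedCells) {
    this.cells = sortedCells;
  }

  // Binary search over the flat array - no per-entry node objects as in a skip list.
  Object find(String key) {
    int pos = Arrays.binarySearch(cells, new CellRef(key, null),
        Comparator.comparing((CellRef c) -> c.key));
    return pos >= 0 ? cells[pos].cell : null;
  }
}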

CellArrayMap supports both direct allocation of cells from the Java heap and custom allocation from MSLABs - either on-heap or off-heap. The implementation differences are abstracted away via the helper KeyValue objects that are referenced from the index (Figure 2). CellArrayMap itself is always allocated on-heap.

Figure 2. Immutable segment with a flat CellArrayMap index and MSLAB cell storage.

Compaction Algorithms. The in-memory compaction algorithm maintains a single flat index on top of the pipelined segments. This saves space, especially when the data items are small, and therefore pushes the disk flush further off in time. A single index allows searching in one place, therefore bounding the tail read latency.

When the active segment is flushed to memory, it is queued to the compaction pipeline, and a background merge task is immediately scheduled. The latter simultaneously scans all the segments in the pipeline (similarly to on-disk compaction) and merges their indexes into one. The differences between the basic and eager compaction policies manifest in how they handle the cell data. Basic compaction does not eliminate the redundant data versions in order to avoid physical data copying; it just rearranges the references to the KeyValue objects. Eager compaction, on the contrary, filters out the duplicates. This comes at the cost of extra compute and data migration - for example, with MSLAB storage the surviving cells are copied to the newly created MSLAB(s). The compaction overhead pays off when the data is highly redundant.
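As a rough illustration of the difference between the two policies, the sketch below merges several sorted segment indexes into one. It is heavily simplified relative to the real segment merge: plain strings stand in for full cell keys, and versioning is reduced to "the youngest segment wins".

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Simplified in-memory merge of pipeline segments (illustrative, not HBase code).
// Segments are ordered youngest-first; each maps a key to its cell reference.
class ToySegmentMerge {

  // Basic: flatten all entries into one sorted index, keeping every version.
  static List<Map.Entry<String, Object>> basicMerge(List<NavigableMap<String, Object>> segments) {
    List<Map.Entry<String, Object>> merged = new ArrayList<>();
    for (NavigableMap<String, Object> segment : segments) {
      merged.addAll(segment.entrySet());
    }
    merged.sort(Map.Entry.comparingByKey());
    return merged; // old versions survive; only references are rearranged, no cell data is copied
  }

  // Eager: keep only the newest version of each key (youngest segment wins).
  static NavigableMap<String, Object> eagerMerge(List<NavigableMap<String, Object>> segments) {
    NavigableMap<String, Object> merged = new TreeMap<>();
    for (NavigableMap<String, Object> segment : segments) {
      for (Map.Entry<String, Object> e : segment.entrySet()) {
        merged.putIfAbsent(e.getKey(), e.getValue()); // younger segments take precedence
      }
    }
    return merged; // redundant versions dropped; with MSLAB storage, survivors would be copied
  }
}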

Future implementations of compaction may automate the choice between the basic and eager compaction policies. For example, the algorithm might try eager compaction once in a while, and schedule the next compaction based on the value delivered (i.e., the fraction of data eliminated). Such an approach could relieve the system administrator from deciding a priori, and adapt to changing access patterns.

Summary

In this blog post, we covered Accordion’s basic principles, configuration, performance gains, and some details of the in-memory compaction algorithms. The next post will focus on system internals for HBase developers.

We thank Michael Stack, Anoop Sam John and Ramkrishna Vasudevan for their continuous support that made this project happen.