by Anastasia Braginsky (HBase Committer), Eshcar Hillel (HBase Committer) and Edward Bortnikov (Contributor) of Yahoo! Research

In-memory compaction (the Accordion project) demonstrated a sizable reduction in HBase’s write amplification and a sizable improvement in its read/write performance. In this post, we describe the design behind Accordion’s algorithms, and how they fit within the HBase internals.

What’s New

Accordion affects the regionserver package. Its centerpiece component is the CompactingMemStore class, which inherits from AbstractMemStore and is a sibling of DefaultMemStore. In contrast with DefaultMemStore, which maintains a monolithic dynamic (mutable) index to cell storage, CompactingMemStore manages multiple indexes, ordered by creation time. The youngest index is mutable, whereas the rest are immutable.

Cell indexes are implemented as descendants of the CellSet class, which provides basic NavigableMap access to cells. In addition to the traditional mutable index based on ConcurrentSkipListMap, Accordion introduces an immutable CellArrayMap index - a space-efficient ordered array that is searched with binary search. CellArrayMap is allocated on heap.
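To make the idea concrete, here is a minimal sketch of a flat, immutable, binary-searched index in the spirit of CellArrayMap. FlatIndex and its members are illustrative names, not the actual HBase code:

    import java.util.Arrays;
    import java.util.Comparator;

    // Minimal sketch of a flat immutable index in the spirit of CellArrayMap.
    final class FlatIndex<C> {
        private final C[] cells;           // sorted at construction, never mutated
        private final Comparator<C> cmp;

        FlatIndex(C[] sortedCells, Comparator<C> cmp) {
            this.cells = sortedCells;      // assumed to be sorted by cmp
            this.cmp = cmp;
        }

        // Binary search replaces the skip-list traversal of the mutable index.
        C floor(C key) {
            int pos = Arrays.binarySearch(cells, key, cmp);
            if (pos >= 0) return cells[pos];       // exact match
            int insertion = -pos - 1;              // index of the first element > key
            return insertion == 0 ? null : cells[insertion - 1];
        }
    }

Unlike a skip list, such an array carries no per-node objects and no link pointers, which is where the space efficiency comes from.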

Accordion introduces the Segment abstraction, which encapsulates the combination of a CellSet and its associated metadata (time range tracker, MSLAB reference, size counters, etc.). Previously, these (gory) details were managed directly by the MemStore. The abstract Segment class manages a single CellSet and its metadata. It has two subclasses: MutableSegment and ImmutableSegment. The latter can either manage an immutable CellSet, or provide a read-only wrapper to a mutable CellSet. The CompositeImmutableSegment class extends ImmutableSegment; it provides a similar API for a fixed set of segments.
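The skeleton below summarizes the hierarchy; the signatures are heavily simplified and are not the actual HBase declarations:

    // Simplified skeleton of the Segment hierarchy (illustrative, not HBase code).
    class CellSet { /* NavigableMap-style access to cells */ }

    abstract class Segment {
        protected CellSet cellSet;  // the index, plus metadata: time range tracker,
                                    // MSLAB reference, size counters, etc.
    }

    class MutableSegment extends Segment {
        // wraps a ConcurrentSkipListMap-backed CellSet
    }

    class ImmutableSegment extends Segment {
        // a flat CellArrayMap index, or a read-only wrapper of a mutable CellSet
    }

    class CompositeImmutableSegment extends ImmutableSegment {
        // the same API over a fixed set of segments
    }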

Segments are scannable. The traversal is provided by the SegmentScanner class, which implements the KeyValueScanner interface. SegmentScanner exploits the NavigableMap API implemented by the CellSet encapsulated by the segment.
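As a rough illustration of why this is straightforward, a scanner over a NavigableMap-style index reduces to little more than a tailMap iterator. The MapScanner class below is hypothetical; the real SegmentScanner implements the richer KeyValueScanner contract (seek, reseek, peek, next, and so on):

    import java.util.Iterator;
    import java.util.Map;
    import java.util.NavigableMap;

    // Illustrative scanner over a NavigableMap-backed index (not HBase code).
    final class MapScanner {
        private final Iterator<Map.Entry<String, String>> it;
        private Map.Entry<String, String> current;

        MapScanner(NavigableMap<String, String> index, String startKey) {
            // "seek" maps directly onto tailMap: all entries >= startKey, in key order
            it = index.tailMap(startKey, true).entrySet().iterator();
            current = it.hasNext() ? it.next() : null;
        }

        Map.Entry<String, String> peek() { return current; }

        Map.Entry<String, String> next() {
            Map.Entry<String, String> res = current;
            current = it.hasNext() ? it.next() : null;
            return res;
        }
    }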

CompactingMemStore manages one MutableSegment (in what follows, active) and multiple ImmutableSegments. It supports the top-level scan mechanism via a list of SegmentScanners, each referring to one segment. In this context, the MemStoreScanner class became deprecated and was eliminated in HBase 2.0.

Figure 1 depicts the Segment and cell index (NavigableMap) class hierarchies.


Figure 1. Segment and cell index (NavigableMap) class hierarchies.

Immutable segments are created upon in-memory flush. Following this, they travel through an interim pipeline (the CompactionPipeline class) to the snapshot buffer, from which they are flushed to disk and finally released. The pipeline is accessed in parallel by multiple tasks; in what follows, we discuss how its thread-safety and correctness are guaranteed. The snapshot is simpler because its content never changes; it is implemented as a CompositeImmutableSegment.

In-memory flushes trigger in-memory compactions. The latter replace one or more segments in the pipeline with semantically equivalent but more memory-efficient representations. The MemStoreCompactor class is an algorithmic tool that implements the in-memory compaction policies. It uses the MemStoreSegmentsIterator helper class to traverse the segments. Figure 2 depicts the classes that implement in-memory compaction.


Figure 2. Classes that implement in-memory compaction.

The StoreScanner class implements a consistent scan mechanism for HRegion. It maintains a heap of KeyValueScanners to merge the MemStore data with the on-disk HFile data. CompactingMemStore returns a subset of these scanners (a list of SegmentScanner instances) covering all its segments.

MemStoreCompactor exploits the same mechanism, via the MemStoreSegmentsIterator helper; it only iterates through immutable segments. Figure 3 depicts the classes involved in in-memory compaction.


Figure 3. Classes involved in in-memory compaction.

Managing the Compacting Memstore State

MemStores in HBase run processing tasks, such as flushing data from RAM to disk, concurrently with serving normal read and write requests. In CompactingMemStore, there are more concurrent scenarios, with in-memory flushes and compactions introducing additional complexity. Here, the pipeline is the most complex component, since it is accessed by multiple tasks in parallel.

Our guiding principles are:

  1. Correctness. Data retrieval semantics are preserved - in particular, data is never lost.

  2. Performance. Flushes and compactions are infrequent and happen in the background; they must not slow down the datapath operations, namely scans.

Let us take a quick look at how these principles manifest in the CompactionPipeline design.

Data Structures and Synchronization

The pipeline contains a double-ended queue of ImmutableSegments ordered by segment creation time. It is accessed by scans (read) as well as by flushes and compactions (update). Since the segments are immutable, it is sufficient to provide the reader with a clone of the queue. One way to go would be to clone the queue upon each scan, under the protection of a reentrant shared lock. We chose a more efficient copy-on-write approach. Namely, only the update operations synchronize on the pipeline. Each update creates a new read-only copy of the queue and publishes it through a volatile reference. The subsequent scans retrieve their clone lock-free. Note that if some segments are removed from the queue by in-memory compaction or disk flush in parallel with an ongoing scan, correctness is not affected, because the data does not disappear. Rather, it may be referenced from multiple locations (for instance, both the pipeline and the snapshot). The scan algorithm filters out the duplicates.
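The sketch below captures this copy-on-write discipline; the names are illustrative, and the real CompactionPipeline code is more involved:

    import java.util.ArrayList;
    import java.util.List;

    // Minimal sketch of a copy-on-write segment queue (illustrative, not HBase code).
    final class CowPipeline<S> {
        // Readers dereference this volatile field lock-free.
        private volatile List<S> readOnlyCopy = new ArrayList<>();

        // Updates (in-memory flush, compaction, disk flush) are serialized.
        synchronized void pushHead(S segment) {
            List<S> next = new ArrayList<>(readOnlyCopy.size() + 1);
            next.add(segment);           // youngest segment first
            next.addAll(readOnlyCopy);
            readOnlyCopy = next;         // a single volatile write publishes the new queue
        }

        // Scans obtain a consistent snapshot without taking any lock.
        List<S> getSegments() {
            return readOnlyCopy;         // safe: a published list is never mutated
        }
    }

Since the published list is replaced rather than mutated, a scan can never observe a half-updated queue.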

In-memory compaction swaps one or more segments in the queue with new (compacted) segments. Similarly to scan, it is a long-running operation that should not disrupt the concurrent datapath operations. In order to achieve this, we implemented compaction in a non-blocking way. CompactionPipeline maintains a version that is promoted each time the queue tail is modified. When a compaction starts, it records this version. Upon completion, it atomically checks whether the version has changed in the meantime, and swaps in the compacted segments only if it has not. This opportunistic approach succeeds in most cases. Since in-memory compaction is an optimization, it is fine for it to fail on rare occasions. The version counter (a long) is volatile - that is, changes to it are atomic and immediately observable.
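The following sketch captures the optimistic version check; again, the names are illustrative. The segment list is ordered youngest-first, and the compactor replaces a suffix of old segments with the compacted result:

    import java.util.ArrayList;
    import java.util.List;

    // Sketch of an optimistic, version-checked swap (illustrative, not HBase code).
    final class VersionedPipeline<S> {
        private volatile long version = 0;   // promoted on every change to the queue tail
        private volatile List<S> segments = new ArrayList<>();  // youngest first

        long snapshotVersion() { return version; }
        List<S> snapshotSegments() { return segments; }

        // Called by the compactor after the compacted segment was built off-line.
        synchronized boolean swapIfUnchanged(long expectedVersion,
                                             List<S> oldSuffix, S compacted) {
            if (version != expectedVersion) {
                return false;                // pipeline changed meanwhile; discard the result
            }
            List<S> next = new ArrayList<>(segments);
            next.removeAll(oldSuffix);       // drop the segments that were compacted away
            next.add(compacted);             // the compacted segment becomes the new tail
            segments = next;
            version++;
            return true;
        }
    }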

Detailed Scenarios

Scan Operation (in particular, Get). The SegmentScanners are created (non-atomically) in the order of data movement between the MemStore segments, to preserve correctness. For example, in the course of scanner set creation a segment can move from active to the pipeline, in which case it is referenced by two scanners - however, no data is lost. The merge algorithm eliminates the redundant results that stem from the overlap.

In-Memory Flush (happens when active overflows). A dedicated worker (1) blocks updates for the region (via RegionServicesForStores), (2) creates a new ImmutableSegment that wraps active, (3) atomically inserts it into pipeline, (4) creates a new MutableSegment and flips the active reference to it, (5) unblocks the updates, and (6) calls MemStoreCompactor.
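Put together, an in-memory flush is a handful of reference manipulations inside a short update-blocking window. The self-contained sketch below uses a ReentrantReadWriteLock as a stand-in for the region-level update blocking; all names are illustrative:

    import java.util.concurrent.ConcurrentLinkedDeque;
    import java.util.concurrent.locks.ReentrantReadWriteLock;

    // Illustrative sketch of the in-memory flush sequence (not HBase code).
    final class InMemoryFlushSketch {
        static final class Segment {
            final boolean mutable;
            Segment(boolean mutable) { this.mutable = mutable; }
        }

        // Region updates would acquire the read lock; the flush takes the write lock.
        private final ReentrantReadWriteLock updatesLock = new ReentrantReadWriteLock();
        private final ConcurrentLinkedDeque<Segment> pipeline = new ConcurrentLinkedDeque<>();
        private volatile Segment active = new Segment(true);

        void inMemoryFlush() {
            updatesLock.writeLock().lock();           // (1) block updates for the region
            try {
                Segment frozen = new Segment(false);  // (2) wrap active as immutable
                pipeline.addFirst(frozen);            // (3) insert it into the pipeline
                active = new Segment(true);           // (4) flip the active reference
            } finally {
                updatesLock.writeLock().unlock();     // (5) unblock the updates
            }
            startCompaction();                        // (6) hand off to the compactor
        }

        private void startCompaction() { /* MemStoreCompactor would run here */ }
    }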

Disk Flush (happens when the region overflows and decides to free up space in RAM). A dedicated worker (1) forces an in-memory flush (to guarantee there is at least one segment in the pipeline), (2) creates a new CompositeImmutableSegment from all segments in the read-only clone of the pipeline and flips the snapshot reference to it, (3) atomically removes the references to the snapshotted segments from CompactionPipeline, and (4) scans the snapshot (a merge across multiple segments) and flushes the results to disk.

In-Memory Compaction (triggered by in-memory flush, except in the disk flush case). The compactor (1) retrieves a versioned copy of the pipeline, (2) builds a new (compacted) ImmutableSegment, and (3) atomically, if the version has not changed, swaps one or more segments in the pipeline with the new segment (the swap target depends on the compaction policy, see below).

Note that all the atomic sections are extremely lightweight. They only manipulate a few references, and avoid any computation and copying.

In-Memory Compaction Policies

MemStoreCompactor provides two compaction policies: BASIC and EAGER.

The BASIC policy is a low-cost/low-overhead alternative that merges the indexes of all segments in the pipeline into a single flat index. It does not eliminate redundancies, in order to avoid copying cell data. Namely, once the number of segments in the pipeline exceeds N, the algorithm scans the CellSets of the N+1 youngest segments in the pipeline, and copies the KeyValue references to a new CellArrayMap. The scan retrieves all the KeyValues in the original CellSets ordered by key and version (a non-SQM matcher, i.e., no ScanQueryMatcher is involved).
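At its core, the BASIC merge is a k-way merge over the per-segment iterators that copies only references, never cell data. A compact sketch follows; Merge and Head are illustrative names, and the input lists stand in for the per-segment CellSets:

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.Iterator;
    import java.util.List;
    import java.util.PriorityQueue;

    // Sketch of BASIC-style flattening: a k-way merge that keeps duplicates.
    final class Merge {
        private static final class Head<C> {
            final C value;
            final Iterator<C> rest;
            Head(C value, Iterator<C> rest) { this.value = value; this.rest = rest; }
        }

        static <C> List<C> flatten(List<List<C>> sortedIndexes, Comparator<C> cmp) {
            PriorityQueue<Head<C>> heap =
                new PriorityQueue<>((a, b) -> cmp.compare(a.value, b.value));
            for (List<C> index : sortedIndexes) {
                Iterator<C> it = index.iterator();
                if (it.hasNext()) heap.add(new Head<>(it.next(), it));
            }
            List<C> flat = new ArrayList<>();
            while (!heap.isEmpty()) {
                Head<C> h = heap.poll();
                flat.add(h.value);           // references only; duplicate versions survive
                if (h.rest.hasNext()) heap.add(new Head<>(h.rest.next(), h.rest));
            }
            return flat;                     // backing array for a CellArrayMap-style index
        }
    }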

The EAGER policy is a high-cost/high-reward alternative that both flattens the index and eliminates redundancies across segments. It scans all the segments in the pipeline, and merges them into one segment encapsulating a new CellArrayMap index. Redundant data versions are eliminated in the course of the scan (via the SQM matcher). If the MemStore uses MSLAB cell storage, the data is copied to new (compact) MSLABs under the new index. This policy trades extra data copying and GC overhead for maximal memory efficiency.
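On top of the flattening sketch above, EAGER-style redundancy elimination amounts to keeping only the newest version of each key. The sketch below assumes the flat list is sorted by key with the newest version first; the real policy drives HBase’s ScanQueryMatcher during the merge itself:

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;

    // Sketch of EAGER-style redundancy elimination (illustrative, not HBase code).
    final class EagerMerge {
        // sameKey compares only the row/column coordinates, ignoring the version.
        static <C> List<C> dedupe(List<C> flatNewestFirst, Comparator<C> sameKey) {
            List<C> result = new ArrayList<>();
            C last = null;
            for (C c : flatNewestFirst) {
                if (last == null || sameKey.compare(last, c) != 0) {
                    result.add(c);           // newest version of this key: keep it
                    last = c;
                }                            // older versions: eliminated
            }
            return result;
        }
    }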

Disk Flush Policy and WAL Truncation

HBase 2.0 introduces the notion of sloppy MemStores - that is, MemStore implementations that dynamically expand and contract their RAM footprint over time. CompactingMemStore is currently the only sloppy MemStore implementation. When a region triggers a flush to disk to free up memory, sloppy stores are the last candidates for flush. The rationale is that they manage their memory more efficiently than DefaultMemStore, and therefore should be prioritized for remaining in RAM.

Disk flushes trigger WAL truncation (archiving), as the WAL entries corresponding to persisted data versions become obsolete. The region maintains an estimate of the lower bound (minimum sequence id) of non-flushed data among all its stores; the log entries below this bound can be safely removed. Prior to Accordion, this maintenance was simple. Since DefaultMemStore dumps the whole in-memory content to disk, the store-level minimum sequence id was reset when a flush was scheduled, and re-installed by the first put operation to occur after the flush.

Since sloppy stores can flush in-memory data to disk partially (for example, CompactingMemStore can flush any suffix of the CompactionPipeline), maintaining the minimum sequence id becomes more subtle, to avoid data loss. Namely, every segment maintains its own minimum sequence id, and the CompactingMemStore lower bound is therefore the minimum among all segments. Note that this is just a conservative estimate. For example, an eager in-memory compaction that happens concurrently with a disk flush might eliminate redundant cells and thereby lift the lower bound. However, the estimate is safe because the value can only grow monotonically over time. It can be safely computed at any time; no atomicity is required while retrieving the segment lower bounds.
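Computing the lower bound thus reduces to a lock-free minimum over the per-segment minima, as in the following sketch (illustrative names):

    import java.util.List;

    // The store-level lower bound is the minimum over the per-segment minima.
    // No atomicity is needed: segments only disappear from the store over time,
    // so the minimum can only grow, and the result is always a safe bound.
    final class SeqIdLowerBound {
        static long lowerBound(List<Long> segmentMinSeqIds) {
            long min = Long.MAX_VALUE;       // empty store: nothing holds the WAL back
            for (long id : segmentMinSeqIds) {
                min = Math.min(min, id);
            }
            return min;
        }
    }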

If the WAL grows too big despite the truncation efforts, the periodic LogRoller process kicks in and forces a full flush to disk. This generic mechanism guarantees that recovery after a crash does not need to replay the entire history, and also trims the WAL. In other words, however efficient, in-memory compaction does not eliminate disk flushes entirely - rather, it pushes them further into the future. Note that when EAGER compaction is adopted, periodic flushing is even more important, because the WAL stores all the data redundancies that are eliminated by the compaction algorithm.

Summary

In this blog post, we covered Accordion’s internals - the new classes, relationships, and execution flows. We also zoomed in on the synchronization scheme that guarantees thread-safety, and shed light on the compaction policy implementations.

We thank Michael Stack, Anoop Sam John and Ramkrishna Vasudevan for their continuous support that made this project happen.