Offheaping the Read Path in Apache HBase: Part 1 of 2
by HBase Committers Anoop Sam John, Ramkrishna S Vasudevan, and Michael Stack
Part one of a two-part blog.
Introduction
Caching more data in memory can greatly improve read latency (and throughput). Apache HBase has two layers of data caching. There is what we call “L1” caching, our first caching tier -- which caches data in an on heap LRU cache -- and then there is an optional “L2” second cache tier (aka BucketCache), which can be configured to be on heap or off heap.
BucketCache can be configured to operate in one of three different modes: heap, offheap, and file. Regardless of operating mode, the BucketCache manages areas of memory called “buckets” in which it holds cached HBase HFile blocks. Each bucket is created with a target block size. The heap-based implementation creates these buckets on the JVM heap; the offheap implementation uses DirectByteBuffers to host buckets outside of the JVM heap; file mode expects a path to a file on the filesystem wherein the buckets are created. The file mode is intended for use with a low-latency backing store -- an in-memory filesystem, or perhaps a file sitting on SSD storage. BucketCache uses frequency of block access to inform utilization, just like HBase’s default LruBlockCache, and has the same single-access, multi-access, and in-memory breakdown of 25%, 50%, 25%. Also like the default cache, block eviction is managed using an LRU algorithm.
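For reference, the BucketCache mode is chosen via configuration. Below is a minimal sketch of an hbase-site.xml fragment enabling an off heap BucketCache; the values are illustrative only -- consult the HBase reference guide for your version:

```xml
<!-- Illustrative only: enable an off heap BucketCache. -->
<property>
  <name>hbase.bucketcache.ioengine</name>
  <!-- One of: heap, offheap, or file:PATH_TO_FILE -->
  <value>offheap</value>
</property>
<property>
  <name>hbase.bucketcache.size</name>
  <!-- Capacity of the cache; a value above 1.0 is read as megabytes. -->
  <value>4096</value>
</property>
```

When running offheap, the JVM must also be granted enough direct memory (e.g. via -XX:MaxDirectMemorySize) to hold the cache.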
The on heap cache is limited by the maximum heap memory for the java process. If we give too much heap memory over to cache, we may starve other users of memory, bringing on more GC. At extremes we could provoke Full GC pauses and so on, up to OutOfMemoryErrors causing RegionServer restarts and attendant cluster turbulence. Keeping the bulk of the cache off heap takes pressure off the GC.
As stated above, we already have the “L2” BucketCache which can be configured to run off heap. But one downside of the original implementation was that when we read rows from the BucketCache, we had to copy the block which contained the row(s) from the off heap buckets into the on heap area. This was due to the assumption, riddled throughout the code base, that the fundamental KeyValue HBase Type was always backed by an on heap byte array. An HFile block has one or more KeyValues serialized into it. The default size for an HFile block is 64 KB (configurable, of course). An HFile block is what HBase reads from HDFS on each seek to pull data into cache, regardless of the actual KeyValue size (if a KeyValue is > 64 KB in size, that becomes the size of the HFile block we fish out of HDFS). 64 KB then is the increment in which we cache. So reading even one KeyValue from off heap required the allocation of 64 KB of contiguous on heap memory and then a copy of the bytes from off heap to on heap. This constant churn of bringing the off heap blocks on heap generated lots of garbage and in turn GC (see the GC spiral already noted above).
HBASE-11425 Cell/DBB end-to-end on the read-path changed HBase core so we can serve requests directly from off heap memory. A preparatory project converted much of HBase core to work with a new Cell Interface in place of our fundamental KeyValue type. The Cell Interface has APIs to refer to all the particles of the old KeyValue Type -- the row key, the column family, the column qualifier, etc. -- but the implementation goes unspecified. As part of HBASE-11425, an implementation of Cell was made that could reference off heap memory, and this implementation was plumbed in throughout the HBase read path. The result was substantial savings in heap allocations and data copying as well as improved read performance. Below is detail on the changes made in HBase core. Part two of this blog has some comparisons that show how this work has improved HBase reads.
Selection of data structure for off heap storage
HBASE-11425 is about serving data directly from off heap buckets out of BucketCache without the need for copying. During reads, we parse individual Cell components multiple times decoding lengths and content of various particles (key, value, tag). Cells are frequently compared for proper sorting and ordering so it is important that the overhead added by the new refactor is minimal and that the cost is similar to what the current byte array backed Cell gives us.
Right now the BucketCache has its buckets in the form of NIO ByteBuffers. NIO buffer reads are known to be slower because of buffer boundary checks and so on. Netty’s ByteBuf is another buffering option. It has on heap as well as off heap flavors, just like NIO. Below are some JMH micro benchmark tests comparing Netty ByteBuf and NIO.
- Take an NIO DirectByteBuffer (referred to as DBB in this document) and a Netty direct ByteBuf.
- Write long, int and short values to both of these data structures.
- Have JMH benchmark tests read back these long, int and short values.
Our tests found that reads from the DBB are faster than from Netty’s direct ByteBuf. JMH test results on JDK 1.8.45 are shown below.
@State(Scope.Thread)
public class NettyVsNio {
  static {
    io.netty.util.ResourceLeakDetector.setLevel(Level.DISABLED);
  }

  private ByteBuffer nioDBB = ByteBuffer.allocateDirect(14);
  private ByteBuf nettyDBB = UnpooledByteBufAllocator.DEFAULT.directBuffer(14);

  @Setup
  public void setup() {
    nioDBB.putLong(1234L);
    nioDBB.putInt(100);
    nioDBB.putShort((short) 75);
    nettyDBB.writeLong(1234L);
    nettyDBB.writeInt(100);
    nettyDBB.writeShort(75);
  }

  @Benchmark
  public void nioOffheap(Blackhole blackhole) {
    blackhole.consume(nioDBB.getLong(0));
    blackhole.consume(nioDBB.getInt(8));
    blackhole.consume(nioDBB.getShort(12));
  }

  @Benchmark
  public void nettyOffheap(Blackhole blackhole) {
    blackhole.consume(nettyDBB.getLong(0));
    blackhole.consume(nettyDBB.getInt(8));
    blackhole.consume(nettyDBB.getShort(12));
  }

  public static void main(String[] args) throws RunnerException {
    Options opt = new OptionsBuilder()
        .include(NettyVsNio.class.getSimpleName())
        .warmupIterations(2)
        .measurementIterations(4)
        .forks(1)
        .build();
    new Runner(opt).run();
  }
}

Result "nettyOffheap": 57366360.944 ±(99.9%) 11533933.769 ops/s [Average]
Result "nioOffheap": 60089837.738 ±(99.9%) 14171768.229 ops/s [Average]
The performance of the reads can be improved by using Unsafe based reads, bypassing the DBB boundary check paths (see the Bytes utility class, where we already do such optimizations for byte array reads and writes). We have done similar tests with Unsafe based optimizations on the NIO DBB and the Netty ByteBuf. The performance, as expected, is almost identical, since both follow the same Unsafe API code path for reads. The NIO DBB has the edge here also.
Result "nettyOffheap": 83613659.416 ±(99.9%) 535211.991 ops/s [Average]
Result "nioOffheap": 84514777.734 ±(99.9%) 1199369.976 ops/s [Average]
Based on the above tests, we decided to stick with the existing NIO DBB based buckets in BucketCache.
Cell extension on the server side
As discussed above, when we directly serve Cells from off heap memory, we need a new set of APIs for retrieving Cell components (such as row key, column family, etc.) that are off heap. The Cell getXXXArray APIs -- i.e. getRowArray, getFamilyArray, etc.; see the Cell API -- return a byte array (byte[]). If the data is off heap, we would need to copy the Cell component bytes to a temporary on heap byte array whenever any getXXXArray method is called. Such an on heap allocation and copy would undo any benefit of our offheaping effort. Since our data is off heap in NIO ByteBuffers, we need APIs on the Cell Interface that return NIO ByteBuffers. This way we can avoid copying Cell component bytes on heap and instead directly return references to the off heap data. Corresponding to each of the getXXXArray() APIs we added an equivalent ByteBuffer backed API.
The new buffer based APIs required are:

getRowByteBuffer()
getFamilyByteBuffer()
getQualifierByteBuffer()
getValueByteBuffer()
getTagsByteBuffer()
The off heap backed Cells are used server side only. We don’t want to add the extra APIs to the public Cell interface used heavily by existing HBase clients. Instead we extended the Cell interface with ByteBufferedCell and added these new methods on this private server-side Type.
Along with all getXXXArray() methods, we have getXXXOffset() and getXXXLength() APIs in Cell so that one can refer to each component’s particular byte range in the backing byte array. In Java’s ByteBuffer, the ByteBuffer object itself can carry the offset, position and length info. But we decided against using ByteBuffers’ methods. We would have had to duplicate the backing ByteBuffer on each method invocation and then set each Cell particle offset and length into the new duplicated ByteBuffer. This duplication (extra object creation) would cause a lot of short-lived Java object creation. Also setting position and limit on ByteBuffer involves ByteBuffer limit/capacity checks.
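To make the trade-off concrete, here is a minimal sketch (hypothetical names, not the actual HBase code) contrasting the rejected duplicate-per-access approach with the shared-buffer-plus-position approach that was adopted:

```java
import java.nio.ByteBuffer;

// Hypothetical sketch; names are illustrative, not the HBase implementation.
public class PositionGetterSketch {
    // The design we rejected: every accessor call allocates a duplicate()
    // just to carry position/limit for a single Cell component.
    public static ByteBuffer componentAsDuplicate(ByteBuffer block, int pos, int len) {
        ByteBuffer dup = block.duplicate();  // fresh short-lived object per call
        dup.position(pos);
        dup.limit(pos + len);
        return dup;
    }

    // The design we chose: return the shared buffer untouched and expose the
    // component offset through a primitive getter (cf. getRowPosition()).
    // Callers then use absolute gets, so no per-call object is created and
    // the buffer's own position/limit are never mutated.
    public static byte firstByteOfComponent(ByteBuffer block, int pos) {
        return block.get(pos);  // absolute read; buffer state unchanged
    }
}
```

The second form trades a slightly clumsier calling convention for zero allocation per accessor call, which matters on a path exercised once or more per Cell.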
Instead, we added a new set of getter APIs for the offset to be used with the returned ByteBuffers. These are:

int getRowPosition()
int getFamilyPosition()
int getQualifierPosition()
int getValuePosition()
int getTagsPosition()
When any code uses getXXXByteBuffer(), it has to use the corresponding getXXXPosition() and the existing getXXXLength() to determine the range of component bytes referenced by the returned ByteBuffer. Below is the new ByteBufferedCell class in its entirety:
public abstract class ByteBufferedCell implements Cell {
  abstract ByteBuffer getRowByteBuffer();
  abstract int getRowPosition();
  abstract ByteBuffer getFamilyByteBuffer();
  abstract int getFamilyPosition();
  abstract ByteBuffer getQualifierByteBuffer();
  abstract int getQualifierPosition();
  abstract ByteBuffer getValueByteBuffer();
  abstract int getValuePosition();
  abstract ByteBuffer getTagsByteBuffer();
  abstract int getTagsPosition();
}
It is a little awkward that the Cell interface getXXXOffset() method and the ByteBufferedCell method getXXXPosition() are similar in intent -- getting the data offset or current position in the buffer -- but the former is for use when the Cell is byte array backed and the latter is for when we are returning ByteBuffers. We could have let getXXXOffset() work against the returned ByteBuffer but then we would again need to do slice()/duplicate() on the underlying ByteBuffer for every returned Cell particle ByteBuffer. This again would result in creating a lot of short lived objects. We therefore opted to add a method per return type.
The other thing to note is that the implementation of ByteBufferedCell does not throw an exception when getXXXArray() is used with getXXXOffset() though the Cell might be backed by off heap data: i.e. ByteBuffers. In this case, if invoked, internally we copy the contents of the ByteBuffer to an on heap byte[] and use zero as the returned offset. We did this so we did not have to convert every single location in the HBase code base; tests and example code will continue to work; they will just be making an expensive copy under the covers. This seemed the least obnoxious alternative; any code paths that are doing getXXXArray when they should be doing getXXXByteBuffer will show hot in the profiler and will get converted.
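A minimal sketch of this fallback behavior (a hypothetical stand-in showing only the row component, not the real HBase class): the ByteBuffer accessors are zero-copy, while the legacy byte[] accessor copies on demand and reports offset zero.

```java
import java.nio.ByteBuffer;

// Hypothetical off heap backed Cell, row component only; illustrative names.
public class OffheapRowCellSketch {
    private final ByteBuffer buf;   // possibly a DirectByteBuffer
    private final int rowPos;
    private final short rowLen;

    public OffheapRowCellSketch(ByteBuffer buf, int rowPos, short rowLen) {
        this.buf = buf;
        this.rowPos = rowPos;
        this.rowLen = rowLen;
    }

    // ByteBuffer based accessors: zero-copy references into the cached block.
    public ByteBuffer getRowByteBuffer() { return buf; }
    public int getRowPosition() { return rowPos; }
    public short getRowLength() { return rowLen; }

    // Legacy byte[] accessor: correct but expensive -- copies off heap bytes
    // into a fresh on heap array. Hot callers will show up in a profiler.
    public byte[] getRowArray() {
        byte[] copy = new byte[rowLen];
        for (int i = 0; i < rowLen; i++) {
            copy[i] = buf.get(rowPos + i);
        }
        return copy;
    }

    // The copy starts at index 0 of the fresh array.
    public int getRowOffset() { return 0; }
}
```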
Also note that the ByteBufferedCell extension to the Cell interface is done as an abstract class, not as an interface. We found the former form to be more performant. As discussed above, we have to use the buffer backed APIs instead of the array backed ones when the Cell is an off heap one. The off heap Cell implementation will be an instance of the new ByteBufferedCell extension. So we will need instance checks in key locations to decide which path to use when processing reads. CellComparators, etc., would need to perform many such instance checks as Cells bubble up through the server. A JMH class tested CellComparator#compare(Cell, Cell) when both sides are array backed and then both are ByteBuffer backed. The compare logic is as below.
compareRows(Cell c1, Cell c2) {
  if (c1 instanceof ByteBufferedCell && c2 instanceof ByteBufferedCell) {
    // return compare based on ByteBuffer methods on both
  }
  if (c1 instanceof ByteBufferedCell) {
    // return compare based on ByteBuffer methods on c1 and array methods on c2
  }
  if (c2 instanceof ByteBufferedCell) {
    // return compare based on ByteBuffer methods on c2 and array methods on c1
  }
  // return compare based on array methods on both
}
We ran the test passing different combinations of Cell types.
Benchmark                                   Mode  Cnt         Score         Error  Units
InstanceOfCost.withoutInstanceOf           thrpt   12  42560522.072 ± 1921505.003  ops/s
InstanceOfCost.withInstanceOfBothBBCells   thrpt   12  41832839.485 ± 1945841.856  ops/s
InstanceOfCost.withInstanceOfBothCells     thrpt   12  15500488.062 ±  220373.805  ops/s
“withoutInstanceOf” is the old case where the compare logic doesn’t have any instance check at all. In cases #2 and #3 (withInstanceOfBothBBCells and withInstanceOfBothCells), there are instance checks as above and we pass both ByteBuffer backed cells and both array backed cells respectively. We can see a significant throughput decrease for case #3.
Later, the same test was repeated with the Cell extension as an abstract class rather than as an Interface. Here, for any combination of Cell types, we get similar throughput, almost the same as in the old case (ie. withoutInstanceOf). The difference was because of inlining by the JVM’s C2 JIT compiler, which was not happening when the Interface based Cell extension was in place.
Comparators
We have purged the old KeyValue-based KVComparator and its child classes (refer to HBASE-10800). The compare operations in these classes assumed a KeyValue Type layout for keys (e.g. “ROW:COLUMN_FAMILY:COLUMN_QUALIFIER…” etc.). Instead we have introduced CellComparator. Internally it compares using the getXXXArray() or getXXXByteBuffer() APIs, as appropriate for the Cells being compared.
Unsafe based ByteBuffer reads
The performance of the reads can be improved further. We have provided Unsafe based implementations of these APIs so that we get better performance by avoiding the range checks, etc., done internally by ByteBuffers. This technique is not new in HBase because we already have Unsafe support for byte array manipulations. Comparisons done with ByteBuffers use the Unsafe based compares, which makes them faster.
We have also run JMH micro benchmark tests comparing the throughput of byte array comparison vs ByteBuffer comparison.
Two byte arrays are created with 135 bytes each and both arrays have the same bytes. We compare these two byte arrays using org.apache.hadoop.hbase.util.Bytes#compareTo which does an Unsafe based compare.
Two direct byte buffers (off heap) are created with a capacity of 135 bytes and both have the same content. The compare of these two byte buffers uses a newly created method, org.apache.hadoop.hbase.util.ByteBufferUtils#compare. We use a similar Unsafe based optimization here also.
Results in Java 8:

Result "offheapCompare": 38205893.545 ±(99.9%) 265309.769 ops/s [Average]
  (min, avg, max) = (38164316.169, 38205893.545, 38246801.754), stdev = 41056.980
  CI (99.9%): [37940583.776, 38471203.313] (assumes normal distribution)
Result "onheapCompare": 31166847.740 ±(99.9%) 430242.970 ops/s [Average]
  (min, avg, max) = (31069672.467, 31166847.740, 31214107.795), stdev = 66580.576
  CI (99.9%): [30736604.770, 31597090.710] (assumes normal distribution)
So we can see from the micro benchmark results that Unsafe based comparisons can be performed on both on heap and off heap ByteBuffers, and that they perform better than normal non-Unsafe based comparisons.
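Conceptually, the comparison being benchmarked is a lexicographic, unsigned byte-by-byte compare. The sketch below shows the plain-Java form of those semantics; the real ByteBufferUtils#compare layers an Unsafe fast path on top of the same behavior.

```java
import java.nio.ByteBuffer;

// Plain-Java sketch of a lexicographic ByteBuffer compare; illustrative,
// not the actual ByteBufferUtils#compare implementation.
public class BBCompareSketch {
    public static int compareTo(ByteBuffer a, int aPos, int aLen,
                                ByteBuffer b, int bPos, int bLen) {
        int n = Math.min(aLen, bLen);
        for (int i = 0; i < n; i++) {
            // Bytes compare as unsigned values, matching Bytes#compareTo.
            int x = a.get(aPos + i) & 0xff;
            int y = b.get(bPos + i) & 0xff;
            if (x != y) {
                return x - y;
            }
        }
        return aLen - bLen;  // when one is a prefix of the other, shorter first
    }
}
```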
ByteBuff wrapper Interface over NIO buffers
When an HFile block is available in the BucketCache, its data might be split across more than one bucket. The buckets are backed by ByteBuffers of various sizes, starting at 4 MB. By default our HFile block size is well below this (ie. 64 KB), so a single bucket buffer will usually serve more than one block. But some blocks, whose data begins near the edge of one bucket, span into the next bucket also. We want to avoid any sort of copy when reading blocks out of BucketCache, so we want a single data structure that can span multiple NIO buffers. Since NIO ByteBuffers cannot be sub-classed, we could not use an extension to wrap multiple ByteBuffers. Instead we have added a new Interface (with a subset of the APIs from ByteBuffer) named ByteBuff. We have two implementations of it. When the backing is a single ByteBuffer (most of the cases from BucketCache, and also cases such as the on heap LRU L1 cache), we make a SingleByteBuff instance. This just delegates calls to the wrapped NIO ByteBuffer. We also have a MultiByteBuff for when the data spans more than one ByteBuffer. The HFileBlock’s data structure type has been changed to be a ByteBuff.
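The idea behind MultiByteBuff can be sketched in a few lines (a toy illustration, not the actual ByteBuff API): present several ByteBuffers as one logical byte range and route each absolute read to the fragment that holds the requested index.

```java
import java.nio.ByteBuffer;

// Toy illustration of the MultiByteBuff idea, not the actual ByteBuff API:
// several NIO buffers presented as one logical, contiguous byte range.
public class MultiBufferSketch {
    private final ByteBuffer[] items;
    private final int[] itemBeginPos;  // logical start offset of each fragment

    public MultiBufferSketch(ByteBuffer... items) {
        this.items = items;
        this.itemBeginPos = new int[items.length + 1];
        for (int i = 0; i < items.length; i++) {
            itemBeginPos[i + 1] = itemBeginPos[i] + items[i].capacity();
        }
    }

    // Absolute read: locate the fragment holding this logical index and
    // read from it; no bytes are copied between fragments.
    public byte get(int index) {
        int i = 0;
        while (index >= itemBeginPos[i + 1]) {
            i++;
        }
        return items[i].get(index - itemBeginPos[i]);
    }

    public int capacity() {
        return itemBeginPos[items.length];
    }
}
```

A block whose bytes start near the end of one bucket buffer and continue into the next can then be read through a single view with no copy.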
Prevention of Block Eviction from BucketCache when in use
BucketCache evicts blocks and frees the associated buckets when it runs out of space for caching new blocks. Before HBASE-11425, any block could be evicted at any time, because we copied the bucket’s content into a new ByteBuffer whenever a block was read from the BlockCache. Since HBASE-11425, we go out of our way to avoid this copy of BlockCache content onto the heap. Data is instead served directly from the BucketCache’s buckets. So we have to prevent the eviction of blocks that are being actively read. We use a simple reference counting mechanism. The reference count of a block is incremented whenever it is read from the Cache (using the getBlock API). We have added a new API to BlockCache to return a block back to the Cache after the read on that block is over; the reference count is decremented then. A block can be evicted iff its reference count is zero.
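A bare-bones sketch of this reference counting (illustrative only; the real BucketCache must additionally handle races between the eviction check and a concurrent getBlock):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative reference counted cache block; not the actual HBase code.
public class RefCountedBlockSketch {
    private final AtomicInteger refCount = new AtomicInteger(0);

    // Incremented when the cache hands the block to a reader (cf. getBlock).
    public void retain() { refCount.incrementAndGet(); }

    // Decremented when the reader returns the block to the cache.
    public void release() { refCount.decrementAndGet(); }

    // Eviction is permitted iff no reader currently holds the block.
    public boolean evictable() { return refCount.get() == 0; }
}
```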
The reads that we perform refer to Cells that are serialized in these buckets. So as long as a scanner is still running, these blocks cannot be evicted. Eviction would mean that the Cells referred to by the reads could be corrupted, because eviction would have changed the backing block bytes.
The reference to these Cells in the buckets will be removed only after the Cells have been passed back to the Client. In the RPC tier, a CellBlock is created in the RpcServer where the cached Cells that help make up client results are copied -- encoded and compressed as appropriate -- from the referenced BucketCache buckets before being passed back to the client (as protobufs). At this point, all the references to the buckets are considered freed. Reference counting also means we delay the close of the scanners until the RPC layer is done writing out the read content.
In part two, we detail a comparison of the on heap and off heap read paths.