Comparing BlockCache Deploys

St.Ack on August 7th, 2014

A report comparing BlockCache deploys in Apache HBase for issue HBASE-11323 BucketCache all the time! Attempts to roughly equate five different deploys and compare how well they do under four different loading types that vary from no cache misses through to missing the cache most of the time. Makes a recommendation on when to use which deploy.

Prerequisite

In Nick Dimiduk's BlockCache 101 blog post, he details the different options available in HBase. We test what remains after the purge of SlabCache and the DoubleBlockCache caching policy implementation, both of which were deprecated in HBase 0.98 and removed in trunk because of Nick's findings and those of others.

Nick varies the JVM Heap+BlockCache sizes AND the dataset size. This report keeps JVM Heap+BlockCache size constant and varies the dataset size only. Nick looks at the 99th percentile only. This article looks at that as well as GC, throughput, and loadings. Cell sizes in the following tests also vary, from 1 byte to 256k.

Findings

If the dataset fits completely in cache, the default configuration, which uses the onheap LruBlockCache, performs best. Its GC is half that of the next most performant deploy type, CombinedBlockCache:Offheap, and it has at least 20% more throughput.

Otherwise, if your cache is experiencing churn running a steady stream of evictions, move your block cache offheap using CombinedBlockCache in the offheap mode. See the BlockCache section in the HBase Reference Guide for how to enable this deploy. Offheap mode requires only one third to one half of the GC of LruBlockCache when evictions are happening, and there is less variance as the cache misses climb. CombinedBlockCache:file mode has a better GC profile but less throughput than CombinedBlockCache:Offheap. Latency is slightly higher on CBC:Offheap than with the default LruBlockCache -- heap objects to cache have to be serialized in and out of the offheap memory -- but the 95th/99th percentiles are slightly better. CBC:Offheap uses a bit more CPU than LruBlockCache. CombinedBlockCache:Offheap is limited only by the amount of RAM available; it is not subject to GC.
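
As a sketch of what the offheap deploy involves (not the exact files used in these tests; see the Setup section below for those), it boils down to pointing the BucketCache ioengine offheap and sizing it in hbase-site.xml, and making sure the JVM's direct memory limit (for example -XX:MaxDirectMemorySize in hbase-env.sh) is large enough to hold the cache:

<!-- Sketch: CombinedBlockCache:Offheap; the size here is in megabytes, as in the Setup section below. -->
<property>
  <name>hbase.bucketcache.ioengine</name>
  <value>offheap</value>
</property>
<property>
  <name>hbase.bucketcache.size</name>
  <value>4096</value>
</property>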

Test

The graphs to follow show results from five different deploys each run through four different loadings.

Five Deploy Variants

  1. LruBlockCache The default BlockCache in place when you start up an unconfigured HBase install. With LruBlockCache, all blocks are loaded into the java heap. See BlockCache 101 and LruBlockCache for detail on the caching policy of LruBlockCache.
  2. CombinedBlockCache:Offheap CombinedBlockCache deploys two tiers: an L1, which is an LruBlockCache instance holding META blocks only (i.e. INDEX and BLOOM blocks), and an L2 tier, which is an instance of BucketCache. In this offheap mode deploy, the BucketCache uses DirectByteBuffers to host cached DATA blocks outside of the JVM heap.
  3. CombinedBlockCache:Onheap In this onheap ('heap') mode, the L2 cache is hosted inside the JVM heap, where it appears to the JVM as a single large allocation; internally it is managed by an instance of BucketCache. The L1 cache is an instance of LruBlockCache.
  4. CombinedBlockCache:file In this mode, an L2 BucketCache instance puts DATA blocks into a file (hence 'file' mode), in this case on a mounted tmpfs.
  5. CombinedBlockCache:METAonly No caching of DATA blocks (no L2 instance). DATA blocks are fetched every time. The INDEX blocks are loaded into an L1 LruBlockCache instance.

Memory is fixed for each deploy. The java heap is a small 8G to bring on distress earlier. For deploy type 1., the LruBlockCache is given 4G of the 8G JVM heap. For deploy types 2.-5., the L1 LruBlockCache is 0.1 * 8G (~800MB), which is more than enough to host the dataset's META blocks; this was confirmed by inspecting the Block Cache vitals displayed in the RegionServer UI. For deploy types 2., 3., and 4., the L2 BucketCache is 4G. Deploy type 5.'s L2 is 0G (using HBASE-11581 Add option so CombinedBlockCache L2 can be null (fscache)).

Four Loading Types

  1. All cache hits all the time.
  2. A small percentage of cache misses, with all misses inside the fscache soon after the test starts.
  3. Lots of cache misses, all still inside the fscache soon after the test starts.
  4. Mostly cache misses where many misses by-pass the fscache.

For each loading, we ran 25 clients reading randomly over 10M cells of zipfian-varied cell sizes, from 1 byte to 256k bytes, over 21 regions hosted by a single RegionServer, for 20 minutes. Clients would stay inside the cache size for loading 1., miss the cache at just under ~50% for loading 2., and so on.

The dataset was hosted on a single RegionServer on a small HDFS cluster of 5 nodes. See below for detail on the setup.

Graphs

For each attribute -- GC, Throughput, i/o -- we have five charts across, one for each deploy. Each graph can be divided into four parts: the first quarter has the all-in-cache loading running for 20 minutes, followed by the loading that has some cache misses (for twenty minutes), through to the loading with mostly cache misses in the final quarter.

To find out what each color represents, read the legend. The legends are small in the charts below; to see a larger version, browse to this non-apache-roller version of this document and click on the images there.

Concentrate on the first four graph types. The BlockCache Stats and I/O graphs add little other than confirming that the loadings ran the same across the different BlockCache deploys, with the fscache cutting in to soak up the seeks soon after start in all but the last, mostly-cache-miss, loading case.

GC

Be sure to check the y-axis in the graphs below (you can click on a chart to get a larger version). All profiles look basically the same, but a check of the y-axis will show that for all but the no-cache-misses case, CombinedBlockCache:Offheap, the second deploy type, has the best GC profile (less GC is better!).

The GC for the CombinedBlockCache:Offheap deploy mode looks to be climbing as the test runs. See the Notes section at the end for further comment.

[Images: GC charts, one per deploy, across the four loadings]

Throughput

CombinedBlockCache:Offheap is better unless there are few to no cache misses. In that case, LruBlockCache shines (I missed why there is the step in the LruBlockCache all-in-cache section of the graph).

[Images: Throughput charts, one per deploy, across the four loadings]

Latency

[Images: Latency charts, one per deploy, across the four loadings]

Latency when in-cache

Same as above except that we do not show the last loading -- we show the first three only -- since the last loading of mostly cache misses skews the diagrams such that it is hard to compare latency when we are inside cache (BlockCache and fscache).

[Images: In-cache latency charts, one per deploy, first three loadings only]

Load

Try aggregating the system and user CPUs when comparing.

[Images: Load charts, one per deploy, across the four loadings]

BlockCache Stats

Curves are mostly the same except for the cache-no-DATA-blocks case. It has no L2 deployed.

[Images: BlockCache stats charts, one per deploy, across the four loadings]

I/O

Read profiles are about the same across all deploys and tests, with a spike at first until the fscache cuts in and covers the reads. The exception is the final mostly-cache-misses case.

[Images: I/O charts, one per deploy, across the four loadings]

Setup

Master branch. 2.0.0-SNAPSHOT, r69039f8620f51444d9c62cfca9922baffe093610.  Hadoop 2.4.1-SNAPSHOT.

5 nodes, all the same, with 48G of RAM and six disks each. One master and one regionserver, each on a distinct node, with 21 regions of 10M rows of zipf-varied size -- 0 to 256k -- created as follows:

$ export HADOOP_CLASSPATH=/home/stack/conf_hbase:`./hbase-2.0.0-SNAPSHOT/bin/hbase classpath`
$ nohup  ./hadoop-2.4.1-SNAPSHOT/bin/hadoop --config /home/stack/conf_hadoop org.apache.hadoop.hbase.PerformanceEvaluation --valueSize=262144 --valueZipf --rows=100000 sequentialWrite 100 &
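
(The --rows flag is per client, so the 100 sequentialWrite clients writing 100000 rows each should account for the 10M cells read in the tests above.)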


Here is my loading script.  It performs 4 loadings: All in cache, just out of cache, a good bit out of cache and then mostly out of cache:

[stack@c2020 ~]$ more bin/bc_in_mem.sh
#!/bin/sh
HOME=/home/stack
testtype=$1
date=`date -u +"%Y-%m-%dT%H:%M:%SZ"`
echo testtype=$testtype $date >> nohup.out
HBASE_HOME=$HOME/hbase-2.0.0-SNAPSHOT
runtime=1200
clients=25
cycles=1000000
#for i in 38 76 304 1000; do
for i in 32 72 144 1000; do
  echo "`date` run size=${i}, clients=$clients ; $testtype time=$runtime size=$i" >> nohup.out
  timeout $runtime nohup ${HBASE_HOME}/bin/hbase --config /home/stack/conf_hbase org.apache.hadoop.hbase.PerformanceEvaluation --nomapred --valueSize=110000 --size=$i --cycles=$cycles randomRead $clients
done
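
The testtype argument is only a label echoed into nohup.out, so an invocation might look something like the following (an assumed example, not copied from the original runs):

[stack@c2020 ~]$ nohup bin/bc_in_mem.sh cbc-offheap &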

Java version:

[stack@c2020 ~]$ java -version
java version "1.7.0_45"
Java(TM) SE Runtime Environment (build 1.7.0_45-b18)
Java HotSpot(TM) 64-Bit Server VM (build 24.45-b08, mixed mode)


I enabled caching options by uncommenting these configuration properties in hbase-site.xml. hfile.block.cache.size was set to 0.5 to keep the math simple.



<property>
  <name>hfile.block.cache.size</name>
  <value>0.5</value>
</property>

<property>
  <name>hbase.bucketcache.ioengine</name>
  <value>heap</value>
</property>

<property>
  <name>hbase.bucketcache.size</name>
  <value>4196</value>
</property>
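
The 'heap' ioengine above gives the CombinedBlockCache:Onheap deploy. For the offheap and file deploys only the ioengine value changes; as a sketch (the tmpfs path is illustrative, not the one used in these tests):

<!-- CombinedBlockCache:Offheap -->
<property>
  <name>hbase.bucketcache.ioengine</name>
  <value>offheap</value>
</property>

<!-- CombinedBlockCache:file, pointing at a file on a mounted tmpfs -->
<property>
  <name>hbase.bucketcache.ioengine</name>
  <value>file:/mnt/tmpfs/bucketcache.data</value>
</property>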


Notes

The CombinedBlockCache has some significant overhead (to be better quantified -- 10%?). There will be slop left over because buckets will not fill to the brim, especially when block sizes vary. This was observed while running the loadings.

I tried to bring on a FullGC by running for more than a day with LruBlockCache fielding mostly BlockCache misses, and I failed. Thinking on it, the BlockCache only occupied 4G of an 8G heap. There was always elbow room available (free block count and maximum allocatable block size settled down to a nice number and held constant). TODO: Retry but with less elbow room available.

Longer running CBC offheap test

Some of the tests above show disturbing GC tendencies -- ever-rising GC -- so I ran tests for a longer period. It turns out that BucketCache throws an OOME in long-running tests. You need HBASE-11678 BucketCache ramCache fills heap after running a few hours (fixed in hbase-0.98.6+).

After the fix, the below test ran until the cluster was pulled out from under the loading:

[Images: GC over three days | Throughput over three days]
Here are some long running tests with bucket cache onheap (ioengine=heap). The throughput is less and GC is higher.

[Images: GC over three days onheap | Throughput over three days onheap]


Future

Does LruBlockCache get more erratic as heap size grows? Nick's post implies that it does.

Auto-sizing of the L1, onheap cache.

HBASE-11331 [blockcache] lazy block decompression has a report attached that evaluates the current state of the attached patch, which keeps blocks compressed while in the BlockCache. It finds that with the patch enabled there is "More GC. More CPU. Slower. Less throughput. But less i/o.", though there is an issue with compression block pooling that likely impinges on performance.

Review serialization in and out of L2 cache. No serialization?  Because we can take blocks from HDFS without copying onheap?