By Doug Meil (HBase Committer) and Thomas Murphy

Intro

One of the most
common questions in the HBase user community is how to estimate the
disk footprint of tables, which translates into the size of HFiles,
HBase’s internal file format.

We designed an
experiment at Explorys where we ran combinations of design time
options (rowkey length, column name length, row storage approach) and
runtime options (HBase ColumnFamily compression, HBase data block
encoding options) to determine these factors’ effects on the
resultant HFile size in HDFS.
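
As a point of reference, the on-disk footprint of a table can be read directly from HDFS. Below is a minimal sketch of one way to do it; it assumes the default 0.94-era layout in which each table lives directly under the HBase root directory (here /hbase), and the table path shown is illustrative.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    public class TableFootprint {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        FileSystem fs = FileSystem.get(conf);
        // Assumption: default root dir layout, i.e. /hbase/<tablename> in HBase 0.94.
        Path tableDir = new Path("/hbase/my-test-table");
        // Total bytes of all files (HFiles and friends) under the table directory.
        long bytes = fs.getContentSummary(tableDir).getLength();
        System.out.println(tableDir + " = " + bytes + " bytes");
      }
    }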

HBase Environment

CDH4.3.0 (HBase
0.94.6.1)

Design Time Choices

  1. Rowkey
    1. Thin: a 16-byte MD5 hash of an integer.
    2. Fat: a 64-byte SHA-256 hash of an integer.
    Note: neither of these is a realistic rowkey for a real application, but they were chosen because they are easy to generate and one is much larger than the other (a hashing sketch follows this list).

  2. Column Names
    1. Thin: 2-3 character column names (c1, c2).
    2. Fat: 10 characters, randomly chosen but consistent for all rows.
    Note: it is advisable to have small column names, but most people don’t start that way, so we have this as an option.

  3. Row Storage Approach
    1. KeyValue per column: the traditional way of storing data in HBase.
    2. One KeyValue per row (actually, two): one KV holds an Avro-serialized byte array containing all the data from the row, and the other holds an MD5 hash of the version of the Avro schema.
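
The post doesn’t show the key-generation code, but the two rowkey flavors can be sketched roughly as follows. The input encoding of the integer and the hex encoding of the SHA-256 digest are assumptions; they simply produce keys of the stated 16-byte and 64-byte lengths.

    import java.security.MessageDigest;

    public class RowkeySketch {

      // "Thin" rowkey: the raw 16-byte MD5 digest of the integer's string form.
      public static byte[] thinRowkey(int i) throws Exception {
        return MessageDigest.getInstance("MD5")
            .digest(Integer.toString(i).getBytes("UTF-8"));
      }

      // "Fat" rowkey: a hex-encoded SHA-256 digest, which is 64 characters long
      // (one plausible reading of the post's "64-byte SHA-256 hash").
      public static byte[] fatRowkey(int i) throws Exception {
        byte[] digest = MessageDigest.getInstance("SHA-256")
            .digest(Integer.toString(i).getBytes("UTF-8"));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
          hex.append(String.format("%02x", b));
        }
        return hex.toString().getBytes("UTF-8"); // 64 bytes
      }
    }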

Run Time Choices

  1. ColumnFamily Compression
    1. None
    2. GZ
    3. LZ4
    4. LZO
    5. Snappy
    Note: it is generally advisable to use compression, but what if you didn’t? So we tested that too.

  2. HBase Data Block Encoding
    1. None
    2. Prefix
    3. Diff
    4. Fast Diff
    Note: most people aren’t familiar with HBase Data Block Encoding. It is primarily intended for squeezing more data into the block cache, but it affects HFile size too. See HBASE-4218 for more detail. (A table-creation sketch using these options follows this list.)
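
For reference, here is a minimal sketch of how a table with a given ColumnFamily compression codec and data block encoding could be created through the 0.94-era Java client API. The table name, family name, and chosen options are illustrative, and the Compression import path shown is the 0.94 location (it moved in later releases).

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.io.encoding.DataBlockEncoding;
    import org.apache.hadoop.hbase.io.hfile.Compression;

    public class CreateTestTable {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        HColumnDescriptor cf = new HColumnDescriptor("d");    // illustrative family name
        cf.setCompressionType(Compression.Algorithm.SNAPPY);  // NONE, GZ, LZ4, LZO, SNAPPY
        cf.setDataBlockEncoding(DataBlockEncoding.FAST_DIFF); // NONE, PREFIX, DIFF, FAST_DIFF
        cf.setBlocksize(128 * 1024);                          // 128k blocksize, as in the test

        HTableDescriptor table = new HTableDescriptor("size-test"); // illustrative table name
        table.addFamily(cf);
        admin.createTable(table);
        admin.close();
      }
    }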

1000 rows were generated for each combination of table parameters. That is not a ton of data, but we don’t need a ton of data to see how the table size varies. Each row had 30 columns: 10 strings (each filled with 20 bytes of random characters), 10 integers (random numbers), and 10 longs (also random numbers).

HBase blocksize was 128k.
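
A loader along the following lines would generate the rows described above. The family name, the column-name scheme, and the use of the 16-byte MD5 (“thin”) rowkey are illustrative assumptions; the Put.add() calls are the 0.94-era API.

    import java.security.MessageDigest;
    import java.util.Random;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class LoadTestRows {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "size-test"); // illustrative table name
        byte[] cf = Bytes.toBytes("d");               // illustrative family name
        Random rand = new Random();

        for (int row = 0; row < 1000; row++) {
          // 16-byte MD5 ("thin") rowkey of the row number.
          byte[] rowkey = MessageDigest.getInstance("MD5").digest(Bytes.toBytes(row));
          Put put = new Put(rowkey);
          // 10 string columns, each 20 random characters.
          for (int c = 0; c < 10; c++) {
            byte[] value = new byte[20];
            for (int j = 0; j < value.length; j++) {
              value[j] = (byte) ('a' + rand.nextInt(26));
            }
            put.add(cf, Bytes.toBytes("s" + c), value);
          }
          // 10 integer columns and 10 long columns of random numbers.
          for (int c = 0; c < 10; c++) {
            put.add(cf, Bytes.toBytes("i" + c), Bytes.toBytes(rand.nextInt()));
            put.add(cf, Bytes.toBytes("l" + c), Bytes.toBytes(rand.nextLong()));
          }
          table.put(put);
        }
        table.close();
      }
    }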

Results

The easiest way to
navigate the results is to compare specific cases, progressing from
an initial implementation of a table to options for production.

Case #1: Fat Rowkey and Fat Column Names, Now What?

This is where most people start with HBase: rowkeys are larger than they should be (the Fat rowkey case) and column names are also inflated (the Fat column-name case).

Without CF
Compression or Data Block Encoding, the baseline is:

psudorandom-table-R1000-i10-s10_20-l10-NONE-NONE-FATKEY-FATCOL

HFile Size (bytes)  Rows  Compression  Data Block Encoding
6,293,670           1000  NONE         NONE

What if we just changed CF compression?

This drastically changes the HFile footprint. Snappy compression, for example, reduces the HFile size from 6.2 MB to 1.8 MB.
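
CF compression is set per column family, and for an existing table it can be enabled after the fact by altering the family. Existing HFiles only pick up the new codec when they are rewritten by a compaction, so a major compaction is typically forced to see the full effect. A minimal sketch against the 0.94 API, with illustrative table and family names:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.io.hfile.Compression;
    import org.apache.hadoop.hbase.util.Bytes;

    public class EnableCompression {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        // Fetch the existing family descriptor so other settings are preserved.
        HColumnDescriptor cf = admin.getTableDescriptor(Bytes.toBytes("size-test"))
            .getFamily(Bytes.toBytes("d"));
        cf.setCompressionType(Compression.Algorithm.SNAPPY);

        // Schema changes in 0.94 generally require the table to be offline.
        admin.disableTable("size-test");
        admin.modifyColumn("size-test", cf);
        admin.enableTable("size-test");

        // Rewrite existing HFiles with the new codec.
        admin.majorCompact("size-test");
        admin.close();
      }
    }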

HFile Size (bytes)  Rows  Compression  Data Block Encoding
1,362,033           1000  GZ           NONE
1,803,240           1000  SNAPPY       NONE
1,919,265           1000  LZ4          NONE
1,950,306           1000  LZO          NONE

However, we shouldn’t be too quick to celebrate. Remember that this is just the disk footprint. Over the wire the data is uncompressed, so 6.2 MB is still transferred from RegionServer to Client when doing a Scan over the entire table.

What if we just changed data block encoding?

Compression isn’t the only option, though. Even without compression, changing the data block encoding also reduces the HFile size. All options are an improvement over the 6.2 MB baseline.

HFile Size (bytes)  Rows  Compression  Data Block Encoding
1,491,000           1000  NONE         DIFF
1,492,155           1000  NONE         FAST_DIFF
2,244,963           1000  NONE         PREFIX

Combination

The following table
shows the results of all remaining CF compression / data block
encoding combinations.

HFile Size (bytes)  Rows  Compression  Data Block Encoding
1,146,675           1000  GZ           DIFF
1,200,471           1000  GZ           FAST_DIFF
1,274,265           1000  GZ           PREFIX
1,350,483           1000  SNAPPY       DIFF
1,358,190           1000  LZ4          DIFF
1,391,016           1000  SNAPPY       FAST_DIFF
1,402,614           1000  LZ4          FAST_DIFF
1,406,334           1000  LZO          FAST_DIFF
1,541,151           1000  SNAPPY       PREFIX
1,597,440           1000  LZO          PREFIX
1,622,313           1000  LZ4          PREFIX

Case #2: What if we re-designed the column names (and left the rowkey alone)?

Let’s assume that we re-designed our column names but left the rowkey alone. Using the “thin” column names without CF compression or data block encoding results in a 5.8 MB HFile, an improvement over the original 6.2 MB baseline. It doesn’t seem like much, but it is still roughly an 8% reduction in the eventual wire-transfer footprint.

psudorandom-table-R1000-i10-s10_20-l10-NONE-NONE-FATKEY

HFile Size (bytes)  Rows  Compression  Data Block Encoding
5,778,888           1000  NONE         NONE

Applying Snappy
compression can reduce the HFile size further:

HFile Size (bytes)  Rows  Compression  Data Block Encoding
1,349,451           1000  SNAPPY       DIFF
1,390,422           1000  SNAPPY       FAST_DIFF
1,536,540           1000  SNAPPY       PREFIX
1,785,480           1000  SNAPPY       NONE

Case #3: What if we re-designed the rowkey (and left the column names alone)?

In this example, what if we only redesigned the rowkey? Using the “thin” rowkey results in a 4.9 MB HFile, down from the 6.2 MB baseline, roughly a 22% reduction. Not a small savings!

psudorandom-table-R1000-i10-s10_20-l10-NONE-NONE-FATCOL

HFile Size (bytes)  Rows  Compression  Data Block Encoding
4,920,984           1000  NONE         NONE

Applying Snappy
compression can reduce the HFile size further:

HFile Size (bytes)  Rows  Compression  Data Block Encoding
1,295,895           1000  SNAPPY       DIFF
1,337,112           1000  SNAPPY       FAST_DIFF
1,489,446           1000  SNAPPY       PREFIX
1,739,871           1000  SNAPPY       NONE

However, note that the resulting HFile size with Snappy and no data block encoding (1.7 MB) is very similar to that of the baseline approach (i.e., fat rowkeys, fat column names) with Snappy and no data block encoding (1.8 MB). Why? CF compression can compensate on disk for a lot of bloat in rowkeys and column names.

Case #4: What if we re-designed both the rowkey and the column names?

By this time we’ve learned enough HBase to know that we need efficient rowkeys and column names. This produces a 4.4 MB HFile, a 30% savings over the 6.2 MB baseline.

HFile Size (bytes)  Rows  Compression  Data Block Encoding
4,406,418           1000  NONE         NONE

Applying Snappy
compression can reduce the HFile size further:

HFile Size (bytes)  Rows  Compression  Data Block Encoding
1,296,402           1000  SNAPPY       DIFF
1,338,135           1000  SNAPPY       FAST_DIFF
1,485,192           1000  SNAPPY       PREFIX
1,732,746           1000  SNAPPY       NONE

Again, the on-disk footprint with compression isn’t radically different from the others, as compression can compensate to a large degree for rowkey and column-name bloat.

Case #5: KeyValue Storage Approach (1 KV per Row vs. KV per Column)

What if we did
something radical and changed how we stored the data in HBase? With
this approach, we are using a single KeyValue per row holding all
of the columns of data for the row instead of a KeyValue per column
(the traditional way).
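
The post doesn’t include the serialization code, but the idea can be sketched roughly as follows with Avro’s GenericRecord API. The schema (trimmed to three fields here), family name, and column qualifiers are illustrative assumptions.

    import java.io.ByteArrayOutputStream;
    import java.security.MessageDigest;
    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.io.BinaryEncoder;
    import org.apache.avro.io.EncoderFactory;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class AvroRowSketch {
      // Illustrative schema; the real record would carry 10 strings, 10 ints, and 10 longs.
      private static final Schema SCHEMA = new Schema.Parser().parse(
          "{\"type\":\"record\",\"name\":\"Row\",\"fields\":["
          + "{\"name\":\"s0\",\"type\":\"string\"},"
          + "{\"name\":\"i0\",\"type\":\"int\"},"
          + "{\"name\":\"l0\",\"type\":\"long\"}]}");

      public static Put buildPut(byte[] rowkey, String s0, int i0, long l0) throws Exception {
        GenericRecord record = new GenericData.Record(SCHEMA);
        record.put("s0", s0);
        record.put("i0", i0);
        record.put("l0", l0);

        // Serialize the entire row into a single byte array.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(SCHEMA).write(record, encoder);
        encoder.flush();

        byte[] cf = Bytes.toBytes("d");  // illustrative family name
        Put put = new Put(rowkey);
        // KV #1: the Avro-serialized row.
        put.add(cf, Bytes.toBytes("avro"), out.toByteArray());
        // KV #2: an MD5 hash identifying the schema version that wrote KV #1.
        put.add(cf, Bytes.toBytes("schema"),
            MessageDigest.getInstance("MD5").digest(Bytes.toBytes(SCHEMA.toString())));
        return put;
      }
    }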

The resulting HFile, even uncompressed and without data block encoding, is radically smaller: 1.4 MB compared to the 6.2 MB baseline.

psudorandom-table-R1000-i10-s10_20-l10-NONE-NONE-AVRO

HFile Size (bytes)  Rows  Compression  Data Block Encoding
1,374,465           1000  NONE         NONE

Adding Snappy
compression and Data Block Encoding makes the resulting HFile size
even smaller.

HFile Size (bytes)  Rows  Compression  Data Block Encoding
1,119,330           1000  SNAPPY       DIFF
1,129,209           1000  SNAPPY       FAST_DIFF
1,133,613           1000  SNAPPY       PREFIX
1,150,779           1000  SNAPPY       NONE

Compare this 1.1 MB HFile (Snappy, no data block encoding) with the 1.7 MB HFile from the Thin rowkey / Thin column-name case (also Snappy, no data block encoding).

Summary

Although Compression
and Data Block Encoding can wallpaper over bad rowkey and column-name
decisions in terms of HFile size, you will pay the price for this in
terms of data transfer from RegionServer to Client. Also, concealing
the size penalty brings with it a performance penalty each time the
data is accessed or manipulated. So, the old advice about correctly
designing rowkeys and column names still holds.

In terms of the KeyValue storage approach, a single KeyValue per row offers significant savings in both data transfer (RegionServer to Client) and HFile size. However, this approach requires updating each row in its entirety, and old versions of rows are also stored in their entirety (as opposed to column-by-column changes). Furthermore, it is impossible to scan on select columns; the whole row must be retrieved and deserialized to access any information stored in the row. The importance of understanding this tradeoff cannot be over-stated, and it must be evaluated on an application-by-application basis.

Software engineering
is an art of managing tradeoffs, so there isn’t necessarily one
“best” answer. Importantly, this experiment only measures the
file size and not the time or processor load penalties imposed by the
use of compression, encoding, or Avro. The results generated in this
test are still based on certain assumptions and your mileage may
vary.

Here is the data if you are interested: http://people.apache.org/~dmeil/HBase_HFile_Size_2014_04.csv