By Doug Meil (HBase Committer) and Thomas Murphy

Intro

One of the most
common questions in the HBase user community is how to estimate the
disk footprint of tables, which translates into the size of HFiles,
HBase’s internal file format.

We designed an
experiment at Explorys where we ran combinations of design time
options (rowkey length, column name length, row storage approach) and
runtime options (HBase ColumnFamily compression, HBase data block
encoding options) to determine these factors’ effects on the
resultant HFile size in HDFS.
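
As a point of reference, the on-disk footprint of a table can be read directly from HDFS. Below is a minimal sketch of one way to do it; it assumes the default 0.94-era layout in which each table lives directly under the HBase root directory (here /hbase), and the table path shown is illustrative.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    public class TableFootprint {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        FileSystem fs = FileSystem.get(conf);
        // Assumption: default root dir layout, i.e. /hbase/<tablename> in HBase 0.94.
        Path tableDir = new Path("/hbase/my-test-table");
        // Total bytes of all files (HFiles and friends) under the table directory.
        long bytes = fs.getContentSummary(tableDir).getLength();
        System.out.println(tableDir + " = " + bytes + " bytes");
      }
    }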

HBase Environment

CDH4.3.0 (HBase
0.94.6.1)

Design Time Choices

  1. Rowkey
    1. Thin: a 16-byte MD5 hash of an integer.
    2. Fat: a 64-byte SHA-256 hash of an integer.
    Note: neither of these is a realistic rowkey for a real application, but they were chosen because they are easy to generate and one is much larger than the other (a hashing sketch follows this list).

  2. Column Names
    1. Thin: 2-3 character column names (c1, c2).
    2. Fat: 10 characters, randomly chosen but consistent for all rows.
    Note: it is advisable to have small column names, but most people don’t start that way, so we have this as an option.

  3. Row Storage Approach
    1. KeyValue per column: the traditional way of storing data in HBase.
    2. One KeyValue per row (actually, two): one KV holds an Avro-serialized byte array containing all the data from the row, and the other holds an MD5 hash of the version of the Avro schema.
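
The post doesn’t show the key-generation code, but the two rowkey flavors can be sketched roughly as follows. The input encoding of the integer and the hex encoding of the SHA-256 digest are assumptions; they simply produce keys of the stated 16-byte and 64-byte lengths.

    import java.security.MessageDigest;

    public class RowkeySketch {

      // "Thin" rowkey: the raw 16-byte MD5 digest of the integer's string form.
      public static byte[] thinRowkey(int i) throws Exception {
        return MessageDigest.getInstance("MD5")
            .digest(Integer.toString(i).getBytes("UTF-8"));
      }

      // "Fat" rowkey: a hex-encoded SHA-256 digest, which is 64 characters long
      // (one plausible reading of the post's "64-byte SHA-256 hash").
      public static byte[] fatRowkey(int i) throws Exception {
        byte[] digest = MessageDigest.getInstance("SHA-256")
            .digest(Integer.toString(i).getBytes("UTF-8"));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
          hex.append(String.format("%02x", b));
        }
        return hex.toString().getBytes("UTF-8"); // 64 bytes
      }
    }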

Run Time Choices

  1. ColumnFamily Compression
    1. None
    2. GZ
    3. LZ4
    4. LZO
    5. Snappy
    Note: it is generally advisable to use compression, but what if you didn’t? So we tested that too.

  2. HBase Data Block Encoding
    1. None
    2. Prefix
    3. Diff
    4. Fast Diff
    Note: most people aren’t familiar with HBase Data Block Encoding. It is primarily intended for squeezing more data into the block cache, but it affects HFile size too. See HBASE-4218 for more detail. (A table-creation sketch using these options follows this list.)
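
For reference, here is a minimal sketch of how a table with a given ColumnFamily compression codec and data block encoding could be created through the 0.94-era Java client API. The table name, family name, and chosen options are illustrative, and the Compression import path shown is the 0.94 location (it moved in later releases).

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.io.encoding.DataBlockEncoding;
    import org.apache.hadoop.hbase.io.hfile.Compression;

    public class CreateTestTable {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        HColumnDescriptor cf = new HColumnDescriptor("d");    // illustrative family name
        cf.setCompressionType(Compression.Algorithm.SNAPPY);  // NONE, GZ, LZ4, LZO, SNAPPY
        cf.setDataBlockEncoding(DataBlockEncoding.FAST_DIFF); // NONE, PREFIX, DIFF, FAST_DIFF
        cf.setBlocksize(128 * 1024);                          // 128k blocksize, as in the test

        HTableDescriptor table = new HTableDescriptor("size-test"); // illustrative table name
        table.addFamily(cf);
        admin.createTable(table);
        admin.close();
      }
    }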

1000 rows were generated for each combination of table parameters. That is not a ton of data, but we don’t need a ton of data to see how the table size varies. Each row had 30 columns: 10 strings (each filled with 20 bytes of random characters), 10 integers (random numbers), and 10 longs (also random numbers).

HBase blocksize was 128k.
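
A loader along the following lines would generate the rows described above. The family name, the column-name scheme, and the use of the 16-byte MD5 (“thin”) rowkey are illustrative assumptions; the Put.add() calls are the 0.94-era API.

    import java.security.MessageDigest;
    import java.util.Random;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class LoadTestRows {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "size-test"); // illustrative table name
        byte[] cf = Bytes.toBytes("d");               // illustrative family name
        Random rand = new Random();

        for (int row = 0; row < 1000; row++) {
          // 16-byte MD5 ("thin") rowkey of the row number.
          byte[] rowkey = MessageDigest.getInstance("MD5").digest(Bytes.toBytes(row));
          Put put = new Put(rowkey);
          // 10 string columns, each 20 random characters.
          for (int c = 0; c < 10; c++) {
            byte[] value = new byte[20];
            for (int j = 0; j < value.length; j++) {
              value[j] = (byte) ('a' + rand.nextInt(26));
            }
            put.add(cf, Bytes.toBytes("s" + c), value);
          }
          // 10 integer columns and 10 long columns of random numbers.
          for (int c = 0; c < 10; c++) {
            put.add(cf, Bytes.toBytes("i" + c), Bytes.toBytes(rand.nextInt()));
            put.add(cf, Bytes.toBytes("l" + c), Bytes.toBytes(rand.nextLong()));
          }
          table.put(put);
        }
        table.close();
      }
    }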

Results

The easiest way to
navigate the results is to compare specific cases, progressing from
an initial implementation of a table to options for production.

Case #1: Fat Rowkey and Fat Column Names, Now What?

This is where most people start with HBase: rowkeys are larger than they should be (the Fat rowkey case) and column names are also inflated (the Fat column-name case).

Without CF
Compression or Data Block Encoding, the baseline is:

psudorandom-table-R1000-i10-s10_20-l10-NONE-NONE-FATKEY-FATCOL

HFile Size (bytes)  Rows  Compression  Data Block Encoding
6,293,670           1000  NONE         NONE

What if we just changed CF compression?

This drastically changes the HFile footprint. Snappy compression, for example, reduces the HFile size from 6.2 MB to 1.8 MB.
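
CF compression is set per column family, and for an existing table it can be enabled after the fact by altering the family. Existing HFiles only pick up the new codec when they are rewritten by a compaction, so a major compaction is typically forced to see the full effect. A minimal sketch against the 0.94 API, with illustrative table and family names:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.io.hfile.Compression;
    import org.apache.hadoop.hbase.util.Bytes;

    public class EnableCompression {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        // Fetch the existing family descriptor so other settings are preserved.
        HColumnDescriptor cf = admin.getTableDescriptor(Bytes.toBytes("size-test"))
            .getFamily(Bytes.toBytes("d"));
        cf.setCompressionType(Compression.Algorithm.SNAPPY);

        // Schema changes in 0.94 generally require the table to be offline.
        admin.disableTable("size-test");
        admin.modifyColumn("size-test", cf);
        admin.enableTable("size-test");

        // Rewrite existing HFiles with the new codec.
        admin.majorCompact("size-test");
        admin.close();
      }
    }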

HFile Size (bytes)  Rows  Compression  Data Block Encoding
1,362,033           1000  GZ           NONE
1,803,240           1000  SNAPPY       NONE
1,919,265           1000  LZ4          NONE
1,950,306           1000  LZO          NONE

However, we shouldn’t be too quick to celebrate. Remember that this is just the disk footprint. Over the wire the data is uncompressed, so 6.2 MB is still transferred from RegionServer to Client when doing a Scan over the entire table.

What if we just changed data block encoding?

Compression isn’t the only option, though. Even without compression, changing the data block encoding also reduces the HFile size. All options are an improvement over the 6.2 MB baseline.

HFile Size (bytes)  Rows  Compression  Data Block Encoding
1,491,000           1000  NONE         DIFF
1,492,155           1000  NONE         FAST_DIFF
2,244,963           1000  NONE         PREFIX

Combination

The following table
shows the results of all remaining CF compression / data block
encoding combinations.

HFile Size (bytes)  Rows  Compression  Data Block Encoding
1,146,675           1000  GZ           DIFF
1,200,471           1000  GZ           FAST_DIFF
1,274,265           1000  GZ           PREFIX
1,350,483           1000  SNAPPY       DIFF
1,358,190           1000  LZ4          DIFF
1,391,016           1000  SNAPPY       FAST_DIFF
1,402,614           1000  LZ4          FAST_DIFF
1,406,334           1000  LZO          FAST_DIFF
1,541,151           1000  SNAPPY       PREFIX
1,597,440           1000  LZO          PREFIX
1,622,313           1000  LZ4          PREFIX

Case #2: What if we re-designed the column names (and left the rowkey alone)?

Let’s assume that we re-designed our column names but left the rowkey alone. Using the “thin” column names without CF compression or data block encoding results in a 5.8 MB HFile, an improvement over the original 6.2 MB baseline. It doesn’t seem like much, but it is still roughly an 8% reduction in the eventual wire-transfer footprint.

psudorandom-table-R1000-i10-s10_20-l10-NONE-NONE-FATKEY

HFile Size (bytes)  Rows  Compression  Data Block Encoding
5,778,888           1000  NONE         NONE

Applying Snappy
compression can reduce the HFile size further:

HFile Size (bytes)  Rows  Compression  Data Block Encoding
1,349,451           1000  SNAPPY       DIFF
1,390,422           1000  SNAPPY       FAST_DIFF
1,536,540           1000  SNAPPY       PREFIX
1,785,480           1000  SNAPPY       NONE

Case #3: What if we re-designed the rowkey (and left the column names alone)?

In this example, what if we only redesigned the rowkey? Using the “thin” rowkey results in a 4.9 MB HFile, down from the 6.2 MB baseline, roughly a 22% reduction. Not a small savings!

psudorandom-table-R1000-i10-s10_20-l10-NONE-NONE-FATCOL

HFile Size (bytes)  Rows  Compression  Data Block Encoding
4,920,984           1000  NONE         NONE

Applying Snappy
compression can reduce the HFile size further:

HFile Size (bytes)  Rows  Compression  Data Block Encoding
1,295,895           1000  SNAPPY       DIFF
1,337,112           1000  SNAPPY       FAST_DIFF
1,489,446           1000  SNAPPY       PREFIX
1,739,871           1000  SNAPPY       NONE

However, note that the resulting HFile size with Snappy and no data block encoding (1.7 MB) is very similar to that of the baseline approach (i.e., fat rowkeys, fat column names) with Snappy and no data block encoding (1.8 MB). Why? CF compression can compensate on disk for a lot of bloat in rowkeys and column names.

Case #4: What if we re-designed both the rowkey and the column names?

By this time we’ve learned enough HBase to know that we need efficient rowkeys and column names. This produces a 4.4 MB HFile, a 30% savings over the 6.2 MB baseline.

HFile Size (bytes)  Rows  Compression  Data Block Encoding
4,406,418           1000  NONE         NONE

Applying Snappy
compression can reduce the HFile size further:

HFile Size (bytes)  Rows  Compression  Data Block Encoding
1,296,402           1000  SNAPPY       DIFF
1,338,135           1000  SNAPPY       FAST_DIFF
1,485,192           1000  SNAPPY       PREFIX
1,732,746           1000  SNAPPY       NONE

Again, the on-disk footprint with compression isn’t radically different from the others, as compression can compensate to a large degree for rowkey and column-name bloat.

Case #5: KeyValue Storage Approach (1 KV per Row vs. KV per Column)

What if we did
something radical and changed how we stored the data in HBase? With
this approach, we are using a single KeyValue per row holding all
of the columns of data for the row instead of a KeyValue per column
(the traditional way).
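
The post doesn’t include the serialization code, but the idea can be sketched roughly as follows with Avro’s GenericRecord API. The schema (trimmed to three fields here), family name, and column qualifiers are illustrative assumptions.

    import java.io.ByteArrayOutputStream;
    import java.security.MessageDigest;
    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.io.BinaryEncoder;
    import org.apache.avro.io.EncoderFactory;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class AvroRowSketch {
      // Illustrative schema; the real record would carry 10 strings, 10 ints, and 10 longs.
      private static final Schema SCHEMA = new Schema.Parser().parse(
          "{\"type\":\"record\",\"name\":\"Row\",\"fields\":["
          + "{\"name\":\"s0\",\"type\":\"string\"},"
          + "{\"name\":\"i0\",\"type\":\"int\"},"
          + "{\"name\":\"l0\",\"type\":\"long\"}]}");

      public static Put buildPut(byte[] rowkey, String s0, int i0, long l0) throws Exception {
        GenericRecord record = new GenericData.Record(SCHEMA);
        record.put("s0", s0);
        record.put("i0", i0);
        record.put("l0", l0);

        // Serialize the entire row into a single byte array.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(SCHEMA).write(record, encoder);
        encoder.flush();

        byte[] cf = Bytes.toBytes("d");  // illustrative family name
        Put put = new Put(rowkey);
        // KV #1: the Avro-serialized row.
        put.add(cf, Bytes.toBytes("avro"), out.toByteArray());
        // KV #2: an MD5 hash identifying the schema version that wrote KV #1.
        put.add(cf, Bytes.toBytes("schema"),
            MessageDigest.getInstance("MD5").digest(Bytes.toBytes(SCHEMA.toString())));
        return put;
      }
    }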

The resulting HFile, even uncompressed and without data block encoding, is radically smaller: 1.4 MB compared to the 6.2 MB baseline.

psudorandom-table-R1000-i10-s10_20-l10-NONE-NONE-AVRO

HFile Size (bytes)  Rows  Compression  Data Block Encoding
1,374,465           1000  NONE         NONE

Adding Snappy
compression and Data Block Encoding makes the resulting HFile size
even smaller.

HFile Size (bytes)  Rows  Compression  Data Block Encoding
1,119,330           1000  SNAPPY       DIFF
1,129,209           1000  SNAPPY       FAST_DIFF
1,133,613           1000  SNAPPY       PREFIX
1,150,779           1000  SNAPPY       NONE

Compare this 1.1 MB HFile (Snappy, no data block encoding) with the 1.7 MB HFile from the Thin rowkey / Thin column-name case (also Snappy, no data block encoding).

Summary

Although Compression
and Data Block Encoding can wallpaper over bad rowkey and column-name
decisions in terms of HFile size, you will pay the price for this in
terms of data transfer from RegionServer to Client. Also, concealing
the size penalty brings with it a performance penalty each time the
data is accessed or manipulated. So, the old advice about correctly
designing rowkeys and column names still holds.

In terms of the KeyValue storage approach, a single KeyValue per row offers significant savings in both data transfer (RegionServer to Client) and HFile size. However, this approach requires updating each row in its entirety, and old versions of rows are also stored in their entirety (as opposed to column-by-column changes). Furthermore, it is impossible to scan on select columns; the whole row must be retrieved and deserialized to access any information stored in the row. The importance of understanding this tradeoff cannot be over-stated, and it must be evaluated on an application-by-application basis.

Software engineering
is an art of managing tradeoffs, so there isn’t necessarily one
“best” answer. Importantly, this experiment only measures the
file size and not the time or processor load penalties imposed by the
use of compression, encoding, or Avro. The results generated in this
test are still based on certain assumptions and your mileage may
vary.

Here is the data if you are interested: http://people.apache.org/~dmeil/HBase_HFile_Size_2014_04.csv