The Effect of ColumnFamily, RowKey and KeyValue Design on HFile Size
By Doug Meil, HBase Committer and Thomas Murphy
Intro
One of the most common questions in the HBase user community is how to estimate the disk footprint of tables, which translates into HFile size (the internal file format of HBase).
We designed an experiment at Explorys where we ran combinations of design-time options (rowkey length, column name length, row storage approach) and run-time options (HBase ColumnFamily compression, HBase data block encoding) to determine these factors' effects on the resultant HFile size in HDFS.
HBase Environment
CDH 4.3.0 (HBase 0.94.6.1)
Design Time Choices
- Rowkey
  - Thin: 16-byte MD5 hash of an integer.
  - Fat: 64-byte SHA-256 hash of an integer.
  - Note: neither of these is a realistic rowkey for a real application, but they were chosen because they are easy to generate and one is much larger than the other (see the sketch after this list).
- Column Names
  - Thin: 2-3 character column names (c1, c2).
  - Fat: 10 characters, randomly chosen but consistent for all rows.
  - Note: it is advisable to have small column names, but most people don't start that way, so we have this as an option.
- Row Storage Approach
  - KeyValue per column
    - This is the traditional way of storing data in HBase.
  - One KeyValue per row (actually, two)
    - One KV holds an Avro-serialized byte array containing all the data from the row.
    - Another KV holds an MD5 hash of the version of the Avro schema.
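As a rough sketch of how the thin and fat rowkeys above might be generated (assuming, for illustration, that the thin key is the raw 16-byte MD5 digest of the integer's string form and the fat key is the 64-character hex encoding of a SHA-256 digest; the post does not specify the exact encoding):

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class RowKeys {
    // Thin: the raw 16-byte MD5 digest of the integer.
    static byte[] thinKey(int i) throws Exception {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        return md5.digest(Integer.toString(i).getBytes(StandardCharsets.UTF_8));
    }

    // Fat: the SHA-256 digest hex-encoded, yielding a 64-byte rowkey.
    static byte[] fatKey(int i) throws Exception {
        MessageDigest sha = MessageDigest.getInstance("SHA-256");
        byte[] digest = sha.digest(Integer.toString(i).getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString().getBytes(StandardCharsets.UTF_8);
    }
}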
Run Time Choices
- ColumnFamily Compression
  - None
  - GZ
  - LZ4
  - LZO
  - Snappy
  - Note: it is generally advisable to use compression, but what if you didn't? So we tested that too.
- HBase Data Block Encoding
  - None
  - Prefix
  - Diff
  - Fast Diff
  - Note: most people aren't familiar with HBase Data Block Encoding. Primarily intended for squeezing more data into the block cache, it has effects on HFile size too. See HBASE-4218 for more detail.
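To make the run-time options concrete, here is a minimal sketch of creating a table with a given ColumnFamily compression and data block encoding against the 0.94-era client API (the table and family names are placeholders, not the ones used in the experiment):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.io.encoding.DataBlockEncoding;
import org.apache.hadoop.hbase.io.hfile.Compression;

public class CreateTable {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        HTableDescriptor table = new HTableDescriptor("example-table"); // hypothetical name
        HColumnDescriptor cf = new HColumnDescriptor("d");              // hypothetical family
        cf.setCompressionType(Compression.Algorithm.SNAPPY);  // CF compression
        cf.setDataBlockEncoding(DataBlockEncoding.FAST_DIFF); // data block encoding
        cf.setBlocksize(128 * 1024);                          // 128k blocksize, as in the experiment
        table.addFamily(cf);

        admin.createTable(table);
        admin.close();
    }
}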
1000 rows were generated for each combination of table parameters. Not a ton of data, but we don't necessarily need a ton of data to see the varying size of the table. There were 30 columns per row: 10 strings (each filled with 20 bytes of random characters), 10 integers (random numbers), and 10 longs (also random numbers). The HBase blocksize was 128k.
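A sketch of how each row's payload might be written with the traditional KeyValue-per-column approach (the family and the thin-style column names here are illustrative assumptions):

import java.util.Random;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class LoadRows {
    static void load(HTable table, byte[][] rowkeys) throws Exception {
        Random rand = new Random();
        byte[] cf = Bytes.toBytes("d"); // hypothetical ColumnFamily name
        for (byte[] rowkey : rowkeys) {
            Put put = new Put(rowkey);
            for (int c = 0; c < 10; c++) {
                // 10 string columns, each 20 random characters
                byte[] s = new byte[20];
                for (int j = 0; j < s.length; j++) {
                    s[j] = (byte) ('a' + rand.nextInt(26));
                }
                put.add(cf, Bytes.toBytes("s" + c), s);
                // 10 int columns and 10 long columns of random numbers
                put.add(cf, Bytes.toBytes("i" + c), Bytes.toBytes(rand.nextInt()));
                put.add(cf, Bytes.toBytes("l" + c), Bytes.toBytes(rand.nextLong()));
            }
            table.put(put);
        }
        table.flushCommits(); // make sure everything reaches the RegionServer
    }
}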
Results
The easiest way to navigate the results is to compare specific cases, progressing from an initial implementation of a table to options for production.
Case #1: Fat Rowkey and Fat Column Names, Now What?
This is where most people start with HBase. Rowkeys are not as optimal as they should be (i.e., the Fat rowkey case) and column names are also inflated (Fat column names).
Without CF compression or data block encoding, the baseline is:

Table | HFile Size (bytes) | Rows | Compression | Encoding
psudorandom-table-R1000-i10-s10_20-l10-NONE-NONE-FATKEY-FATCOL | 6,293,670 | 1000 | NONE | NONE
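As an aside, sizes like these can be read directly from HDFS. A minimal sketch, assuming the default 0.94-era layout of table directories under /hbase and that the table has been flushed so all data is in HFiles:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class TableSize {
    // Sums the bytes of everything under the table's directory in HDFS.
    static long tableBytes(Configuration conf, String tableName) throws Exception {
        FileSystem fs = FileSystem.get(conf);
        return fs.getContentSummary(new Path("/hbase/" + tableName)).getLength();
    }
}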
What if we just changed CF compression?
This drastically changes the HFile footprint. Snappy compression reduces the HFile size from 6.2 Mb to 1.8 Mb, for example.
HFile Size (bytes) | Rows | Compression | Encoding
1,362,033 | 1000 | GZ | NONE
1,803,240 | 1000 | SNAPPY | NONE
1,919,265 | 1000 | LZ4 | NONE
1,950,306 | 1000 | LZO | NONE
However, we shouldn't be too quick to celebrate. Remember that this is just the disk footprint. Over the wire the data is uncompressed, so 6.2 Mb is still being transferred from RegionServer to Client when doing a Scan over the entire table.
What if we just changed data block encoding?
Compression isn't the only option, though. Even without compression, we can change the data block encoding and also achieve HFile reduction. All options are an improvement over the 6.2 Mb baseline.
HFile Size (bytes) | Rows | Compression | Encoding
1,491,000 | 1000 | NONE | DIFF
1,492,155 | 1000 | NONE | FAST_DIFF
2,244,963 | 1000 | NONE | PREFIX
Combination
The following table shows the results of all remaining CF compression / data block encoding combinations.
HFile Size (bytes) | Rows | Compression | Encoding
1,146,675 | 1000 | GZ | DIFF
1,200,471 | 1000 | GZ | FAST_DIFF
1,274,265 | 1000 | GZ | PREFIX
1,350,483 | 1000 | SNAPPY | DIFF
1,358,190 | 1000 | LZ4 | DIFF
1,391,016 | 1000 | SNAPPY | FAST_DIFF
1,402,614 | 1000 | LZ4 | FAST_DIFF
1,406,334 | 1000 | LZO | FAST_DIFF
1,541,151 | 1000 | SNAPPY | PREFIX
1,597,440 | 1000 | LZO | PREFIX
1,622,313 | 1000 | LZ4 | PREFIX
Case #2: What if we re-designed the column names (and left the rowkey alone)?
Let's assume that we re-designed our column names but left the rowkey alone. Using the "thin" column names without CF compression or data block encoding results in an HFile 5.8 Mb in size, an improvement over the original 6.2 Mb baseline. It doesn't seem like much, but it's still a 6.5% reduction in the eventual wire-transfer footprint.
Table | HFile Size (bytes) | Rows | Compression | Encoding
psudorandom-table-R1000-i10-s10_20-l10-NONE-NONE-FATKEY | 5,778,888 | 1000 | NONE | NONE
Applying Snappy compression can reduce the HFile size further:

HFile Size (bytes) | Rows | Compression | Encoding
1,349,451 | 1000 | SNAPPY | DIFF
1,390,422 | 1000 | SNAPPY | FAST_DIFF
1,536,540 | 1000 | SNAPPY | PREFIX
1,785,480 | 1000 | SNAPPY | NONE
Case #3: What if we re-designed the rowkey (and left the column names alone)?
In this example, what if we only re-designed the rowkey? Using the "thin" rowkey results in an HFile size of 4.9 Mb, down from the 6.2 Mb baseline: a 21% reduction. Not a small savings!
Table | HFile Size (bytes) | Rows | Compression | Encoding
psudorandom-table-R1000-i10-s10_20-l10-NONE-NONE-FATCOL | 4,920,984 | 1000 | NONE | NONE
Applying Snappy compression can reduce the HFile size further:

HFile Size (bytes) | Rows | Compression | Encoding
1,295,895 | 1000 | SNAPPY | DIFF
1,337,112 | 1000 | SNAPPY | FAST_DIFF
1,489,446 | 1000 | SNAPPY | PREFIX
1,739,871 | 1000 | SNAPPY | NONE
However, note that the resulting HFile size with Snappy and no data block encoding (1.7 Mb) is very similar in size to the baseline approach (i.e., fat rowkeys, fat column names) with Snappy and no data block encoding (1.8 Mb). Why? The CF compression can compensate on disk for a lot of bloat in rowkeys and column names.
Case #4: What if we re-designed both the rowkey and the column names?
By this time we've learned enough HBase to know that we need efficient rowkeys and column names. This produces an HFile that is 4.4 Mb, a 29% savings over the baseline of 6.2 Mb.
HFile Size (bytes) | Rows | Compression | Encoding
4,406,418 | 1000 | NONE | NONE
Applying Snappy compression can reduce the HFile size further:

HFile Size (bytes) | Rows | Compression | Encoding
1,296,402 | 1000 | SNAPPY | DIFF
1,338,135 | 1000 | SNAPPY | FAST_DIFF
1,485,192 | 1000 | SNAPPY | PREFIX
1,732,746 | 1000 | SNAPPY | NONE
Again, the on-disk footprint with compression isn't radically different from the others, as compression can compensate to a large degree for rowkey and column-name bloat.
Case #5: KeyValue Storage Approach (e.g., 1 KV vs. KV-per-Column)
What if we did something radical and changed how we stored the data in HBase? With this approach, we use a single KeyValue per row holding all of the columns of data for the row, instead of a KeyValue per column (the traditional way).
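A minimal sketch of how such a row might be packed, assuming an Avro GenericRecord and hypothetical family/qualifier names (the post only specifies that one KV holds the Avro-serialized data and a second KV holds an MD5 hash of the schema version; hashing the schema JSON is one plausible reading):

import java.io.ByteArrayOutputStream;
import java.security.MessageDigest;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class AvroRow {
    // Packs an already-populated Avro record into a single-KV Put,
    // plus a second KV carrying an MD5 hash identifying the schema.
    static Put toPut(byte[] rowkey, GenericRecord record, Schema schema) throws Exception {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(schema).write(record, encoder);
        encoder.flush();

        byte[] cf = Bytes.toBytes("d"); // hypothetical ColumnFamily name
        Put put = new Put(rowkey);
        put.add(cf, Bytes.toBytes("r"), out.toByteArray()); // all columns, one KV
        put.add(cf, Bytes.toBytes("v"),                     // schema-version hash KV
                MessageDigest.getInstance("MD5").digest(schema.toString().getBytes("UTF-8")));
        return put;
    }
}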
The resulting HFile, even uncompressed and without data block encoding, is radically smaller: 1.4 Mb compared to 6.2 Mb.
Table | HFile Size (bytes) | Rows | Compression | Encoding
psudorandom-table-R1000-i10-s10_20-l10-NONE-NONE-AVRO | 1,374,465 | 1000 | NONE | NONE
Adding Snappy compression and data block encoding makes the resulting HFile size even smaller.

HFile Size (bytes) | Rows | Compression | Encoding
1,119,330 | 1000 | SNAPPY | DIFF
1,129,209 | 1000 | SNAPPY | FAST_DIFF
1,133,613 | 1000 | SNAPPY | PREFIX
1,150,779 | 1000 | SNAPPY | NONE
Compare the 1.1 Mb Avro table with Snappy and no encoding to the 1.7 Mb table with Snappy and no encoding from the thin-rowkey/thin-column-name case (#4).
Summary
Although compression and data block encoding can wallpaper over bad rowkey and column-name decisions in terms of HFile size, you will pay the price for this in terms of data transfer from RegionServer to Client. Also, concealing the size penalty brings with it a performance penalty each time the data is accessed or manipulated. So, the old advice about correctly designing rowkeys and column names still holds.
In terms of KeyValue approach, having a single KeyValue per row presents significant savings both in terms of data transfer (RegionServer to Client) and HFile size. However, there are consequences to this approach: each row must be updated in its entirety, and old versions of rows are also stored in their entirety (i.e., as opposed to column-by-column changes). Furthermore, it is impossible to scan on select columns; the whole row must be retrieved and deserialized to access any information stored in the row. The importance of understanding this tradeoff cannot be overstated, and it must be evaluated on an application-by-application basis.
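To make that last point concrete, here is a sketch of reading a single field under the single-KV approach (a hypothetical helper matching the writer sketch above): even one field costs fetching and decoding the entire serialized row.

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DecoderFactory;

public class AvroRowRead {
    // With one KV per row there is no way to ask HBase for a single column;
    // the whole Avro blob must be deserialized to extract any field.
    static Object readField(byte[] value, Schema schema, String field) throws Exception {
        GenericDatumReader<GenericRecord> reader = new GenericDatumReader<GenericRecord>(schema);
        GenericRecord record = reader.read(null,
                DecoderFactory.get().binaryDecoder(value, null));
        return record.get(field);
    }
}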
Software engineering is an art of managing tradeoffs, so there isn't necessarily one "best" answer. Importantly, this experiment only measures file size and not the time or processor-load penalties imposed by the use of compression, encoding, or Avro. The results generated in this test are still based on certain assumptions, and your mileage may vary.
Here is the data if interested: http://people.apache.org/~