Scaling Accumulo With Multi-Volume Support
MapReduce is a commonly used approach to
querying or analyzing large amounts of data. Typically MapReduce jobs
are created using some set of files in HDFS to produce a
result. When new files come in, they get added to the set, and the
job gets run again. A common Accumulo approach to this scenario is to
load all of the data into a single instance of Accumulo.
A single instance of Accumulo can scale quite
large[1,2] to accommodate high rates of ingest and query. The manner
in which ingest is performed typically depends on latency
requirements. When the desired latency is small, inserts are
performed directly into Accumulo. When the desired latency is allowed
to be large, then a bulk style of ingest[3] can be used. There are
other factors to consider as well, but they are outside the scope of
this article.
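Where latency allows, the bulk path boils down to handing Accumulo a directory of RFiles to import. Below is a minimal sketch of that call, assuming the 1.6-era client API; the instance name, ZooKeeper hosts, credentials, table, and directories are placeholders.

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;

public class BulkImportSketch {
  public static void main(String[] args) throws Exception {
    // Placeholder instance name and ZooKeeper quorum.
    Connector conn = new ZooKeeperInstance("myInstance", "zkHost1:2181,zkHost2:2181,zkHost3:2181")
        .getConnector("user", new PasswordToken("password"));

    // RFiles previously written by a MapReduce job sit in the import directory; the
    // table must already exist, and files that cannot be assigned to a tablet are
    // moved to the failures directory.
    conn.tableOperations().importDirectory("mytable", "/data/bulk/files", "/data/bulk/failures", false);
  }
}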
On large clusters using the bulk style of ingest,
input files are typically batched into MapReduce jobs to create a set
of output RFiles for import into Accumulo. The number of files per
job is typically determined by the required latency and the number of
MapReduce tasks that the cluster can complete in the given
time-frame. The resulting RFiles, when imported into Accumulo, are
added to the list of files for their associated tablets. Depending on
the configuration, this will cause Accumulo to major compact these
tablets. If the configuration is tweaked to allow more files per
tablet, to reduce the major compactions, then more files need to be
opened at query time when performing scans on the tablet. Note that
no single node is burdened by the file management, but the number of
file operations in aggregate is very large. If each server has
several hundred tablets, and there are a thousand tablet servers, and
each tablet compacts some files every few imports, we easily have
50,000 file operations (create, allocate a block, rename and delete)
every ingest cycle.
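As an illustration of the job side of that pipeline, the sketch below configures a MapReduce job to write RFiles with AccumuloFileOutputFormat, assuming the 1.6 mapreduce bindings. The input format, mapper parsing, and paths are illustrative only, and a real job would also partition keys so that reducer output lines up with the table's tablet boundaries (for example with a range partitioner built from the table's split points).

import java.io.IOException;

import org.apache.accumulo.core.client.mapreduce.AccumuloFileOutputFormat;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Value;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class RFileWriterJob {

  // Hypothetical mapper: the first tab-separated token is the row, the rest is the value.
  public static class ParseMapper extends Mapper<LongWritable, Text, Key, Value> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] parts = line.toString().split("\t", 2);
      if (parts.length == 2) {
        ctx.write(new Key(parts[0], "cf", "cq"), new Value(parts[1].getBytes()));
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "bulk-rfile-writer");
    job.setJarByClass(RFileWriterJob.class);

    job.setInputFormatClass(TextInputFormat.class);
    TextInputFormat.addInputPath(job, new Path("/data/incoming")); // placeholder input

    job.setMapperClass(ParseMapper.class);
    job.setMapOutputKeyClass(Key.class);
    job.setMapOutputValueClass(Value.class);

    // Identity reducer: the framework sorts the keys, which RFiles require.
    job.setReducerClass(Reducer.class);
    job.setOutputKeyClass(Key.class);
    job.setOutputValueClass(Value.class);

    // The reducers write RFiles that are later handed to importDirectory().
    job.setOutputFormatClass(AccumuloFileOutputFormat.class);
    AccumuloFileOutputFormat.setOutputPath(job, new Path("/data/bulk/files"));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}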
In addition to the NameNode operations caused by
bulk ingest, other Accumulo processes (e.g. master, gc) require
interaction with the NameNode. Single processes, like the garbage
collector, can be starved of responses because the NameNode is
limited in the number of concurrent operations it can service. It is not unusual for
an operator's request for “hadoop
fs -ls /accumulo” to take a minute before returning results
during the peak file-management periods. In particular, the file
garbage collector can fall behind, not finishing a cycle of
unreferenced file removal before the next ingest cycle creates a new
batch of files to be deleted.
The Hadoop community addressed the NameNode
bottleneck issue with HDFS federation[4] which allows a datanode to
serve up blocks for multiple namenodes. Additionally, ViewFS allows
clients to communicate with multiple namenodes through the use of a
client-side mount table. This functionality alone was insufficient for
Accumulo in the 1.6.0 release because ViewFS works at the directory level: for example,
/dirA is mapped to one NameNode and /dirB is mapped to another, whereas Accumulo uses a
single HDFS directory for its storage.
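The directory-level mapping can be seen directly from a client. The sketch below, assuming Hadoop 2.x ViewFS, sets two mount-table links and asks the client which underlying namespace serves each directory; the nameservice and directory names are placeholders.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ViewFsMountSketch {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "viewfs:///");
    // Client-side mount table: each directory is backed by a different nameservice.
    conf.set("fs.viewfs.mounttable.default.link./dirA", "hdfs://nameserviceA/dirA");
    conf.set("fs.viewfs.mounttable.default.link./dirB", "hdfs://nameserviceB/dirB");

    FileSystem viewFs = FileSystem.get(conf);
    // resolvePath() reports which underlying namespace actually serves each directory.
    System.out.println(viewFs.resolvePath(new Path("/dirA"))); // hdfs://nameserviceA/dirA
    System.out.println(viewFs.resolvePath(new Path("/dirB"))); // hdfs://nameserviceB/dirB
  }
}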
Multi-Volume Support (MVS), included in 1.6.0, comprises the
changes that allow Accumulo to work across multiple HDFS
clusters (called volumes in Accumulo) while continuing to use a
single HDFS directory. A new property, instance.volumes, can be
configured with multiple HDFS nameservices, and Accumulo will use them
all to balance NameNode operations. The nameservices configured
in instance.volumes may optionally use the NameNode High Availability feature, which is transparent
to Accumulo. With MVS you have two options to horizontally scale your
Accumulo instance. You can use an HDFS cluster with Federation and
multiple NameNodes or you can use separate HDFS clusters.
By default Accumulo will perform round-robin file
allocation for each tablet, spreading the files across the different
volumes. The file balancer is pluggable, allowing for custom
implementations. For example, if you use multiple HDFS clusters
rather than Federation, you may want to allocate all of the files for a
particular table to one volume.
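A custom chooser along those lines might look like the sketch below. This is only an illustration, assuming the 1.6-era server-side VolumeChooser interface (a single choose(String[] options) method) and the general.volume.chooser property; the package, class, and volume names are placeholders, so check the interface shipped with your release.

package com.example.accumulo; // hypothetical package

import org.apache.accumulo.server.fs.VolumeChooser;

// Sketch of a chooser that pins every new file to a preferred volume when that
// volume is among the configured options, falling back to the first option otherwise.
public class PreferredVolumeChooser implements VolumeChooser {

  private static final String PREFERRED = "hdfs://nameserviceA/accumulo"; // placeholder

  @Override
  public String choose(String[] options) {
    for (String option : options) {
      if (option.startsWith(PREFERRED)) {
        return option;
      }
    }
    return options[0];
  }
}

The implementation would then be named in accumulo-site.xml (assumption: via the general.volume.chooser property) so that the tablet servers load it in place of the default chooser.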
Comments in the JIRA[5] regarding backups could
lead to follow-on work. With the inclusion of snapshots in HDFS, you
could easily envision an application that quiesces the database or
some set of tables, flushes their entries from memory, and snapshots
their directories. These snapshots could then be copied to another
HDFS instance either for an on-disk backup, or bulk-imported into
another instance of Accumulo for testing or some other use.
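A rough shape of that idea is sketched below, assuming the 1.6 client API and Hadoop 2.x snapshot support; the instance, table, table-id directory, and snapshot names are placeholders, and the directory would first need to be made snapshottable by an HDFS administrator.

import java.net.URI;

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SnapshotBackupSketch {
  public static void main(String[] args) throws Exception {
    Connector conn = new ZooKeeperInstance("myInstance", "zkHost1:2181")
        .getConnector("user", new PasswordToken("password"));

    // Flush in-memory entries so the table's files on disk are complete, and wait for it.
    conn.tableOperations().flush("mytable", null, null, true);

    // Snapshot the directory holding the table's files (placeholder table id), then
    // copy the snapshot elsewhere, e.g. with distcp.
    FileSystem fs = FileSystem.get(URI.create("hdfs://nameserviceA"), new Configuration());
    fs.createSnapshot(new Path("/accumulo/tables/3"), "mytable-backup");
  }
}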
The example configuration below shows how to
set up Accumulo with HA NameNodes and Federation, as it is likely the
most complex setup. We had to reference several web sites, one of the HDFS
mailing lists, and the source code to find all of the configuration
parameters that were needed. The configuration below includes two
sets of HA NameNodes, each set servicing an HDFS nameservice in a
single HDFS cluster. In the example, nameserviceA is serviced
by NameNodes nn1 and nn2, and nameserviceB is serviced by NameNodes nn3
and nn4.
[1] http://ieeexplore.ieee.org/zpl/login.jsp?arnumber=6597155
[2] http://www.pdl.cmu.edu/SDI/2013/slides/big_graph_nsa_rd_2013_56002v1.pdf
[3] http://accumulo.apache.org/1.6/examples/bulkIngest.html
[4] https://issues.apache.org/jira/browse/HDFS-1052
[5] https://issues.apache.org/jira/browse/ACCUMULO-118
- By Dave Marion and Eric Newton
core-site.xml:
<property>
  <name>fs.defaultFS</name>
  <value>viewfs:///</value>
</property>
<property>
  <name>fs.viewfs.mounttable.default.link./nameserviceA</name>
  <value>hdfs://nameserviceA</value>
</property>
<property>
  <name>fs.viewfs.mounttable.default.link./nameserviceB</name>
  <value>hdfs://nameserviceB</value>
</property>
<property>
  <name>fs.viewfs.mounttable.default.link./nameserviceA/accumulo/instance_id</name>
  <value>hdfs://nameserviceA/accumulo/instance_id</value>
  <description>Workaround for ACCUMULO-2719</description>
</property>
<property>
  <name>dfs.ha.fencing.methods</name>
  <value>sshfence(hdfs:22)
shell(/bin/true)</value>
</property>
<property>
  <name>dfs.ha.fencing.ssh.private-key-files</name>
  <value></value>
</property>
<property>
  <name>dfs.ha.fencing.ssh.connect-timeout</name>
  <value>30000</value>
</property>
<property>
  <name>ha.zookeeper.quorum</name>
  <value>zkHost1:2181,zkHost2:2181,zkHost3:2181</value>
</property>
hdfs-site.xml:
<property>
  <name>dfs.nameservices</name>
  <value>nameserviceA,nameserviceB</value>
</property>
<property>
  <name>dfs.ha.namenodes.nameserviceA</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.ha.namenodes.nameserviceB</name>
  <value>nn3,nn4</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.nameserviceA.nn1</name>
  <value>host1:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.nameserviceA.nn2</name>
  <value>host2:8020</value>
</property>
<property>
  <name>dfs.namenode.http-address.nameserviceA.nn1</name>
  <value>host1:50070</value>
</property>
<property>
  <name>dfs.namenode.http-address.nameserviceA.nn2</name>
  <value>host2:50070</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.nameserviceB.nn3</name>
  <value>host3:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.nameserviceB.nn4</name>
  <value>host4:8020</value>
</property>
<property>
  <name>dfs.namenode.http-address.nameserviceB.nn3</name>
  <value>host3:50070</value>
</property>
<property>
  <name>dfs.namenode.http-address.nameserviceB.nn4</name>
  <value>host4:50070</value>
</property>
<property>
  <name>dfs.namenode.shared.edits.dir.nameserviceA.nn1</name>
  <value>qjournal://jHost1:8485;jHost2:8485;jHost3:8485/nameserviceA</value>
</property>
<property>
  <name>dfs.namenode.shared.edits.dir.nameserviceA.nn2</name>
  <value>qjournal://jHost1:8485;jHost2:8485;jHost3:8485/nameserviceA</value>
</property>
<property>
  <name>dfs.namenode.shared.edits.dir.nameserviceB.nn3</name>
  <value>qjournal://jHost1:8485;jHost2:8485;jHost3:8485/nameserviceB</value>
</property>
<property>
  <name>dfs.namenode.shared.edits.dir.nameserviceB.nn4</name>
  <value>qjournal://jHost1:8485;jHost2:8485;jHost3:8485/nameserviceB</value>
</property>
<property>
  <name>dfs.client.failover.proxy.provider.nameserviceA</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
<property>
  <name>dfs.client.failover.proxy.provider.nameserviceB</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
<property>
  <name>dfs.ha.automatic-failover.enabled.nameserviceA</name>
  <value>true</value>
</property>
<property>
  <name>dfs.ha.automatic-failover.enabled.nameserviceB</name>
  <value>true</value>
</property>
accumulo-site.xml:
<property>
  <name>instance.volumes</name>
  <value>hdfs://nameserviceA/accumulo,hdfs://nameserviceB/accumulo</value>
</property>