Scaling Accumulo With Multi-Volume Support
MapReduce is a commonly used approach to
querying or analyzing large amounts of data. Typically MapReduce jobs
are created using some set of files in HDFS to produce a
result. When new files come in, they get added to the set, and the
job gets run again. A common Accumulo approach to this scenario is to
load all of the data into a single instance of Accumulo.
A single instance of Accumulo can scale quite
large[1,2] to accommodate high rates of ingest and query. The manner
in which ingest is performed typically depends on latency
requirements. When the desired latency is small, inserts are
performed directly into Accumulo. When the desired latency is allowed
to be large, then a bulk style of ingest[3] can be used. There are
other factors to consider as well, but they are outside the scope of
this article.
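Where latency allows, the bulk path boils down to handing Accumulo a directory of RFiles to import. Below is a minimal sketch of that call, assuming the 1.6-era client API; the instance name, ZooKeeper hosts, credentials, table, and directories are placeholders.

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;

public class BulkImportSketch {
  public static void main(String[] args) throws Exception {
    // Placeholder instance name and ZooKeeper quorum.
    Connector conn = new ZooKeeperInstance("myInstance", "zkHost1:2181,zkHost2:2181,zkHost3:2181")
        .getConnector("user", new PasswordToken("password"));

    // RFiles previously written by a MapReduce job sit in the import directory; the
    // table must already exist, and files that cannot be assigned to a tablet are
    // moved to the failures directory.
    conn.tableOperations().importDirectory("mytable", "/data/bulk/files", "/data/bulk/failures", false);
  }
}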
On large clusters using the bulk style of ingest,
input files are typically batched into MapReduce jobs to create a set
of output RFiles for import into Accumulo. The number of files per
job is typically determined by the required latency and the number of
MapReduce tasks that the cluster can complete in the given
time-frame. The resulting RFiles, when imported into Accumulo, are
added to the list of files for their associated tablets. Depending on
the configuration, this will cause Accumulo to major compact these
tablets. If the configuration is tweaked to allow more files per
tablet, to reduce the major compactions, then more files need to be
opened at query time when performing scans on the tablet. Note that
no single node is burdened by the file management, but the number of
file operations in aggregate is very large. If each server has
several hundred tablets, and there are a thousand tablet servers, and
each tablet compacts some files every few imports, we easily have
50,000 file operations (create, allocate a block, rename and delete)
every ingest cycle.
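As an illustration of the job side of that pipeline, the sketch below configures a MapReduce job to write RFiles with AccumuloFileOutputFormat, assuming the 1.6 mapreduce bindings. The input format, mapper parsing, and paths are illustrative only, and a real job would also partition keys so that reducer output lines up with the table's tablet boundaries (for example with a range partitioner built from the table's split points).

import java.io.IOException;

import org.apache.accumulo.core.client.mapreduce.AccumuloFileOutputFormat;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Value;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class RFileWriterJob {

  // Hypothetical mapper: the first tab-separated token is the row, the rest is the value.
  public static class ParseMapper extends Mapper<LongWritable, Text, Key, Value> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] parts = line.toString().split("\t", 2);
      if (parts.length == 2) {
        ctx.write(new Key(parts[0], "cf", "cq"), new Value(parts[1].getBytes()));
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "bulk-rfile-writer");
    job.setJarByClass(RFileWriterJob.class);

    job.setInputFormatClass(TextInputFormat.class);
    TextInputFormat.addInputPath(job, new Path("/data/incoming")); // placeholder input

    job.setMapperClass(ParseMapper.class);
    job.setMapOutputKeyClass(Key.class);
    job.setMapOutputValueClass(Value.class);

    // Identity reducer: the framework sorts the keys, which RFiles require.
    job.setReducerClass(Reducer.class);
    job.setOutputKeyClass(Key.class);
    job.setOutputValueClass(Value.class);

    // The reducers write RFiles that are later handed to importDirectory().
    job.setOutputFormatClass(AccumuloFileOutputFormat.class);
    AccumuloFileOutputFormat.setOutputPath(job, new Path("/data/bulk/files"));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}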
In addition to the NameNode operations caused by
bulk ingest, other Accumulo processes (e.g. master, gc) require
interaction with the NameNode. Single processes, like the garbage
collector, can be starved of responses because the NameNode is
limited in the number of concurrent operations it can service. It is not unusual for
an operator's request for “hadoop
fs -ls /accumulo” to take a minute before returning results
during the peak file-management periods. In particular, the file
garbage collector can fall behind, not finishing a cycle of
unreferenced file removal before the next ingest cycle creates a new
batch of files to be deleted.
The Hadoop community addressed the NameNode
bottleneck issue with HDFS federation[4] which allows a datanode to
serve up blocks for multiple namenodes. Additionally, ViewFS allows
clients to communicate with multiple namenodes through the use of a
client-side mount table. This functionality alone was insufficient for
Accumulo in the 1.6.0 release because ViewFS works at the directory level: for example,
/dirA is mapped to one NameNode and /dirB is mapped to another, whereas Accumulo uses a
single HDFS directory for its storage.
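The directory-level mapping can be seen directly from a client. The sketch below, assuming Hadoop 2.x ViewFS, sets two mount-table links and asks the client which underlying namespace serves each directory; the nameservice and directory names are placeholders.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ViewFsMountSketch {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "viewfs:///");
    // Client-side mount table: each directory is backed by a different nameservice.
    conf.set("fs.viewfs.mounttable.default.link./dirA", "hdfs://nameserviceA/dirA");
    conf.set("fs.viewfs.mounttable.default.link./dirB", "hdfs://nameserviceB/dirB");

    FileSystem viewFs = FileSystem.get(conf);
    // resolvePath() reports which underlying namespace actually serves each directory.
    System.out.println(viewFs.resolvePath(new Path("/dirA"))); // hdfs://nameserviceA/dirA
    System.out.println(viewFs.resolvePath(new Path("/dirB"))); // hdfs://nameserviceB/dirB
  }
}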
Multi-Volume Support (MVS), included in 1.6.0, comprises the
changes that allow Accumulo to work across multiple HDFS
clusters (called volumes in Accumulo) while continuing to use a
single HDFS directory. A new property, instance.volumes, can be
configured with multiple HDFS nameservices, and Accumulo will use them
all to balance NameNode operations. The nameservices configured
in instance.volumes may optionally use the NameNode High Availability feature, which is transparent
to Accumulo. With MVS you have two options to horizontally scale your
Accumulo instance. You can use an HDFS cluster with Federation and
multiple NameNodes or you can use separate HDFS clusters.
By default Accumulo will perform round-robin file
allocation for each tablet, spreading the files across the different
volumes. The file balancer is pluggable, allowing for custom
implementations. For example, if you use multiple HDFS clusters
rather than Federation, you may want to allocate all of the files for a
particular table to one volume.
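A custom chooser along those lines might look like the sketch below. This is only an illustration, assuming the 1.6-era server-side VolumeChooser interface (a single choose(String[] options) method) and the general.volume.chooser property; the package, class, and volume names are placeholders, so check the interface shipped with your release.

package com.example.accumulo; // hypothetical package

import org.apache.accumulo.server.fs.VolumeChooser;

// Sketch of a chooser that pins every new file to a preferred volume when that
// volume is among the configured options, falling back to the first option otherwise.
public class PreferredVolumeChooser implements VolumeChooser {

  private static final String PREFERRED = "hdfs://nameserviceA/accumulo"; // placeholder

  @Override
  public String choose(String[] options) {
    for (String option : options) {
      if (option.startsWith(PREFERRED)) {
        return option;
      }
    }
    return options[0];
  }
}

The implementation would then be named in accumulo-site.xml (assumption: via the general.volume.chooser property) so that the tablet servers load it in place of the default chooser.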
Comments in the JIRA[5] regarding backups could
lead to follow-on work. With the inclusion of snapshots in HDFS, you
could easily envision an application that quiesces the database or
some set of tables, flushes their entries from memory, and snapshots
their directories. These snapshots could then be copied to another
HDFS instance either for an on-disk backup, or bulk-imported into
another instance of Accumulo for testing or some other use.
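A rough shape of that idea is sketched below, assuming the 1.6 client API and Hadoop 2.x snapshot support; the instance, table, table-id directory, and snapshot names are placeholders, and the directory would first need to be made snapshottable by an HDFS administrator.

import java.net.URI;

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SnapshotBackupSketch {
  public static void main(String[] args) throws Exception {
    Connector conn = new ZooKeeperInstance("myInstance", "zkHost1:2181")
        .getConnector("user", new PasswordToken("password"));

    // Flush in-memory entries so the table's files on disk are complete, and wait for it.
    conn.tableOperations().flush("mytable", null, null, true);

    // Snapshot the directory holding the table's files (placeholder table id), then
    // copy the snapshot elsewhere, e.g. with distcp.
    FileSystem fs = FileSystem.get(URI.create("hdfs://nameserviceA"), new Configuration());
    fs.createSnapshot(new Path("/accumulo/tables/3"), "mytable-backup");
  }
}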
The example configuration below shows how to
set up Accumulo with HA NameNodes and Federation, as it is likely the
most complex setup. We had to reference several web sites, one of the HDFS
mailing lists, and the source code to find all of the configuration
parameters that were needed. The configuration below includes two
sets of HA NameNodes, each set servicing an HDFS nameservice in a
single HDFS cluster. In the example, nameserviceA is serviced
by NameNodes nn1 and nn2, and nameserviceB is serviced by NameNodes nn3
and nn4.
[1] http://ieeexplore.ieee.org/zpl/login.jsp?arnumber=6597155
[2] http://www.pdl.cmu.edu/SDI/2013/slides/big_graph_nsa_rd_2013_56002v1.pdf
[3] http://accumulo.apache.org/1.6/examples/bulkIngest.html
[4] https://issues.apache.org/jira/browse/HDFS-1052
[5] https://issues.apache.org/jira/browse/ACCUMULO-118
- By Dave Marion and Eric Newton
core-site.xml:
<property>
  <name>fs.defaultFS</name>
  <value>viewfs:///</value>
</property>
<property>
  <name>fs.viewfs.mounttable.default.link./nameserviceA</name>
  <value>hdfs://nameserviceA</value>
</property>
<property>
  <name>fs.viewfs.mounttable.default.link./nameserviceB</name>
  <value>hdfs://nameserviceB</value>
</property>
<property>
  <name>fs.viewfs.mounttable.default.link./nameserviceA/accumulo/instance_id</name>
  <value>hdfs://nameserviceA/accumulo/instance_id</value>
  <description>Workaround for ACCUMULO-2719</description>
</property>
<property>
  <name>dfs.ha.fencing.methods</name>
  <value>sshfence(hdfs:22)
shell(/bin/true)</value>
</property>
<property>
  <name>dfs.ha.fencing.ssh.private-key-files</name>
  <value></value>
</property>
<property>
  <name>dfs.ha.fencing.ssh.connect-timeout</name>
  <value>30000</value>
</property>
<property>
  <name>ha.zookeeper.quorum</name>
  <value>zkHost1:2181,zkHost2:2181,zkHost3:2181</value>
</property>
hdfs-site.xml:
<property>
  <name>dfs.nameservices</name>
  <value>nameserviceA,nameserviceB</value>
</property>
<property>
  <name>dfs.ha.namenodes.nameserviceA</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.ha.namenodes.nameserviceB</name>
  <value>nn3,nn4</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.nameserviceA.nn1</name>
  <value>host1:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.nameserviceA.nn2</name>
  <value>host2:8020</value>
</property>
<property>
  <name>dfs.namenode.http-address.nameserviceA.nn1</name>
  <value>host1:50070</value>
</property>
<property>
  <name>dfs.namenode.http-address.nameserviceA.nn2</name>
  <value>host2:50070</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.nameserviceB.nn3</name>
  <value>host3:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.nameserviceB.nn4</name>
  <value>host4:8020</value>
</property>
<property>
  <name>dfs.namenode.http-address.nameserviceB.nn3</name>
  <value>host3:50070</value>
</property>
<property>
  <name>dfs.namenode.http-address.nameserviceB.nn4</name>
  <value>host4:50070</value>
</property>
<property>
  <name>dfs.namenode.shared.edits.dir.nameserviceA.nn1</name>
  <value>qjournal://jHost1:8485;jHost2:8485;jHost3:8485/nameserviceA</value>
</property>
<property>
  <name>dfs.namenode.shared.edits.dir.nameserviceA.nn2</name>
  <value>qjournal://jHost1:8485;jHost2:8485;jHost3:8485/nameserviceA</value>
</property>
<property>
  <name>dfs.namenode.shared.edits.dir.nameserviceB.nn3</name>
  <value>qjournal://jHost1:8485;jHost2:8485;jHost3:8485/nameserviceB</value>
</property>
<property>
  <name>dfs.namenode.shared.edits.dir.nameserviceB.nn4</name>
  <value>qjournal://jHost1:8485;jHost2:8485;jHost3:8485/nameserviceB</value>
</property>
<property>
  <name>dfs.client.failover.proxy.provider.nameserviceA</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
<property>
  <name>dfs.client.failover.proxy.provider.nameserviceB</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
<property>
  <name>dfs.ha.automatic-failover.enabled.nameserviceA</name>
  <value>true</value>
</property>
<property>
  <name>dfs.ha.automatic-failover.enabled.nameserviceB</name>
  <value>true</value>
</property>
accumulo-site.xml:
<property>
  <name>instance.volumes</name>
  <value>hdfs://nameserviceA/accumulo,hdfs://nameserviceB/accumulo</value>
</property>