20x faster MapReduce with GridGain Hadoop Accelerator
Just last year, GridGain opened up their data fabric platform under ASL 2.0. Shortly after, we added a new in-memory component (aka the Hadoop Accelerator) to the Apache big data stack that provides two major features:
- in-memory HDFS caching (see the quick smoke test right after this list)
- in-memory MapReduce, transparent to any existing MapReduce code, including Hive-generated queries
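To make the first point a bit more concrete: if the generated client configuration (described later in this post) points the default filesystem at the Accelerator's in-memory layer, then ordinary FsShell commands exercise the cache without any code changes. That routing is my assumption here, so treat this as a quick sanity sketch rather than a recipe:
# stock Hadoop FsShell; the conf directory is the one the deployment generates below
export HADOOP_CONF_DIR=/etc/hadoop/gridgain.client.conf
hadoop fs -put /etc/hosts /tmp/cache-smoke-test
hadoop fs -cat /tmp/cache-smoke-test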
With the usual Bigtop focus on user experience, setup and configuration are seamless and quick. Yet the results are pretty tremendous, as you'll see in a moment.
The Accelerator will soon be released as part of the official Bigtop 0.9 release. For now, you can quickly build the packages for your system by simply running
% gradlew gridgain-hadoop-[apt|yum]
on the current Bigtop trunk. You can do it easily with:
- git clone https://git-wip-us.apache.org/repos/asf/bigtop.git && cd bigtop
- puppet apply --modulepath=`pwd` -e "include bigtop_toolchain::installer"
- gradlew tasks
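Once the toolchain is in place and you have spotted the packaging task in the gradlew tasks output, the actual build is a single task; for example, on a Debian-based system (use the yum flavor on an RPM-based one):
gradlew gridgain-hadoop-apt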
Once the packages are built, you can follow the usual deployment process and list only "gridgain-hadoop" among the desired components if you aren't interested in HDFS or anything else.
In our case, we will also be installing hadoop and yarn so that we have some ready-made MapReduce examples for a comparative performance test. At this point you should have your cluster running the Accelerator in fully distributed mode. Time for a test. During the deployment, a new set of Hadoop client configurations was generated under /etc/hadoop/gridgain.client.conf, and you will need to point your client there using the HADOOP_CONF_DIR environment variable.
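Before running anything, it is worth a quick look at how the generated client configuration differs from the stock one. A recursive diff does the trick; /etc/hadoop/conf is the usual location of the default configuration on a Bigtop-deployed node, so adjust the path if yours lives elsewhere:
diff -ru /etc/hadoop/conf /etc/hadoop/gridgain.client.conf
The differences should boil down to a handful of client-side settings that point MapReduce and the default filesystem at the Accelerator instead of YARN and plain HDFS. With that in place, let's run the job: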
HADOOP_CONF_DIR=/etc/hadoop/gridgain.client.conf hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar pi 200 200
This command will estimate Pi using the quasi-Monte Carlo method by running 200 maps with 200 samples per map. If your cluster is similar to mine (three 2-core VM nodes with 8 GB of RAM each), you'll see the result in about 7 seconds from start to finish. Pi is estimated to be 3.1414. Close enough!
Now, let's run the same task as a classic MapReduce application on the YARN scheduler. Same as above, but this time with the standard Hadoop configuration:
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar pi 200 200
We got the same Pi estimate, which is expected considering that exactly the same code was run, but the time is very different: the execution took 143 seconds, or about 20 times longer.
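One easy way to reproduce the comparison on your own cluster is to wrap both invocations in time; the absolute numbers will of course depend on your hardware:
# in-memory MapReduce through the Accelerator client configuration
time HADOOP_CONF_DIR=/etc/hadoop/gridgain.client.conf hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar pi 200 200
# classic MapReduce on YARN with the default configuration
time hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar pi 200 200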
Thanks to a great implementation of a completely in-memory MapReduce engine, we were able to get a manyfold performance improvement without changing a single line of application code. The same approach works for Hive queries: one can make Hive work with the GridGain Accelerator by supplying as little as a trivial client-side hive-site.xml configuration file.
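As a sketch of what that looks like from the client side, assuming you drop such a hive-site.xml into a configuration directory of your choosing (both the directory name and the query below are made up for illustration):
hive --config /etc/hive/gridgain.client.conf -e "select count(*) from web_logs"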
One might argue that similar performance gains could perhaps be achieved by using other in-memory technologies like Spark. True. Yet with Spark you'll have to learn a new programming paradigm (and possibly a new language) and rewrite all your existing code to run on a Spark cluster. Here you just point your client to a different set of configurations. By the way, Bigtop was the first to recognize the value of Spark and started offering it as part of our 0.6.0 release in early 2012, well before any commercial vendors jumped on the bandwagon. So, I am not bashing Spark in any way.
The great news: the Hadoop Accelerator and the full data fabric platform are now available as the Apache Ignite (incubating) project. As we tilt more and more towards an in-memory data processing stack, we'll be adding the Ignite platform to Bigtop once its first release is out.