Apache Sqoop Graduates from Incubator

Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases. You can use Sqoop to import data from external structured datastores into Hadoop Distributed File System or related systems like Hive and HBase. Conversely, Sqoop can be used to extract data from Hadoop and export it to external structured datastores such as relational databases and enterprise data warehouses.

In its monthly meeting in March of 2012, the board of Apache Software Foundation (ASF) resolved to grant a Top-Level Project status to Apache Sqoop, thus graduating it from the Incubator. This is a significant milestone in the life of Sqoop, which has come a long way since its inception almost three years ago. The following figure offers a brief overview of what has happened in the life of Sqoop so far:

Apache Sqoop: Timeline

Figure 1: A timeline of Sqoop Project

Sqoop started as a contrib module for Apache Hadoop in May of 2009, first submitted as a patch to HADOOP-5815 by Aaron Kimball. Over the course of next year, it saw about 56 patches submitted towards its development. Given the inertia of large projects, Aaron decided to decouple it from Hadoop and host it elsewhere to facilitate faster development and release cycles. Consequently, in April of 2010 Sqoop was taken out from Hadoop via MAPREDUCE-1644 and hosted on GitHub by Cloudera as an Apache Licensed project.

Over the course of next year, Sqoop saw wide adoption along with four releases and 191 patches. An extension API was introduced early in Sqoop that allowed the development of high-speed third party connectors for rapid data transfer from specialized systems such as enterprise data warehouses. As a result, multiple connectors were developed by various vendors that plugged into Sqoop. To bolster this fledgling community of users and third party connector vendors, Cloudera decided to propose it for incubation in Apache. Sqoop was accepted for incubation by the Apache Incubator in June of 2011.

Inside the Incubator, Sqoop saw a healthy growth in its community and gained four new committers. With active community and committers, Sqoop made two incubating releases. The focus of its first release was migration of code from com.cloudera.sqoop namespace to org.apache.sqoop while preserving backward compatibility. Thanks to phenomenal work by Bilung Lee, the release manager of the first incubating release, this release met all of its expectations. The second incubating release of Sqoop focused on its interoperability with various versions of Hadoop. The release manager of this release - Jarek Jarcec Cecho - was instrumental in making sure that it delivered to this requirement and could work with Hadoop versions 0.20, 0.23 and 1.0. Along with the stated goals of these incubating releases, Sqoop saw a steady growth with 116 patches by various contributors and committers. With excellent mentorship by Patrick Hunt, other mentors of the project, and from Incubator PMC members, Sqoop acquired the ability to self-govern, follow the ASF policies and guidelines, and, foster and grow the community.

Sqoop successfully graduated from the Incubator in March of 2012 and is now a Top-Level Apache project. You can download its latest release artifacts by visiting http://sqoop.apache.org/.

While Sqoop has no doubt delivered significant value to the community of users, it is fair to say that it is in the early stages of fulfilling requirements of data integration around Hadoop. Work has started towards the development of next major revision of Sqoop which will address more of these requirements than before. Along the way, we are looking forward to grow the community many folds, get more committers on board, and solve some real challenging problems of data movement between Hadoop and external systems. We sincerely hope you will join us in taking Sqoop towards fulfilling all these goals and to become a standard component in Hadoop deployments everywhere.