Ever since Apache Bigtop entered incubation, we've been answering a very basic question: what exactly is Bigtop, and why should you or anyone in the Apache (or Hadoop) community care? The earliest and most succinct answer (the one used for the Apache Incubator proposal) simply stated that "Bigtop is a project for the development of packaging and tests of the Hadoop ecosystem". That was a nice explanation of how Bigtop relates to the rest of the Apache Software Foundation's (ASF) Hadoop ecosystem projects, yet it doesn't really help you understand the aspirations of Bigtop that go beyond what the ASF has traditionally done.

Software Projects vs. Software Products vs. Software Distributions (Stacks)

Perhaps it comes as a surprise that the core mission of the ASF is not to churn out code, and it certainly is not to produce software products, but rather to foster open source communities. The "communities over code" motto sums it up fairly well but, of course, it doesn't mean that code is unimportant. After all, those communities congregate around code bases and, at the end of the day, the tangible artifact produced by each of these communities is a source release of a software project. Typically, however, it is not a software product.

Imagine, for a moment, a world in which deploying Apache HTTP Server means:

  • Downloading the source release tarball
  • Setting up a build environment
  • Configuring the static portion of the build
  • Building from the source
  • Figuring out all the interface points between the rest of your software stack and HTTPD
  • Maintaining all of the above across your data center

Sounds quaint, right? Sounds like something you might have done a decade ago, but that you don't have to suffer through anymore. In fact, it sounds exactly like you'd rather view Apache HTTP Server as a software product that gets delivered to you by a vendor, rather than an ASF software project that you can participate in.

Moreover, consider that Apache HTTP Server doesn't exist in a vacuum. It needs to:

  • Integrate well with the rest of the system at build time. For example, which version of libc and Linux kernel do you build it against?
  • Integrate well with the rest of the system during the deployment time. For example, which version of libc and Linux kernel do you run it against? Are these versions the same as in the previous bullet item?
  • Behave in a meaningful fashion when you are trying to upgrade from a previous version to the next one, while also trying to retain all of the data and minimizing the downtime. For example, did the location of the document root change between the releases?
  • And so on...

It becomes apparent that each individual software project, however well released, maintained, and unit tested, is but the tip of a much bigger iceberg called a "software distribution" or a "software stack".

A software stack is the ultimate bundle that can be given to a customer in the form of a set of packages or, these days, a virtual machine image. It is presumed that each individual software project in such a bundle has been very carefully selected and optimized to work with the rest of the software projects from the same distribution. In short, a software stack is defined by a bill of materials specifying exact names, versions, and patches for each of the software components. It is further assumed that the bundle has been carefully validated as a whole in a variety of different deployment scenarios (unit testing alone is insufficient).
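To make the idea of a bill of materials a little more concrete, here is a minimal Python sketch. It is only an illustration, not how Bigtop actually stores this information: the class names, the component names, the version strings, and the patch identifier are all made up for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Component:
    """One entry in the bill of materials (hypothetical structure)."""
    name: str
    version: str
    patches: Tuple[str, ...] = ()


@dataclass
class BillOfMaterials:
    """A stack is defined by exact names, versions, and patches."""
    stack_name: str
    components: List[Component] = field(default_factory=list)

    def duplicates(self) -> List[str]:
        """Return component names that are listed more than once."""
        seen = set()
        dupes = []
        for component in self.components:
            if component.name in seen and component.name not in dupes:
                dupes.append(component.name)
            seen.add(component.name)
        return dupes

    def pinned_versions(self) -> Dict[str, str]:
        """Map each component name to the version the stack pins it to."""
        return {c.name: c.version for c in self.components}


# Made-up example data; these are not the contents of any real release.
bom = BillOfMaterials(
    stack_name="example-stack",
    components=[
        Component("component-a", "1.0.0"),
        Component("component-b", "0.9.2", patches=("EXAMPLE-123.patch",)),
    ],
)

if __name__ == "__main__":
    for name, version in bom.pinned_versions().items():
        print(name, version)
```

A check such as `duplicates()` hints at why the bill of materials matters: a stack that lists the same component twice no longer pins a single exact version, and validating the bundle as a whole becomes ambiguous.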

If all this sounds like a lot of work and a huge burden to each individual project to shoulder alone, that is because we are used to software vendors doing this work and simply giving us the end result (sometimes for free, sometimes not).

The Role of a Software Vendor

There are all sorts of software vendors that work with ASF software. They range from huge, intrinsically commercial, system companies such as IBM, to ostensibly free and open source organizations such as the Debian Project. They have little in common except for one thing: they all strive to satisfy their users by providing the best type of integration between different bits and pieces of software that they utilize. They have a systems view and their primary responsibility is to build a complete, fully functioning, and validated software stack.

Given that, historically, open source OS vendors have done a really nice job of integrating ASF software with the rest of their systems, it is not surprising that the foundation itself has never really had any interest in playing that role. That is, until it started to embrace more and more projects implemented in Java. And to make matters even more complicated, those projects are Java-based distributed systems.

Java in the Systems World

Considering the success Java has enjoyed, it is hard to remember that it was envisioned as a small platform for building embedded applications. There was little to no effort put into integrating it with the rest of the system, simply because it was the system. When Java graduated to a system-implementation platform, various efforts (OSGi and Project Jigsaw being the most notable) sprang up to address integration and packaging issues: modularity, multi-versioning, and dependency tracking. None of them has really succeeded. At least not yet. Java software has always been somewhat of a second-class citizen in Unix environments, and has not benefited much from the rich experience of integrating native applications with the rest of the system.

But today, the attitude of OS vendors towards Java is slowly changing. With the recent development of some very reasonable, system-wide guidelines, Java applications may soon stand shoulder-to-shoulder with their C and C++ brethren in all of the OS bundles. Until then, an approach where the ASF delegates system-level integration responsibilities to the OSVs or ISVs is simply not available to most of its Java-based projects.

The Apache Hadoop Ecosystem

Apache Hadoop and its ecosystem projects are sophisticated distributed systems written in Java. Despite the fact that most of the ASF communities working on them tend to practice a high degree of software development discipline (for example, continuous integration and rigorous attention to unit tests), there has always been a very significant gap when it comes to inter-project and OS-level integration. The model that works perfectly well for Apache HTTP Server, where the ASF delegates all of the integration efforts to various OS vendors, has not been an option for the Apache Hadoop ecosystem. OS vendors simply haven't had any experience in dealing with that type of software system. And at least the open source ones were not exactly interested in changing that status quo.

Of course, given the importance of Hadoop as a platform, it wasn't long before commercial ISVs started to fill that gap. Cloudera, with its Cloudera’s Distribution Including Apache Hadoop (CDH), was the first company to deliver a fully integrated, pre-packaged, and extensively validated software stack based on all of the ASF Projects. Others soon followed.

Different Strokes for Different Folks

Just as the question, "What is Debian?" will get different answers depending on whether you're asking Richard Stallman, Ian Murdock, Mark Shuttleworth, or me, the answer for Bigtop depends on the constituency:

  • For a casual user (a big data hacker): Bigtop provides a fully integrated, packaged, and validated stack of big data management software based on the Apache Hadoop ecosystem, specially tailored for your favorite version of Linux (and perhaps other OSes in the future). The packaged artifacts and the deployment experience will be very similar to CDH, but with two key exceptions:
    1. Unlike CDH, which offers a curation of functionality via selectively backported patches in order to provide the best experience for our customers, a Bigtop distribution simply builds from the very same source code that was released by various upstream projects.
    2. Unlike CDH, which puts a special emphasis on stability and backwards compatibility, a Bigtop distribution will be much more aggressive in tracking the very latest versions of Hadoop ecosystem components.
  • For the Apache Software Foundation: Bigtop will be the first project where the foundation itself becomes its own ISV. The idea is to provide a place where all of the developers interested in system and integration aspects of Hadoop can collaborate to define the next generation model of delivering distributed big data management systems. This should be especially exciting for the developers of various Hadoop ecosystem projects, since it has a strong potential for helping them with all system-level integration activities ranging from integration testing to the maintenance of the OS-specific code. After all, it is hardly a benefit to anybody that there's no code sharing for features such as service management and log rotation.
  • For OS vendors: Bigtop provides a readily available source of packaging, validation, and deployment code that can be used as a basis for integration of Apache Hadoop into the OS bundles. Ubuntu is the first major Linux distribution looking to reuse Bigtop code. Bigtop is currently focused on Linux-based OSes, but that could easily change in the future given the open nature of the project.
  • For various ISVs in the big data space extending the Hadoop platform: Bigtop could serve as the de-facto standard for validating and publishing custom software packages.
  • For all of the companies building their own distributions including Apache Hadoop: Bigtop could be a place of collaboration on the common platform and a treasure trove of wheels that don't need to be reinvented. Here's the list of just a few that already do that.

The Anatomy of Bigtop

Fundamentally, each Bigtop release delivers the following artifacts:

There's also a fair amount of infrastructure that the project has to maintain on its own in addition to the usual ASF Jenkins builds. Since Bigtop must have access to a diverse set of Linux distributions and configurations for builds and testing, we maintain our own Jenkins server in Amazon's EC2 cloud (graciously sponsored by Cloudera). Our continuous integration builds on that server make deploying the nightly snapshot of a bleeding edge Bigtop distribution a simple matter of choosing the right repo file (for example, Fedora16, SLES, Lucid, and others) and following a few simple steps. And the best part is that all of these repos are continuously validated by our package tests and deployed using our Puppet code, making Bigtop constantly eat its own dog food.

If all of this sounds interesting to you, get involved, and make sure you browse through our project website. The next release of Bigtop (0.4.0) will be based on the next generation Hadoop (YARN) 2.0, which makes this an exciting time to join the Bigtop project. If you happen to be in the San Francisco Bay Area, you can meet Bigtop developers at our bi-weekly study groups and hackathons.

Parting Thoughts

Apache Bigtop (incubating) is still a very young project. We have some ambitious goals in mind, but we can't possibly achieve them without your help. We need your feedback and we need your involvement. There's great inspiration in how the Linux community built projects such as Debian and, most recently, Yocto. The question is not whether it can be done, but rather whether the ASF and big data communities care to have a Hadoop-based big data platform governed by the principles of meritocracy and openness.

In fact, there are great parallels between how the Linux Foundation has championed the Yocto project and how the Apache Software Foundation could possibly do the same with Bigtop. At the end of the day, a common platform on which different (sometimes competing!) companies collaborate for the benefit of end users simply makes sense. As Linus would say, this is what world domination is all about.