The Apache Software Foundation Announces Apache™ Spark™ v1.0

Open Source large-scale, flexible, "Hadoop Swiss Army Knife" cluster computing framework offers enhanced data analysis and richer integration with other Apache projects

Forest Hill, MD –30 May 2014– The Apache Software Foundation (ASF), the all-volunteer developers, stewards, and incubators of more than 170 Open Source projects and initiatives, announced today the availability of Apache Spark v1.0, the super-fast, Open Source large-scale data processing and advanced analytics engine.

Apache Spark has been dubbed a "Hadoop Swiss Army knife" for its remarkable speed and ease of use, allowing developers to quickly write applications in Java, Scala, or Python, using its built-in set of over 80 high-level operators. With Spark, programs can run up to 100x faster than Apache Hadoop MapReduce in memory.

"1.0 is a huge milestone for the fast-growing Spark community. Every contributor and user who's helped bring Spark to this point should feel proud of this release," said Matei Zaharia, Vice President of Apache Spark.

Apache Spark is well-suited for machine learning, interactive queries, and stream processing. It is 100% compatible with Hadoop's Distributed File System (HDFS), HBase, and Cassandra, as well as any other Hadoop storage system, making existing data immediately usable in Spark. In addition, Spark supports SQL queries, streaming data, and complex analytics such as machine learning and graph algorithms out of the box.

New in v1.0, Apache Spark offers strong API stability guarantees (backward-compatibility throughout the 1.X series), a new Spark SQL component for accessing structured data, as well as richer integration with other Apache projects (Hadoop YARN, Hive, and Mesos).

Patrick Wendell, software engineer at Databricks and Apache Spark 1.0 release manager, explained, "In addition to providing long-term stability for Spark's core APIs, this release contains several new features. Spark 1.0 adds a unified submission tool for deploying applications on a local machine, Mesos, YARN, or a dedicated cluster. We've added a new module, Spark SQL, to provide schema-aware data modeling and SQL language support in Spark. Spark's machine learning library, MLlib, has been enhanced with several new algorithms. Spark’s streaming and graph libraries have also seen major updates. Across the board, we've focused on building tools to empower the data scientists, statisticians and engineers who must grapple with large data sets every day."

Spark was originally developed at UC Berkeley's AMPLab, and its ease of use has made it a go-to solution for both small and large enterprise environments across a wide range of industries, with adopters including Alibaba, ClearStory Data, Cloudera, Databricks, IBM, Intel, MapR, Ooyala, and Yahoo, among others. Not only are organizations rapidly adopting and deploying Apache Spark, many contributors are committing code to the project as well.

"Apache Spark is an important big data technology in delivering a high performance analytics solution for the IT industry and satisfying the fast-growing customer demand," said Michael Greene, Vice President and General Manager of System Technologies and Optimization at Intel. "Intel is proud to participate in its development and we congratulate the community on this release."

"At NASA, we're really excited to leverage Spark and its highly interactive analytic capabilities, and the speedups offered by 1.0, along with Spark SQL, are going to help out critical projects looking at measurement of snow in the Western US, as well as projects related to Regional Climate Modeling and Model Evaluation for U.S. National Climate Assessment-related activities," said Chris Mattmann, an ASF Director, Chief Architect, Instrument and Science Data Systems Section at NASA JPL, and Adjunct Associate Professor at the University of Southern California. "I'm looking forward to designing Spark-related projects in my Software Architectures and in my Search Engines courses at USC as well. The community is one of our most active at the ASF, the interest has really peaked, and these guys are doing a great job."

"We're continuing to see very fast growth — 102 individuals have contributed patches to this release over the past four months, which is our highest number of contributors ever," added Zaharia.

Availability and Oversight

As with all Apache products, Apache Spark software is released under the Apache License v2.0, and is overseen by a self-selected team of active contributors to the project. A Project Management Committee (PMC) guides the Project’s day-to-day operations, including community development and product releases. For documentation and ways to become involved with Apache Spark, visit http://spark.apache.org/

About The Apache Software Foundation (ASF)

Established in 1999, the all-volunteer Foundation oversees more than 170 leading Open Source projects, including Apache HTTP Server, the world's most popular Web server software. Through the ASF's meritocratic process known as "The Apache Way," more than 400 individual Members and 3,500 Committers successfully collaborate to develop freely available enterprise-grade software, benefiting millions of users worldwide: thousands of software solutions are distributed under the Apache License; and the community actively participates in ASF mailing lists, mentoring initiatives, and ApacheCon, the Foundation's official user conference, trainings, and expo. The ASF is a US 501(c)(3) charitable organization, funded by individual donations and corporate sponsors including Budget Direct, Citrix, Cloudera, Comcast, Facebook, Google, Hortonworks, HP, Huawei, IBM, InMotion Hosting, Matt Mullenweg, Microsoft, Pivotal, Produban, WANdisco, and Yahoo. For more information, visit http://www.apache.org/ or follow @TheASF on Twitter.

"Apache", "Spark", "Apache Spark", and "ApacheCon" are trademarks of The Apache Software Foundation. All other brands and trademarks are the property of their respective owners.

# # #