Open source Big Data in-memory columnar layer accelerates analytical processing and interchange by more than 100x.
Forest Hill, MD --17 Feb 2016-- The Apache Software Foundation (ASF), the all-volunteer developers, stewards, and incubators of more than 350 Open Source projects and initiatives, announced today Apache Arrow as a new Top-Level Project.
A high-performance cross-system data layer for columnar in-memory analytics, Apache Arrow provides the following benefits for Big Data workloads:
- Accelerates the performance of analytical workloads by more than 100x in some cases
- Enables multi-system workloads by eliminating cross-system communication overhead
Initially seeded by code from the Apache Drill project, Apache Arrow was built on top of a number of Open Source collaborations, and establishes a de-facto standard for columnar in-memory processing and interchange.
"The Open Source community has joined forces on Apache Arrow," said Jacques Nadeau, Vice President of Apache Arrow and Vice President Apache Drill. "Developers from 13 major Open Source Big Data projects are already on board --by introducing a new era of columnar in-memory analytics, we anticipate the majority of the world's data will be processed through Arrow within the next few years."
Code committers to Apache Arrow include developers from Apache Big Data projects Calcite, Cassandra, Drill, Hadoop, HBase, Impala, Kudu (incubating), Parquet, Phoenix, Spark, and Storm as well as established and emerging Open Source projects such as Pandas and Ibis.
"Arrow's cross platform and cross system strengths will enable Python and R to become first-class languages across the entire Big Data stack," said Wes McKinney, creator of Pandas.
Apache Arrow accelerates analytical processing by providing a high performance columnar in-memory representation. A number of processing algorithms benefit greatly from this memory design.
"A columnar in-memory data layer enables systems and applications to process data at full hardware speeds," said Todd Lipcon, original Apache Kudu creator and member of the Apache Arrow Project Management Committee. "Modern CPUs are designed to exploit data-level parallelism via vectorized operations and SIMD instructions. Arrow facilitates such processing."
In many workloads, 70-80% of CPU cycles are spent serializing and deserializing data. Arrow solves this problem by enabling data to be shared between systems and processes with no serialization, deserialization or memory copies.
"An industry-standard columnar in-memory data layer enables users to combine multiple systems, applications and programming languages in a single workload without the usual overhead," said Ted Dunning, Vice President of the Apache Incubator and member of the Apache Arrow Project Management Committee.
In addition to traditional relational data, Arrow supports complex data with dynamic schemas. For example, Arrow can handle JSON data which is commonly used in IoT workloads, modern applications and log files. Implementations are also available (or underway) for a number of programming languages including Java, C++ and Python to allow greater interoperability among a number of Big Data solutions.
"Real world use cases often include complex combinations of structured and rapidly growing complex-data. Already tested with Apache Drill, the efficient in-memory columnar representation and processing in Arrow will enable users to enjoy the performance of columnar processing with the flexibility of JSON," said Parth Chandra, member of the Apache Drill and Apache Arrow Project Management Committees.
Catch Apache Arrow in action at Strata + Hadoop World (San Jose: 30 March 2016, and London: 1-3 June 2016), as well as upcoming MeetUps and local events http://arrow.apache.org/events
Availability and Oversight
Apache Arrow software is released under the Apache License v2.0 and is overseen by a self-selected team of active contributors to the project. A Project Management Committee (PMC) guides the Project's day-to-day operations, including community development and product releases. For downloads, documentation, and ways to become involved with Apache Arrow, visit http://arrow.apache.org/
About The Apache Software Foundation (ASF)
Established in 1999, the all-volunteer Foundation oversees more than 350 leading Open Source projects, including Apache HTTP Server --the world's most popular Web server software. Through the ASF's meritocratic process known as "The Apache Way," more than 550 individual Members and 5,300 Committers successfully collaborate to develop freely available enterprise-grade software, benefiting millions of users worldwide: thousands of software solutions are distributed under the Apache License; and the community actively participates in ASF mailing lists, mentoring initiatives, and ApacheCon, the Foundation's official user conference, trainings, and expo. The ASF is a US 501(c)(3) charitable organization, funded by individual donations and corporate sponsors including Alibaba Cloud Computing, ARM, Bloomberg, Budget Direct, Cerner, Cloudera, Comcast, Confluent, Facebook, Google, Hortonworks, HP, Huawei, IBM, InMotion Hosting, iSigma, LeaseWeb, Matt Mullenweg, Microsoft, PhoenixNAP, Pivotal, Private Internet Access, Produban, Red Hat, Serenata Flowers, WANdisco, and Yahoo. For more information, visit http://www.apache.org/
or follow @TheASF
© The Apache Software Foundation. "Apache", "Apache Arrow", "Arrow", "Apache Calcite", "Calcite", "Apache Cassandra", "Cassandra", "Apache Drill", "Drill", "Apache Hadoop", "Hadoop", "Apache HBase", "HBase", "Apache Impala", "Impala", "Apache Kudu (incubating)", "Kudu (incubating)", "Apache Parquet", "Parquet", "Apache Phoenix", "Phoenix", "Apache Spark", "Spark", "Apache Storm", "Storm", "ApacheCon", and their logos are registered trademarks or trademarks of The Apache Software Foundation in the U.S. and/or other countries. All other brands and trademarks are the property of their respective owners.
# # #