Apache Open Source library, search, and document management tools used in investigating the biggest leak in journalism history.

Forest Hill, MD —17 April 2017— The Apache Software Foundation (ASF), the all-volunteer developers, stewards, and incubators of more than 350 Open Source projects and initiatives, announced today the role played by several Apache projects in the investigation of the Panama Papers.
At 2.6 terabytes of data, the Panama Papers is the largest leak of all time, comprising 11.5M financial and legal records sent from an anonymous source. The collaboration involved more than 400 journalists from 100 publications on six continents over the course of a year. The investigation exposed a complex system of criminal and corrupt activities hidden by offshore concerns, and recently received a Pulitzer Prize in the Explanatory Reporting category.
"The Apache Software Foundation incorporated 18 years ago with the mission to create software for the public good," said ASF President Sam Ruby. "We are honored that Apache software played a critical role with the Panama Papers, and congratulate the International Consortium of Investigative Journalists and their media partners on this prestigious award."
The discovery, exchange, and management of information involving 214,488 entities were made possible by:
  • Tika --toolkit that detects and extracts metadata and structured text content from various documents. Used for document processing.
  • Solr --enterprise search server, based on the Lucene Java search library, with advanced highlighting, faceted search, caching, and replication capabilities. Used for search and indexing.
  • PDFBox --Open Source Java library for working with PDF documents. Used for capturing text from PDF documents.
  • POI --Open Source Java library and APIs for various file formats based on Microsoft Office. Used to extract and manipulate Excel, Word, and PowerPoint files.
  • Commons --40+ projects for reusable Open Source Java components. Used to boost cross-platform development and productivity.

In addition to Apache software, a number of other Open Source projects were integral to the investigation. These include Tesseract-ocr (whose optical character recognition engine was used for capturing text from images), Project Blacklight (used as a discovery interface), and Jackcess (used for reading and writing MS Access databases): three examples of the millions of software solutions distributed under the Apache License v2.0, which allows for their free use, modification, and sharing.

Apache Open Source Projects
Many of the ASF's 350+ projects serve as the backbone for some of the world's most visible and widely used applications in Artificial Intelligence and Deep Learning, Big Data, Build Management, Cloud Computing, Content Management, DevOps, IoT and Edge Computing, Mobile, Servers, and Web Frameworks, among other categories.
Programmers, solutions architects, individual users, educators, researchers, corporations, governments, and enthusiasts worldwide depend on Apache software for development tools, libraries, frameworks, visualizers, end-user productivity solutions, and more.
75% of Apache's 150M lines of code have been developed over 65,000 person years, and are valued at US$7B. The ASF serves approximately 9M source code downloads from Apache mirrors on a yearly basis, excluding convenience binaries. Worldwide dependency on Apache software continues to grow, with Web requests received from every Internet-connected country on the planet.
The Apache Incubator is home to 63 projects undergoing development, with emerging innovations in Big Data, communication protocols, connected devices, cryptography, data science/machine learning/analytics, development frameworks, microfinances, remote desktop access, serverless computing, and more.
All Apache products are available to the public-at-large completely free of charge. All software development and project leadership is done entirely by volunteers. As a not-for-profit charitable organization, the ASF is funded through tax-deductible contributions from corporations, foundations, and private individuals. Approximately 75% of the ASF's US$1.2M annual budget is dedicated to critical infrastructure support services that keep Apache services running 24x7x365 at near 100% uptime, at an annual cost of less than US$5,000 per project.
About The Apache Software Foundation (ASF)
Established in 1999, the all-volunteer Foundation oversees more than 350 leading Open Source projects, including Apache HTTP Server --the world's most popular Web server software. Through the ASF's meritocratic process known as "The Apache Way," more than 620 individual Members and 6,000 Committers successfully collaborate to develop freely available enterprise-grade software, benefiting millions of users worldwide: thousands of software solutions are distributed under the Apache License; and the community actively participates in ASF mailing lists, mentoring initiatives, and ApacheCon, the Foundation's official user conference, trainings, and expo. The ASF is a US 501(c)(3) charitable organization, funded by individual donations and corporate sponsors including Alibaba Cloud Computing, ARM, Bloomberg, Budget Direct, Capital One, Cerner, Cloudera, Comcast, Confluent, Facebook, Google, Hortonworks, HP, Huawei, IBM, InMotion Hosting, iSigma, LeaseWeb, Microsoft, ODPi, PhoenixNAP, Pivotal, Private Internet Access, Produban, Red Hat, Serenata Flowers, WANdisco, and Yahoo. For more information, visit http://www.apache.org/ and https://twitter.com/TheASF
© The Apache Software Foundation. "Apache", "Apache Commons", "PDFBox", "Apache PDFBox", "POI", "Apache POI", "Solr", "Apache Solr", "Tika", "Apache Tika", and "ApacheCon" are registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. All other brands and trademarks are the property of their respective owners.
# # #