The all-volunteer Apache Software Foundation (ASF) develops, stewards, and incubates nearly 150 Open Source projects and initiatives, many of which power mission-critical applications in financial services, aerospace, publishing, government, healthcare, research, infrastructure, and more.
Did you know that 50% of the Top 10 downloaded Open Source products are Apache projects?
Did you know that Europe's DICODE academic and industry project relies on Apache Mahout for large-scale data mining?
We are pleased to showcase Apache Mahout, the scalable, professional-grade machine learning project at Apache for large-scale data analysis.

Quick peek: With so much data now available in digital form to so many businesses, Machine Learning is what helps you make sense of your data and provide better service to your customers:

  • Given interaction logs of your web shop, Mahout helps come up with good recommendations for products customers might be interested in buying.
  • When faced with an ever-increasing stream of news articles, Mahout helps you reduce that information load to a manageable number of groups of topically related articles.

Apache Mahout provides stable, industry-ready implementations of machine learning algorithms that help you get more out of your product. The project combines support for efficient standalone deployments with the option of scaling out to a distributed Apache Hadoop cluster, making it easy to scale with your business needs.

Background: Initiated by a group of Apache Lucene developers in summer 2007, the project started out as a Lucene subproject in early 2008. Since then it has attracted a variety of industry users, including large players such as Yahoo! and AOL as well as small to medium-sized businesses like Mippin and Speeddate. Apache Mahout graduated as an Apache Top-Level Project in early 2010.
Why Mahout: Apache Mahout includes algorithms that make building modern data-driven features easier, including:
  • Clustering, that is, grouping items based only on their similarity;
  • Classification, that is, assigning items to pre-defined categories;
  • Recommendation, that is, identifying items a user might like based on their behaviour;
  • Frequent Itemset Mining, that is, identifying items that usually appear together, e.g. in a customer purchase.
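To make the last item concrete, here is a minimal sketch of the idea behind frequent itemset mining, restricted to pairs of items. It is plain Java written for illustration only: the class name, method names, and sample baskets are assumptions, it does not use Mahout's API, and it is not the algorithm Mahout implements. It simply counts how many purchases contain both items of a pair and keeps the pairs that reach a minimum count.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical illustration class; not part of Apache Mahout. */
final class FrequentPairsSketch {

    /** Counts, for every unordered pair of items, how many baskets contain both. */
    static Map<String, Integer> countPairs(List<List<String>> baskets) {
        Map<String, Integer> counts = new TreeMap<>();
        for (List<String> basket : baskets) {
            // Deduplicate and sort so each pair is counted once per basket,
            // always in the same (alphabetical) order.
            List<String> items = new ArrayList<>(new TreeSet<>(basket));
            for (int i = 0; i < items.size(); i++) {
                for (int j = i + 1; j < items.size(); j++) {
                    counts.merge(items.get(i) + " + " + items.get(j), 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    /** Keeps only the pairs that appear in at least minSupport baskets. */
    static Map<String, Integer> frequentPairs(List<List<String>> baskets, int minSupport) {
        Map<String, Integer> frequent = new TreeMap<>(countPairs(baskets));
        frequent.values().removeIf(count -> count < minSupport);
        return frequent;
    }

    public static void main(String[] args) {
        // Made-up sample purchases, for illustration only.
        List<List<String>> baskets = Arrays.asList(
            Arrays.asList("bread", "butter", "milk"),
            Arrays.asList("bread", "butter"),
            Arrays.asList("milk", "tea"));
        System.out.println(frequentPairs(baskets, 2)); // prints {bread + butter=2}
    }
}
```

Counting every pair in memory like this only works for small data sets; handling purchase logs that no longer fit on one machine is the kind of problem a distributed implementation is meant to address.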
Apache Mahout is the only machine learning project that combines the advantages of having:
  • a permissive open source license supporting almost any business use case you can think of;
  • a very active community that responds to user requests and helps analyse your specific data problems;
  • production-ready implementations of algorithms covering most of the sophisticated data analysis jobs you would want to run on your data, while still being open and easy to adjust to your specific needs.
What's under the hood: Mahout 0.4 improves the overall application development experience through:
  • Model refactoring and CLI changes to improve integration and consistency
  • New ClusterEvaluator and CDbwClusterEvaluator offer new ways to evaluate clustering effectiveness
  • New VectorModelClassifier allows any set of clusters to be used for classification
  • RecommenderJob has evolved into a fully distributed item-based recommender
  • Support for more algorithms: Spectral Clustering and MinHash Clustering (still experimental); HMM-based sequence classification from GSoC (currently a sequential version only, and still experimental); a new type of NB classifier, plus feature reduction options for the existing one; a new sequential logistic regression training framework; and a new SGD classifier
  • New vector encoding framework for high speed vectorization without a pre-built dictionary
  • Several pieces of the old Colt framework promoted to tested status (QR decomposition, in particular)
  • Distributed Lanczos SVD implementation
  • Many, many small fixes, improvements, refactorings and cleanup
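As an illustration of what "vectorization without a pre-built dictionary" can mean, the sketch below hashes each word directly to a position in a fixed-size vector, so no word-to-index table has to be built first. This is a generic hashed encoding written in plain Java under stated assumptions: the class name, the tokenization rule, and the use of String.hashCode are choices made for this example, and it is not Mahout's encoder API or necessarily the scheme Mahout uses.

```java
import java.util.Locale;

/** Hypothetical illustration class; not part of Apache Mahout. */
final class HashedEncoderSketch {

    /**
     * Encodes text as a vector of word counts with a fixed number of dimensions.
     * Each word's position is derived from its hash, so no dictionary is needed.
     * Different words may collide on the same position; that is the trade-off.
     */
    static double[] encode(String text, int dimensions) {
        double[] vector = new double[dimensions];
        // Assumed tokenization: lower-case, split on runs of non-word characters.
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (token.isEmpty()) {
                continue; // split() yields an empty first token if text starts with punctuation
            }
            // floorMod keeps the index non-negative even for negative hash codes.
            int index = Math.floorMod(token.hashCode(), dimensions);
            vector[index] += 1.0;
        }
        return vector;
    }

    public static void main(String[] args) {
        double[] vector = encode("Mahout scales, Mahout learns", 16);
        double total = 0.0;
        for (double value : vector) {
            total += value;
        }
        System.out.println("dimensions=" + vector.length + " tokens=" + total);
    }
}
```

Because the vector size is fixed up front, new words seen later need no re-indexing; the cost is that unrelated words can occasionally share a position.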
Latest release: Apache Mahout v.0.4 was released on 31 October 2010 under the Apache License v.2.0. More details can be found in the release notes.
Downloads, documentation, examples, and more information: visit .
# # #