As you might already know, the Apache Hama project provides a set of machine learning algorithms that can be applied to very large-scale data in multiple domains.



In this post, I explain how to run the BSP-based K-Means algorithm using Apache Hama, assuming that you have already installed and tested a Hama cluster.



1. Download the Iris data set [Data set Information].



2. Then, run K-Means (the TRUNK version is recommended):

  % $HAMA_HOME/bin/hama jar hama-examples-x.x.x.jar kmeans /tmp/kmeans.txt /tmp/result 10 3
  ...
  [5.1, 3.5, 1.4, 0.2] belongs to cluster 2
  [4.9, 3.0, 1.4, 0.2] belongs to cluster 2
  [4.7, 3.2, 1.3, 0.2] belongs to cluster 2
  [4.6, 3.1, 1.5, 0.2] belongs to cluster 2
  [5.0, 3.6, 1.4, 0.2] belongs to cluster 2
  ...
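
To make the "belongs to cluster" lines above easier to read, here is a minimal single-machine sketch of the assign-and-update loop that K-Means performs. It is not Hama's BSP implementation: the class and method names (KMeansSketch, cluster, nearestCenter) are made up for illustration, the last two sample points are invented values, and I am assuming that the trailing 10 and 3 in the command stand for the maximum number of iterations and the number of clusters.

```java
import java.util.Arrays;

/** Single-machine K-Means sketch (hypothetical names; not Hama's BSP code). */
class KMeansSketch {

    /** Squared Euclidean distance between two vectors of equal length. */
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /** Index of the center closest to the given point. */
    static int nearestCenter(double[] point, double[][] centers) {
        int best = 0;
        for (int c = 1; c < centers.length; c++) {
            if (squaredDistance(point, centers[c]) < squaredDistance(point, centers[best])) {
                best = c;
            }
        }
        return best;
    }

    /**
     * Runs at most maxIterations rounds of "assign, then recompute centers",
     * stopping early once no assignment changes. Centers are updated in place;
     * the returned array holds the cluster index of each point.
     */
    static int[] cluster(double[][] points, double[][] centers, int maxIterations) {
        int[] assignment = new int[points.length];
        Arrays.fill(assignment, -1);
        int dim = points[0].length;
        for (int iter = 0; iter < maxIterations; iter++) {
            // Assignment step: attach every point to its nearest center.
            boolean changed = false;
            for (int p = 0; p < points.length; p++) {
                int c = nearestCenter(points[p], centers);
                if (c != assignment[p]) {
                    assignment[p] = c;
                    changed = true;
                }
            }
            if (!changed) {
                break; // converged before reaching maxIterations
            }
            // Update step: move each center to the mean of its points.
            double[][] sums = new double[centers.length][dim];
            int[] counts = new int[centers.length];
            for (int p = 0; p < points.length; p++) {
                counts[assignment[p]]++;
                for (int i = 0; i < dim; i++) {
                    sums[assignment[p]][i] += points[p][i];
                }
            }
            for (int c = 0; c < centers.length; c++) {
                if (counts[c] == 0) {
                    continue; // an empty cluster keeps its previous center
                }
                for (int i = 0; i < dim; i++) {
                    centers[c][i] = sums[c][i] / counts[c];
                }
            }
        }
        return assignment;
    }

    public static void main(String[] args) {
        double[][] points = {
            {5.1, 3.5, 1.4, 0.2}, // first three vectors appear in the output above
            {4.9, 3.0, 1.4, 0.2},
            {4.7, 3.2, 1.3, 0.2},
            {7.0, 3.0, 5.0, 1.5}, // invented values, for illustration only
            {6.5, 3.0, 4.5, 1.5},
        };
        // Seed two centers from the data; a real run would pick K=3 for Iris.
        double[][] centers = {points[0].clone(), points[3].clone()};
        int[] assignment = cluster(points, centers, 10);
        for (int p = 0; p < points.length; p++) {
            System.out.println(Arrays.toString(points[p]) + " belongs to cluster " + assignment[p]);
        }
    }
}
```

In the BSP version, each peer would run the assignment step on its own slice of the input and the peers would exchange partial sums between supersteps, but the per-point logic follows the same idea.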





And here's a performance comparison with Mahout.