My starting point was Nate Murray's post, which showed how to integrate a Hadoop Streaming job into Cascading. The limits of this approach are:
- Data is passed in and out of the python code as text.
- You have to parse the fields out of the lines; i.e., the data is no longer represented as tuples.
My solution has the following elements:
- We create a custom Cascading tap which encodes/decodes tuples using Hadoop typed bytes.
- We "wrap" our python code using Dumbo so that Dumbo handles converting the typed bytes to native python types.
- We set up a Cascading streaming job which uses Dumbo to run our map-reduce job.
Creating A Cascading Tap For TypedBytes
Update (04-26-2011): There's a problem with my original code. The source part of the tap doesn't properly set the field names based on the names in the dictionary.

import java.beans.ConstructorProperties;
import java.io.IOException;

import cascading.scheme.Scheme;
import cascading.tap.Tap;
import cascading.tuple.Fields;
import cascading.tuple.Tuple;
import cascading.tuple.TupleEntry;

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.hadoop.typedbytes.TypedBytesWritable;

/**
 * A TypedBytesMapScheme is a type of {@link Scheme} which uses sequence files to encode the tuples.
 * Each tuple is encoded as a java Map and then serialized using hadoop typed bytes.
 * We use Map objects so that we know which field corresponds to each position.
 * We waste space by encoding the field names with each tuple, but when we decode
 * the tuples in python we get a dictionary, so we know what each field is.
 */
public class TypedBytesMapScheme extends Scheme {

  /** Field serialVersionUID */
  private static final long serialVersionUID = 1L;

  /** Protected for use by TempDfs and other subclasses. Not for general consumption. */
  protected TypedBytesMapScheme() {
    super( null );
  }

  /**
   * Creates a new scheme that sources/sinks the given field names.
   *
   * @param fields the fields to source/sink
   */
  @ConstructorProperties({"fields"})
  public TypedBytesMapScheme( Fields fields ) {
    super( fields, fields );
  }

  @Override
  public void sourceInit( Tap tap, JobConf conf ) {
    conf.setInputFormat( SequenceFileInputFormat.class );
  }

  @Override
  public void sinkInit( Tap tap, JobConf conf ) {
    conf.setOutputKeyClass( TypedBytesWritable.class );   // supports TapCollector
    conf.setOutputValueClass( TypedBytesWritable.class ); // supports TapCollector
    conf.setOutputFormat( SequenceFileOutputFormat.class );
  }

  @Override
  public Tuple source( Object key, Object value ) {
    // the value is a TypedBytesWritable holding a Map of field name -> field value
    TypedBytesWritable bytes = (TypedBytesWritable) value;
    java.util.Map<String, Object> items = (java.util.Map<String, Object>) bytes.getValue();

    // build the tuple in the order given by the source fields
    Tuple data = new Tuple();
    Fields sfields = getSourceFields();

    for( int i = 0; i < sfields.size(); i++ ) {
      // the field name at position i may be a String or an io.Text
      Comparable f = sfields.get( i );

      if( f instanceof org.apache.hadoop.io.Text ) {
        data.add( items.get( ( (org.apache.hadoop.io.Text) f ).toString() ) );
      } else if( f instanceof String ) {
        data.add( items.get( (String) f ) );
      } else {
        // the field name wasn't a String or Text, so we don't know which map entry to fetch;
        // throwing an exception appears to cause a problem, so just log and fall back to the
        // positional key
        System.out.println( "TypedBytesMapScheme: field name wasn't a String or Text; falling back to position " + i );
        data.add( items.get( i ) );
      }
    }

    return data;
  }

  @Override
  public void sink( TupleEntry tupleEntry, OutputCollector outputCollector ) throws IOException {
    Tuple result = getSinkFields() != null ? tupleEntry.selectTuple( getSinkFields() ) : tupleEntry.getTuple();

    // tupleEntry.getFields() returns the fields actually in the tuple, while
    // getSinkFields() is the field selector on the tap; apply the selector to the
    // actual fields to get the field names in selection order
    Fields sfields = tupleEntry.getFields().select( getSinkFields() );

    java.util.Map<String, Object> objs = new java.util.HashMap<String, Object>();

    for( int i = 0; i < sfields.size(); i++ ) {
      String name = sfields.get( i ).toString();
      Object item = result.get( i );

      // unwrap common Writable types so typed bytes serializes the native value
      if( item instanceof org.apache.hadoop.io.Text ) {
        item = item.toString();
      } else if( item instanceof org.apache.hadoop.io.LongWritable ) {
        item = ( (org.apache.hadoop.io.LongWritable) item ).get();
      } else if( item instanceof org.apache.hadoop.io.IntWritable ) {
        item = ( (org.apache.hadoop.io.IntWritable) item ).get();
      } else if( item instanceof org.apache.hadoop.io.DoubleWritable ) {
        item = ( (org.apache.hadoop.io.DoubleWritable) item ).get();
      }

      objs.put( name, item );
    }

    TypedBytesWritable key = new TypedBytesWritable();
    TypedBytesWritable value = new TypedBytesWritable();
    key.setValue( 0 );
    value.setValue( objs );
    outputCollector.collect( key, value );
  }
}
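To use the scheme, you wrap it in an ordinary Hfs tap. Here is a minimal sketch; the field names below are placeholders, while "somepath/in" and "somepath/out" are the paths the streaming job reads and writes in the next section:

import cascading.tap.Hfs;
import cascading.tap.Tap;
import cascading.tuple.Fields;

// "user" and "score" are hypothetical field names; substitute the fields your flows produce
Fields fields = new Fields( "user", "score" );

// tuples written through this scheme arrive on the python side as dictionaries
Tap pyIn = new Hfs( new TypedBytesMapScheme( fields ), "somepath/in" );

// the trailing true deletes any existing output before the flow runs
Tap pyOut = new Hfs( new TypedBytesMapScheme( fields ), "somepath/out", true );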
Setting up a Cascading Flow which uses Dumbo
Below is the Java code for setting up the MapReduce flow which uses Dumbo to call the python mapper and reducer.
// path to the python executable
String pyexe = "/usr/local/python2.6/bin/python ";
// path to the python script to execute
String pyscript = "Utilities/dpf/cascading/run_each_pipe.py";

// input/output paths for our job; these files are written/read by other flows
// in our cascade and should contain tuples encoded using our TypedBytesMapScheme
String pymrin = "somepath/in";
String pymrout = "somepath/out";

// directories we need on our python path.
// common.pyc is part of the dumbo backends and typedbytes should already be on
// our python path, so we probably don't need this
String pypath = "common.pyc:typedbytes-0.3.6-py2.6.egg";

// how much memory (in bytes) to allow the mapper/reducer; I had to increase this
// beyond the default, otherwise my python scripts ran into memory errors loading c-extensions
String memlimit = "1256000000";

// set the home directory and python egg cache.
// I use the LinuxTaskController so that MR jobs run as the user who submitted them,
// which lets us access files in that user's home directory. We still need to set HOME
// explicitly, otherwise it defaults to the home directory of the mapred user.
String homedir = System.getProperty( "user.home" );
// set the python egg cache to the user's home directory
String pyeggcache = System.getProperty( "user.home" ) + "/.python-eggs";

// create the job configuration using the same parameters we would pass on the command line
JobConf streamConf = StreamJob.createJob( new String[]{
  "-input", pymrin,
  "-output", pymrout,
  // the first argument ("0") is the iteration number; the second is how much memory
  // the process can use, equivalent to dumbo's -memlimit option
  "-mapper", pyexe + pyscript + " map 0 " + memlimit,
  // to make the job map-only we could omit the reducer
  "-reducer", pyexe + pyscript + " red 0 " + memlimit,
  // environment variables used by dumbo
  "-cmdenv", "dumbo_mrbase_class=dumbo.backends.common.MapRedBase",
  "-cmdenv", "dumbo_jk_class=dumbo.backends.common.JoinKey",
  "-cmdenv", "dumbo_runinfo_class=dumbo.backends.streaming.StreamingRunInfo",
  "-cmdenv", "PYTHONPATH=" + pypath,
  "-cmdenv", "PYTHON_EGG_CACHE=" + pyeggcache,
  // set the home directory to the user submitting the job, because otherwise it
  // appears to be set to the mapred user, which causes problems
  "-cmdenv", "HOME=" + homedir,
  "-inputformat", "org.apache.hadoop.streaming.AutoInputFormat",
  // use a sequence file for the output format
  "-outputformat", "org.apache.hadoop.mapred.SequenceFileOutputFormat",
  "-jobconf", "stream.map.input=typedbytes",
  "-jobconf", "stream.reduce.input=typedbytes",
  "-jobconf", "stream.map.output=typedbytes",
  "-jobconf", "stream.reduce.output=typedbytes",
  // pipe_func (defined elsewhere in my code) holds the name used for the job
  "-jobconf", "mapred.job.name=" + pipe_func,
  // increase the memory for the child JVMs so we don't run out of memory loading the dll
  //"-jobconf", "mapred.child.java.opts=-Xmx1000m"
  } );

boolean deleteSinkOnInit = true;
mrflow = new MapReduceFlow( "streaming flow", streamConf, deleteSinkOnInit );
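Because the streaming flow's input and output paths line up with the sink and source taps of the surrounding flows, everything can be scheduled together with a CascadeConnector. A minimal sketch, assuming hypothetical upstreamFlow and downstreamFlow that write to somepath/in and read from somepath/out through the TypedBytesMapScheme taps shown earlier:

import cascading.cascade.Cascade;
import cascading.cascade.CascadeConnector;

// upstreamFlow and downstreamFlow are placeholders for the other flows in the cascade;
// the connector orders the flows by matching their source and sink tap paths
Cascade cascade = new CascadeConnector().connect( upstreamFlow, mrflow, downstreamFlow );
cascade.complete();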
Python Job
On the python side, you create Mappers and Reducers as you would for a regular dumbo job:

class Mapper(Base):  # Base: application-specific base class, defined elsewhere in my code
    """This is the base mapper which pushes data through a pipe."""

    def __call__(self, key, value):
        """Push the values through the operator.

        key   - shouldn't contain any useful information
        value - should be a dictionary representing the tuple
        """
        # ..your code..
        yield out_key, out


class Reducer(Base):
    """The reducer."""

    def __call__(self, key, valgen):
        """Push the values through the operator.

        Parameters
        ----------
        valgen - generator to iterate over the values
        """
        # ..your code here
        yield key, out


def run():
    """We add this run function so that other functions can actually invoke it."""
    import dumbo
    job = dumbo.Job()
    job.additer(Mapper, Reducer)
    job.run()


if __name__ == "__main__":
    run()