I’m currently working on improving RowSimilarityJob, one of my first contributions to Mahout. It’s a Map/Reduce job that computes the pairwise similarities between the row vectors of a sparse matrix. While this problem has quadratic worst-case runtime, one can achieve linear scalability when the matrix meets certain sparsity constraints and appropriate downsampling is applied.
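To give an intuition for why sparsity helps: instead of comparing every pair of rows directly, one can iterate over the columns and emit a partial dot product for each pair of rows that cooccur in a column. The total work is then proportional to the number of cooccurrences rather than to the square of the number of rows. Here is a minimal single-machine sketch of that idea (computing cosine similarities); the class and method names are mine, and this is not Mahout’s actual implementation:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ColumnwiseSimilaritySketch {

    // one nonzero entry of a sparse column: (row index, value)
    record Entry(int row, double value) {}

    /**
     * Computes cosine similarities between rows from a column-wise view
     * of the matrix. Only row pairs that share at least one nonzero
     * column are ever touched, so sparse columns mean little work.
     */
    public static Map<String, Double> cosineSimilarities(
            List<List<Entry>> columns, double[] rowNorms) {
        Map<String, Double> dots = new HashMap<>();
        for (List<Entry> column : columns) {
            // each pair of rows cooccurring in this column
            // contributes value_i * value_j to their dot product
            for (int i = 0; i < column.size(); i++) {
                for (int j = i + 1; j < column.size(); j++) {
                    Entry a = column.get(i);
                    Entry b = column.get(j);
                    String key = a.row() + "," + b.row();
                    dots.merge(key, a.value() * b.value(), Double::sum);
                }
            }
        }
        // normalize the accumulated dot products by the row norms
        Map<String, Double> similarities = new HashMap<>();
        for (Map.Entry<String, Double> e : dots.entrySet()) {
            String[] rows = e.getKey().split(",");
            int r1 = Integer.parseInt(rows[0]);
            int r2 = Integer.parseInt(rows[1]);
            similarities.put(e.getKey(),
                    e.getValue() / (rowNorms[r1] * rowNorms[r2]));
        }
        return similarities;
    }
}
```

In a Map/Reduce setting the inner pair loop becomes the mapper (emitting cooccurrence contributions keyed by row pair) and the summation becomes the reducer; downsampling caps the number of entries considered per overly dense column, which is what keeps the runtime linear.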
This is part of my work for the ROBUST research project, where this algorithm can be used to find near-duplicates of user posts in forums or to predict missing links in social graphs.
Here’s a picture of my current approach; more details to come: