Lucene with PlingStemmer

I’ve recently been working with Java Lucene and its Analyzers, and for a project I worked on the client needed the Porter stemmer algorithm. I used the SnowballAnalyzer, but unfortunately I found out that, as someone before me said, the Porter stemmer works right in 90% of the cases, but when it fails, it fails hard! Here is an example: consider the words “organic”, “organ” and “organization”. The three words don’t have much in common except their prefix, and they do not mean the same thing, but Porter (and the SnowballAnalyzer) stems all of them to “organ”. The Lucene 3.1.x release will bring plenty of new features that let programmers control and fine-tune each stemming algorithm.
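To make the consequence for search concrete, here is a toy sketch of an index keyed by stem. The stems are hard-coded from the example above rather than computed, so this is not a Porter implementation, and the class and method names (OverStemmingDemo, buildIndex) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class OverStemmingDemo {
    // Word -> stem, hard-coded from the example in the text (assumption: no real stemmer is run here).
    static final Map<String, String> REPORTED_STEMS = new LinkedHashMap<String, String>();
    static {
        REPORTED_STEMS.put("organic", "organ");
        REPORTED_STEMS.put("organ", "organ");
        REPORTED_STEMS.put("organization", "organ");
    }

    /** Builds a toy inverted index: stem -> original words filed under that stem. */
    static Map<String, List<String>> buildIndex() {
        Map<String, List<String>> index = new LinkedHashMap<String, List<String>>();
        for (Map.Entry<String, String> entry : REPORTED_STEMS.entrySet()) {
            List<String> postings = index.get(entry.getValue());
            if (postings == null) {
                postings = new ArrayList<String>();
                index.put(entry.getValue(), postings);
            }
            postings.add(entry.getKey());
        }
        return index;
    }

    public static void main(String[] args) {
        // A search for "organ" now also hits "organic" and "organization".
        System.out.println(buildIndex().get("organ"));
        // prints [organic, organ, organization]
    }
}
```

All three unrelated words end up under a single index term, so the index can no longer tell them apart at query time.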

So, what can I do, since I must use the 3.0.3 release? Well, I created a new PlingStemFilter using the YAGO Java implementation of the Pling stemmer, following the instructions found here.

Here you can find only the PlingStemFilter class, which uses the classes PlingStemmer, FinalMap and FinalSet from the YAGO package (you can download the full package from the YAGO javatools webpage).

Here is my Java class for Lucene:

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
// PlingStemmer (with FinalMap and FinalSet) comes from the YAGO javatools
// package: import it here unless you copied it into this class's package.

/**
 * Token filter that replaces each term with its Pling stem.
 *
 * @author thePanz@gmail.com
 */
public final class PlingStemFilter extends TokenFilter {
  private final TermAttribute termAtt;

  public PlingStemFilter(TokenStream in) {
    super(in);
    termAtt = addAttribute(TermAttribute.class);
  }

  @Override
  public final boolean incrementToken() throws IOException {
    // No more tokens in the wrapped stream.
    if (!input.incrementToken())
      return false;

    // Keep the original term when the stemmer returns an empty string.
    String stemmed = PlingStemmer.stem(termAtt.term());
    if (!stemmed.isEmpty()) {
      termAtt.setTermBuffer(stemmed);
    }

    return true;
  }
}

That’s it! Copy and paste the above source code into PlingStemFilter.java and you’re done. This filter can be used like all other Lucene filters.