Class NGramTokenizer

All Implemented Interfaces:
Closeable, AutoCloseable
Direct Known Subclasses:
EdgeNGramTokenizer

public class NGramTokenizer extends Tokenizer
Tokenizes the input into n-grams of the given size(s).

Unlike NGramTokenFilter, this class sets offsets so that the characters between startOffset and endOffset in the original stream are the same as the term chars.

For example, "abcde" would be tokenized as (minGram=2, maxGram=3):

ngram tokens example:

  Term                 ab      abc     bc      bcd     cd      cde     de
  Position increment   1       1       1       1       1       1       1
  Position length      1       1       1       1       1       1       1
  Offsets              [0,2[   [0,3[   [1,3[   [1,4[   [2,4[   [2,5[   [3,5[

This tokenizer changed a lot in Lucene 4.4 in order to:

  • tokenize in a streaming fashion to support streams which are larger than 1024 chars (limit of the previous version),
  • count grams based on unicode code points instead of java chars (and never split in the middle of surrogate pairs),
  • give the ability to pre-tokenize the stream before computing n-grams.
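The switch to counting grams in Unicode code points (rather than Java chars) can be illustrated with plain Java, no Lucene required. This is a sketch of the idea, not the Lucene implementation; the helper name `codePointGrams` is hypothetical:

```java
import java.util.ArrayList;
import java.util.List;

public class CodePointGrams {
    // Illustrative sketch: count grams in Unicode code points, so that
    // surrogate pairs (e.g. emoji) are never split in the middle.
    static List<String> codePointGrams(String s, int n) {
        int[] cps = s.codePoints().toArray();
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= cps.length; i++) {
            // String(int[], offset, count) rebuilds text from code points,
            // emitting both halves of any surrogate pair together
            grams.add(new String(cps, i, n));
        }
        return grams;
    }

    public static void main(String[] args) {
        // "a😀b" is 4 Java chars but only 3 code points, so it yields
        // exactly two 2-grams, and neither splits the emoji.
        System.out.println(codePointGrams("a\uD83D\uDE00b", 2));
    }
}
```

A char-based count would have seen 4 units here and produced a gram ending in an unpaired surrogate, which is exactly what the 4.4 rewrite avoids.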

Additionally, this class doesn't trim trailing whitespace, and it emits tokens in a different order: tokens are now emitted by increasing start offset, whereas they used to be emitted by increasing length (which prevented supporting large input streams).
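The post-4.4 emission order described above can be sketched in plain Java. This is an illustration of the ordering, not the actual streaming implementation:

```java
import java.util.ArrayList;
import java.util.List;

public class NGramOrderSketch {
    // Illustrative sketch: emit n-grams by increasing start offset, and
    // for each start offset by increasing length. Because a gram's start
    // never moves backwards, the input can be consumed as a stream.
    static List<String> ngrams(String s, int minGram, int maxGram) {
        List<String> out = new ArrayList<>();
        for (int start = 0; start < s.length(); start++) {
            for (int len = minGram; len <= maxGram && start + len <= s.length(); len++) {
                // the token text equals the characters between its offsets
                out.add(s.substring(start, start + len));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Matches the table above: ab, abc, bc, bcd, cd, cde, de
        System.out.println(ngrams("abcde", 2, 3));
    }
}
```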

  • Field Details

    • DEFAULT_MIN_NGRAM_SIZE

      public static final int DEFAULT_MIN_NGRAM_SIZE
    • DEFAULT_MAX_NGRAM_SIZE

      public static final int DEFAULT_MAX_NGRAM_SIZE
    • charBuffer

      private CharacterUtils.CharacterBuffer charBuffer
    • buffer

      private int[] buffer
    • bufferStart

      private int bufferStart
    • bufferEnd

      private int bufferEnd
    • offset

      private int offset
    • gramSize

      private int gramSize
    • minGram

      private int minGram
    • maxGram

      private int maxGram
    • exhausted

      private boolean exhausted
    • lastCheckedChar

      private int lastCheckedChar
    • lastNonTokenChar

      private int lastNonTokenChar
    • edgesOnly

      private boolean edgesOnly
    • termAtt

      private final CharTermAttribute termAtt
    • posIncAtt

      private final PositionIncrementAttribute posIncAtt
    • posLenAtt

      private final PositionLengthAttribute posLenAtt
    • offsetAtt

      private final OffsetAttribute offsetAtt
  • Constructor Details

    • NGramTokenizer

      NGramTokenizer(int minGram, int maxGram, boolean edgesOnly)
    • NGramTokenizer

      public NGramTokenizer(int minGram, int maxGram)
      Creates NGramTokenizer with given min and max n-grams.
      Parameters:
      minGram - the smallest n-gram to generate
      maxGram - the largest n-gram to generate
    • NGramTokenizer

      NGramTokenizer(AttributeFactory factory, int minGram, int maxGram, boolean edgesOnly)
    • NGramTokenizer

      public NGramTokenizer(AttributeFactory factory, int minGram, int maxGram)
      Creates NGramTokenizer with given min and max n-grams.
      Parameters:
      factory - AttributeFactory to use
      minGram - the smallest n-gram to generate
      maxGram - the largest n-gram to generate
    • NGramTokenizer

      public NGramTokenizer()
      Creates NGramTokenizer with default min and max n-grams.
  • Method Details

    • init

      private void init(int minGram, int maxGram, boolean edgesOnly)
    • incrementToken

      public final boolean incrementToken() throws IOException
      Description copied from class: TokenStream
      Consumers (i.e., IndexWriter) use this method to advance the stream to the next token. Implementing classes must implement this method and update the appropriate AttributeImpls with the attributes of the next token.

The producer must make no assumptions about the attributes after the method has returned: the caller may arbitrarily change them. If the producer needs to preserve the state for subsequent calls, it can use AttributeSource.captureState() to create a copy of the current attribute state.

      This method is called for every token of a document, so an efficient implementation is crucial for good performance. To avoid calls to AttributeSource.addAttribute(Class) and AttributeSource.getAttribute(Class), references to all AttributeImpls that this stream uses should be retrieved during instantiation.

      To ensure that filters and consumers know which attributes are available, the attributes must be added during instantiation. Filters and consumers are not required to check for availability of attributes in TokenStream.incrementToken().

      Specified by:
      incrementToken in class TokenStream
      Returns:
      false for end of stream; true otherwise
      Throws:
      IOException
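The consumer contract above (reset, then a loop on incrementToken(), then end() and close()) can be sketched with a minimal stand-in class. The names FakeStream and currentTerm are hypothetical; a real consumer would drive org.apache.lucene.analysis.TokenStream and read its attributes:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class ConsumerLoopSketch {
    // Minimal stand-in illustrating the TokenStream consumer contract;
    // not a Lucene class.
    static class FakeStream {
        private final List<String> tokens;
        private Iterator<String> it;
        String currentTerm; // stands in for CharTermAttribute

        FakeStream(List<String> tokens) { this.tokens = tokens; }

        void reset() { it = tokens.iterator(); }  // called before consumption begins
        boolean incrementToken() {                // advance to the next token
            if (it.hasNext()) { currentTerm = it.next(); return true; }
            return false;                         // false signals end of stream
        }
        void end() { /* end-of-stream work, e.g. final offset */ }
        void close() { }
    }

    static List<String> consume(FakeStream ts) {
        List<String> seen = new ArrayList<>();
        ts.reset();                    // always reset before the first incrementToken()
        while (ts.incrementToken()) {  // loop until the stream is exhausted
            seen.add(ts.currentTerm);  // read attributes while the token is current
        }
        ts.end();                      // then end() ...
        ts.close();                    // ... and close()
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(consume(new FakeStream(List.of("ab", "bc"))));
    }
}
```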
    • updateLastNonTokenChar

      private void updateLastNonTokenChar()
    • consume

      private void consume()
      Consume one code point.
    • isTokenChar

      protected boolean isTokenChar(int chr)
      Only collect characters which satisfy this condition.
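The effect of isTokenChar is that characters failing the predicate act as token boundaries, which is how subclasses pre-tokenize the stream before n-grams are computed. The sketch below mirrors that behavior in plain Java (it is not the Lucene implementation, and the predicate shown is just an example):

```java
import java.util.ArrayList;
import java.util.List;

public class PreTokenizeSketch {
    // Example predicate: only letters are token chars, so whitespace,
    // digits, and punctuation become boundaries that no gram may span.
    static boolean isTokenChar(int chr) {
        return Character.isLetter(chr);
    }

    static List<String> grams(String s, int minGram, int maxGram) {
        List<String> out = new ArrayList<>();
        int runStart = 0;
        for (int i = 0; i <= s.length(); i++) {
            boolean boundary = i == s.length() || !isTokenChar(s.charAt(i));
            if (boundary) {
                // compute n-grams within the current run of token chars only
                String run = s.substring(runStart, i);
                for (int start = 0; start < run.length(); start++) {
                    for (int len = minGram; len <= maxGram && start + len <= run.length(); len++) {
                        out.add(run.substring(start, start + len));
                    }
                }
                runStart = i + 1;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // "ab cd": the space is not a token char, so no gram spans it.
        System.out.println(grams("ab cd", 2, 2));
    }
}
```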
    • end

      public final void end() throws IOException
      Description copied from class: TokenStream
      This method is called by the consumer after the last token has been consumed, after TokenStream.incrementToken() returned false (using the new TokenStream API). Streams implementing the old API should upgrade to use this feature.

This method can be used to perform any end-of-stream operations, such as setting the final offset of a stream. The final offset of a stream might differ from the offset of the last token, e.g., if one or more whitespace characters followed the last token but a WhitespaceTokenizer was used.

Additionally, any skipped positions (such as those removed by a stop filter) can be applied to the position increment, as can any other attribute adjustments where the end-of-stream value may be important.

      If you override this method, always call super.end().

      Overrides:
      end in class TokenStream
      Throws:
      IOException - If an I/O error occurs
    • reset

      public final void reset() throws IOException
      Description copied from class: TokenStream
      This method is called by a consumer before it begins consumption using TokenStream.incrementToken().

      Resets this stream to a clean state. Stateful implementations must implement this method so that they can be reused, just as if they had been created fresh.

      If you override this method, always call super.reset(), otherwise some internal state will not be correctly reset (e.g., Tokenizer will throw IllegalStateException on further usage).

      Overrides:
      reset in class Tokenizer
      Throws:
      IOException