Class Analyzer

  • All Implemented Interfaces:
    java.io.Closeable, java.lang.AutoCloseable
  • Direct Known Subclasses:
    AnalyzerWrapper, CollationKeyAnalyzer, CustomAnalyzer, DutchAnalyzer, ICUCollationKeyAnalyzer, KeywordAnalyzer, KoreanAnalyzer, SimpleAnalyzer, SmartChineseAnalyzer, StopwordAnalyzerBase, UnicodeWhitespaceAnalyzer, WhitespaceAnalyzer

    public abstract class Analyzer
    extends java.lang.Object
    implements java.io.Closeable
    An Analyzer builds TokenStreams, which analyze text. It thus represents a policy for extracting index terms from text.

    In order to define what analysis is done, subclasses must define their TokenStreamComponents in createComponents(String). The components are then reused in each call to tokenStream(String, Reader).

    Simple example:

     Analyzer analyzer = new Analyzer() {
       @Override
       protected TokenStreamComponents createComponents(String fieldName) {
         // Tokenizers take no Reader at construction time; the reader is
         // supplied later through TokenStreamComponents.setReader(Reader).
         Tokenizer source = new FooTokenizer();
         TokenStream filter = new FooFilter(source);
         filter = new BarFilter(filter);
         return new TokenStreamComponents(source, filter);
       }
       @Override
       protected TokenStream normalize(String fieldName, TokenStream in) {
         // Assuming FooFilter is about normalization and BarFilter is about
         // stemming, only FooFilter should be applied
         return new FooFilter(in);
       }
     };
     
    For more examples, see the Analysis package documentation.

    For some concrete implementations bundled with Lucene, look in the analysis modules:

    • Common: Analyzers for indexing content in different languages and domains.
    • ICU: Exposes functionality from ICU to Apache Lucene.
    • Kuromoji: Morphological analyzer for Japanese text.
    • Morfologik: Dictionary-driven lemmatization for the Polish language.
    • Phonetic: Analysis for indexing phonetic signatures (for sounds-alike search).
    • Smart Chinese: Analyzer for Simplified Chinese, which indexes words.
    • Stempel: Algorithmic Stemmer for the Polish Language.
    Since:
    3.1
    • Method Detail

      • tokenStream

        public final TokenStream tokenStream(java.lang.String fieldName,
                                             java.io.Reader reader)
        Returns a TokenStream suitable for fieldName, tokenizing the contents of reader.

        This method uses createComponents(String) to obtain an instance of Analyzer.TokenStreamComponents. It returns the sink of the components and stores the components internally. Subsequent calls to this method will reuse the previously stored components after resetting them through Analyzer.TokenStreamComponents.setReader(Reader).

        NOTE: After calling this method, the consumer must follow the workflow described in TokenStream to properly consume its contents. See the Analysis package documentation for some examples demonstrating this.

        NOTE: If your data is available as a String, use tokenStream(String, String), which reuses a StringReader-like instance internally.

        Parameters:
        fieldName - the name of the field the created TokenStream is used for
        reader - the reader the stream's source reads from
        Returns:
        TokenStream for iterating the analyzed content of reader
        Throws:
        AlreadyClosedException - if the Analyzer is closed.
        See Also:
        tokenStream(String, String)
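
        A minimal sketch of this workflow, assuming StandardAnalyzer and CharTermAttribute from the standard Lucene distribution (the field name "body" and the sample text here are arbitrary); the same loop applies to tokenStream(String, String):

          import java.io.IOException;
          import java.io.StringReader;
          import org.apache.lucene.analysis.Analyzer;
          import org.apache.lucene.analysis.TokenStream;
          import org.apache.lucene.analysis.standard.StandardAnalyzer;
          import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

          public class TokenStreamDemo {
            public static void main(String[] args) throws IOException {
              try (Analyzer analyzer = new StandardAnalyzer();
                   TokenStream ts = analyzer.tokenStream("body", new StringReader("Some Text to Analyze"))) {
                CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
                ts.reset();                   // mandatory before the first incrementToken()
                while (ts.incrementToken()) { // advance to the next token
                  System.out.println(term.toString());
                }
                ts.end();                     // record end-of-stream state, e.g. the final offset
              }                               // try-with-resources closes the stream, then the analyzer
            }
          }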
      • tokenStream

        public final TokenStream tokenStream(java.lang.String fieldName,
                                             java.lang.String text)
        Returns a TokenStream suitable for fieldName, tokenizing the contents of text.

        This method uses createComponents(String) to obtain an instance of Analyzer.TokenStreamComponents. It returns the sink of the components and stores the components internally. Subsequent calls to this method will reuse the previously stored components after resetting them through Analyzer.TokenStreamComponents.setReader(Reader).

        NOTE: After calling this method, the consumer must follow the workflow described in TokenStream to properly consume its contents. See the Analysis package documentation for some examples demonstrating this.

        Parameters:
        fieldName - the name of the field the created TokenStream is used for
        text - the String the stream's source reads from
        Returns:
        TokenStream for iterating the analyzed content of text
        Throws:
        AlreadyClosedException - if the Analyzer is closed.
        See Also:
        tokenStream(String, Reader)
      • normalize

        public final BytesRef normalize(java.lang.String fieldName,
                                        java.lang.String text)
        Normalize a string down to the representation that it would have in the index.

        This is typically used by query parsers in order to generate a query on a given term, without tokenizing or stemming, which are undesirable if the string to analyze is a partial word (e.g., in the case of a wildcard or fuzzy query).

        This method uses initReaderForNormalization(String, Reader) in order to apply necessary character-level normalization and then normalize(String, TokenStream) in order to apply the normalizing token filters.
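
        A hedged sketch of typical query-parser-style usage, assuming StandardAnalyzer (whose normalization chain lowercases but, as this method guarantees, never tokenizes or stems):

          import org.apache.lucene.analysis.Analyzer;
          import org.apache.lucene.analysis.standard.StandardAnalyzer;
          import org.apache.lucene.util.BytesRef;

          public class NormalizeDemo {
            public static void main(String[] args) {
              try (Analyzer analyzer = new StandardAnalyzer()) {
                // Normalize the literal prefix of a wildcard query such as "Wild*".
                BytesRef term = analyzer.normalize("body", "Wild");
                System.out.println(term.utf8ToString()); // "wild" with StandardAnalyzer
              }
            }
          }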

      • initReader

        protected java.io.Reader initReader(java.lang.String fieldName,
                                            java.io.Reader reader)
        Override this if you want to add a CharFilter chain.

        The default implementation returns reader unchanged.

        Parameters:
        fieldName - IndexableField name being indexed
        reader - original Reader
        Returns:
        reader, optionally decorated with CharFilter(s)
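
        A hedged sketch of such an override, assuming HTMLStripCharFilter and WhitespaceTokenizer from the analysis-common module (imports omitted, in the style of the class-level example above):

          Analyzer analyzer = new Analyzer() {
            @Override
            protected TokenStreamComponents createComponents(String fieldName) {
              Tokenizer source = new WhitespaceTokenizer();
              return new TokenStreamComponents(source, source);
            }

            @Override
            protected Reader initReader(String fieldName, Reader reader) {
              // Decorate the original Reader so markup such as "<b>" is
              // stripped before the tokenizer sees any characters.
              return new HTMLStripCharFilter(reader);
            }
          };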
      • getPositionIncrementGap

        public int getPositionIncrementGap(java.lang.String fieldName)
        Invoked before indexing an IndexableField instance if terms have already been added to that field. This allows custom analyzers to place an automatic position increment gap between IndexableField instances using the same field name. The default position increment gap is 0. With a 0 position increment gap and the typical default token position increment of 1, all terms in a field are in successive positions, even across IndexableField instances; this allows exact PhraseQuery matches, for instance, across IndexableField instance boundaries.
        Parameters:
        fieldName - IndexableField name being indexed.
        Returns:
        position increment gap, added to the next token emitted from tokenStream(String,Reader). This value must be >= 0.
      • getOffsetGap

        public int getOffsetGap(java.lang.String fieldName)
        Just like getPositionIncrementGap(java.lang.String), but for Token offsets instead. By default this returns 1. This method is only called if the field produced at least one token for indexing.
        Parameters:
        fieldName - the field just indexed
        Returns:
        offset gap, added to the next token emitted from tokenStream(String,Reader). This value must be >= 0.
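
        A hedged sketch combining both gap overrides for a multi-valued field (the gap of 100 is an arbitrary choice, merely larger than any phrase you expect to query; tokenizer as in the sketch above):

          Analyzer analyzer = new Analyzer() {
            @Override
            protected TokenStreamComponents createComponents(String fieldName) {
              Tokenizer source = new WhitespaceTokenizer();
              return new TokenStreamComponents(source, source);
            }

            @Override
            public int getPositionIncrementGap(String fieldName) {
              // 100 empty positions between IndexableField instances, so a
              // PhraseQuery cannot match across value boundaries.
              return 100;
            }

            @Override
            public int getOffsetGap(String fieldName) {
              // Keep character offsets of successive values from overlapping;
              // the default of 1 models a single separator character.
              return 1;
            }
          };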
      • setVersion

        public void setVersion(Version v)
        Set the version of Lucene whose behavior this analyzer should mimic for analysis.
      • getVersion

        public Version getVersion()
        Return the version of Lucene whose behavior this analyzer mimics for analysis.
      • close

        public void close()
        Frees persistent resources used by this Analyzer.
        Specified by:
        close in interface java.lang.AutoCloseable
        Specified by:
        close in interface java.io.Closeable