java.lang.Object
org.apache.lucene.analysis.icu.segmentation.ScriptIterator

final class ScriptIterator extends Object
An iterator that locates ISO 15924 script boundaries in text.

This is not the same as simply looking at the Unicode block, or even the Script property. Some characters are 'common' across multiple scripts, and some 'inherit' the script value of text surrounding them.

This is similar to ICU's (internal-only) UScriptRun, with the following differences:

  • Doesn't attempt to match paired punctuation. For tokenization purposes, this is not necessary. It's also quite expensive.
  • Non-spacing marks inherit the script of their base character, following recommendations from UTR #24.
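The handling of 'common' and 'inherited' characters described above can be sketched as follows. This is a minimal illustration, not the Lucene implementation: it assumes the JDK's java.lang.Character.UnicodeScript as a stand-in for ICU's UScript codes, and the names ScriptRunSketch, Run, isNeutral and runs are hypothetical. Treating INHERITED as neutral is what keeps a combining mark in the run of its base character here.

```java
import java.lang.Character.UnicodeScript;
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; not the Lucene implementation. */
final class ScriptRunSketch {

  /** Half-open run [start, limit) of chars that share one script. */
  static final class Run {
    final int start;
    final int limit;
    final UnicodeScript script;

    Run(int start, int limit, UnicodeScript script) {
      this.start = start;
      this.limit = limit;
      this.script = script;
    }
  }

  /** COMMON and INHERITED characters never force a boundary on their own. */
  static boolean isNeutral(UnicodeScript script) {
    return script == UnicodeScript.COMMON || script == UnicodeScript.INHERITED;
  }

  /** Splits text into script runs, attaching neutral characters to the surrounding run. */
  static List<Run> runs(String text) {
    List<Run> result = new ArrayList<>();
    int index = 0;
    while (index < text.length()) {
      int runStart = index;
      UnicodeScript runScript = UnicodeScript.COMMON; // unresolved until a concrete script is seen
      while (index < text.length()) {
        int codePoint = text.codePointAt(index);
        UnicodeScript script = UnicodeScript.of(codePoint);
        boolean compatible = isNeutral(runScript) || isNeutral(script) || script == runScript;
        if (!compatible) {
          break; // a different concrete script starts the next run
        }
        if (isNeutral(runScript) && !isNeutral(script)) {
          runScript = script; // the first concrete script names the run
        }
        index += Character.charCount(codePoint);
      }
      result.add(new Run(runStart, index, runScript));
    }
    return result;
  }

  public static void main(String[] args) {
    // Latin text with punctuation, followed by Cyrillic text with punctuation.
    for (Run run : runs("Hello, \u043c\u0438\u0440!")) {
      System.out.println(run.script + " [" + run.start + ", " + run.limit + ")");
    }
  }
}
```

In this sketch, punctuation and spaces stay with the run they follow, and a run that contains only neutral characters is reported as COMMON.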
  • Field Summary

    Fields
    Modifier and Type             Field          Description
    private static final int[]    basicLatin     Linear fast path for the Basic Latin case
    private final boolean         combineCJ
    private int                   index
    private int                   limit
    private int                   scriptCode
    private int                   scriptLimit
    private int                   scriptStart
    private int                   start
    private char[]                text
  • Constructor Summary

    Constructors
    Constructor                          Description
    ScriptIterator(boolean combineCJ)
  • Method Summary

    Modifier and Type            Method                                         Description
    private int                  getScript(int codepoint)                       Fast version of UScript.getScript().
    (package private) int        getScriptCode()                                Get the UScript script code for this script run.
    (package private) int        getScriptLimit()                               Get the index of the first character after the end of this script run.
    (package private) int        getScriptStart()                               Get the start of this script run.
    private static boolean       isSameScript(int scriptOne, int scriptTwo)     Determine if two scripts are compatible.
    (package private) boolean    next()                                         Iterates to the next script run, returning true if one exists.
    (package private) void       setText(char[] text, int start, int length)    Set a new region of text to be examined by this iterator.

    Methods inherited from class java.lang.Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
  • Field Details

    • text

      private char[] text
    • start

      private int start
    • limit

      private int limit
    • index

      private int index
    • scriptStart

      private int scriptStart
    • scriptLimit

      private int scriptLimit
    • scriptCode

      private int scriptCode
    • combineCJ

      private final boolean combineCJ
    • basicLatin

      private static final int[] basicLatin
      Linear fast path for the Basic Latin case.
  • Constructor Details

    • ScriptIterator

      ScriptIterator(boolean combineCJ)
      Parameters:
      combineCJ - if true, Han, Hiragana, and Katakana will all be returned as UScript.JAPANESE
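The effect of combineCJ can be illustrated with a small sketch. It is an assumption-laden stand-in rather than the actual code: it uses the JDK's Character.UnicodeScript instead of ICU's UScript, represents the combined result with a plain string label, and the names CombineCJSketch and scriptLabel are hypothetical.

```java
import java.lang.Character.UnicodeScript;

/** Illustrative sketch only; not the Lucene implementation. */
final class CombineCJSketch {

  /** Hypothetical label standing in for UScript.JAPANESE. */
  static final String JAPANESE = "JAPANESE";

  /** Returns a script label, folding Han, Hiragana and Katakana together when combineCJ is set. */
  static String scriptLabel(int codePoint, boolean combineCJ) {
    UnicodeScript script = UnicodeScript.of(codePoint);
    boolean isCJ = script == UnicodeScript.HAN
        || script == UnicodeScript.HIRAGANA
        || script == UnicodeScript.KATAKANA;
    if (combineCJ && isCJ) {
      return JAPANESE;
    }
    return script.name();
  }

  public static void main(String[] args) {
    int kanji = 0x65E5;    // a Han ideograph
    int hiragana = 0x3042; // a Hiragana letter
    System.out.println(scriptLabel(kanji, false) + " / " + scriptLabel(hiragana, false));
    System.out.println(scriptLabel(kanji, true) + " / " + scriptLabel(hiragana, true));
  }
}
```

Because all three scripts map to one label when the flag is set, text that mixes kanji and kana would form a single run instead of breaking at every script change.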
  • Method Details

    • getScriptStart

      int getScriptStart()
      Get the start of this script run
      Returns:
      start position of script run
    • getScriptLimit

      int getScriptLimit()
      Get the index of the first character after the end of this script run
      Returns:
      position of the first character after this script run
    • getScriptCode

      int getScriptCode()
      Get the UScript script code for this script run
      Returns:
      code for the script of the current run
    • next

      boolean next()
      Iterates to the next script run, returning true if one exists.
      Returns:
      true if there is another script run, false otherwise.
    • isSameScript

      private static boolean isSameScript(int scriptOne, int scriptTwo)
      Determine if two scripts are compatible.
    • setText

      void setText(char[] text, int start, int length)
      Set a new region of text to be examined by this iterator
      Parameters:
      text - text buffer to examine
      start - offset into buffer
      length - maximum length to examine
    • getScript

      private int getScript(int codepoint)
      Fast version of UScript.getScript(). Basic Latin is handled with an array lookup.
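
The array-lookup fast path can be sketched roughly as below. This is an assumption, not the actual method body: the table here holds JDK Character.UnicodeScript values rather than ICU int script codes, it is filled at class-load time, and the names BasicLatinFastPathSketch and BASIC_LATIN are hypothetical.

```java
import java.lang.Character.UnicodeScript;

/** Illustrative sketch only; not the Lucene implementation. */
final class BasicLatinFastPathSketch {

  /** One precomputed entry per Basic Latin code point (U+0000 to U+007F). */
  private static final UnicodeScript[] BASIC_LATIN = new UnicodeScript[128];

  static {
    for (int codePoint = 0; codePoint < BASIC_LATIN.length; codePoint++) {
      BASIC_LATIN[codePoint] = UnicodeScript.of(codePoint);
    }
  }

  /** Array lookup for Basic Latin, general property lookup for everything else. */
  static UnicodeScript getScript(int codePoint) {
    if (codePoint >= 0 && codePoint < BASIC_LATIN.length) {
      return BASIC_LATIN[codePoint];
    }
    return UnicodeScript.of(codePoint);
  }

  public static void main(String[] args) {
    System.out.println(getScript('a'));    // served from the table
    System.out.println(getScript(0x0416)); // served by the general lookup
  }
}
```

The table trades 128 entries of memory for skipping the general property lookup on the most common input in many corpora.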