Class SloppyPhraseMatcher
- java.lang.Object
-
- org.apache.lucene.search.PhraseMatcher
-
- org.apache.lucene.search.SloppyPhraseMatcher
-
final class SloppyPhraseMatcher extends PhraseMatcher
Find all slop-valid position-combinations (matches) encountered while traversing/hopping the PhrasePositions.
The sloppy frequency contribution of a match depends on the distance:
- highest freq for distance=0 (exact match).
- freq gets lower as distance gets higher.
Example: for query "a b"~2, a document "x a b a y" can be matched twice: once for "a b" (distance=0), and once for "b a" (distance=2).
Possibly not all valid combinations are encountered, because for efficiency we always propagate the least PhrasePosition. This allows to base on PriorityQueue and move forward faster. As result, for example, document "a b c b a" would score differently for queries "a b c"~4 and "c b a"~4, although they really are equivalent. Similarly, for doc "a b c b a f g", query "c b"~2 would get same score as "g f"~2, although "c b"~2 could be matched twice. We may want to fix this in the future (currently not, for performance reasons).
-
-
Field Summary
Fields Modifier and Type Field Description private DocIdSetIterator
approximation
private boolean
captureLeadMatch
private boolean
checkedRpts
private int
end
private boolean
hasMultiTermRpts
private boolean
hasRpts
private ImpactsDISI
impactsApproximation
private int
leadEndOffset
private int
leadOffset
private int
leadOrd
private int
leadPosition
private int
matchLength
private int
numPostings
private PhrasePositions[]
phrasePositions
private boolean
positioned
private PhraseQueue
pq
private PhrasePositions[][]
rptGroups
private PhrasePositions[]
rptStack
private int
slop
-
Constructor Summary
Constructors Constructor Description SloppyPhraseMatcher(PhraseQuery.PostingsAndFreq[] postings, int slop, ScoreMode scoreMode, Similarity.SimScorer scorer, float matchCost, boolean captureLeadMatch)
-
Method Summary
All Methods Instance Methods Concrete Methods Modifier and Type Method Description private boolean
advancePP(PhrasePositions pp)
advance a PhrasePosition and update 'end', return false if exhaustedprivate boolean
advanceRepeatGroups()
At initialization (each doc), each repetition group is sorted by (query) offset.private boolean
advanceRpts(PhrasePositions pp)
pp was just advanced.(package private) DocIdSetIterator
approximation()
Approximation that only matches documents that have all terms.private void
captureLead(PhrasePositions pp)
private int
collide(PhrasePositions pp)
index of a pp2 colliding with pp, or -1 if noneint
endOffset()
The end offset of the current matchint
endPosition()
The end position of the current matchprivate void
fillQueue()
Fill the queue (all pps are already placedprivate java.util.ArrayList<java.util.ArrayList<PhrasePositions>>
gatherRptGroups(java.util.LinkedHashMap<Term,java.lang.Integer> rptTerms)
Detect repetition groups.(package private) ImpactsDISI
impactsApproximation()
Approximation that is aware of impacts.private boolean
initComplex()
with repeats: not so simple.private boolean
initFirstTime()
initialize with checking for repeats.private boolean
initPhrasePositions()
Initialize PhrasePositions in place.private void
initSimple()
no repeats: simplest case, and most common.private PhrasePositions
lesser(PhrasePositions pp, PhrasePositions pp2)
compare two pps, but only by position and offset(package private) float
maxFreq()
An upper bound on the number of possible matches on this documentboolean
nextMatch()
Find the next match on the current document, returningfalse
if there are none.private void
placeFirstPositions()
move all PPs to their first positionprivate java.util.ArrayList<FixedBitSet>
ppTermsBitSets(PhrasePositions[] rpp, java.util.HashMap<Term,java.lang.Integer> tord)
bit-sets - for each repeating pp, for each of its repeating terms, the term ordinal values is setprivate PhrasePositions[]
repeatingPPs(java.util.HashMap<Term,java.lang.Integer> rptTerms)
find repeating pps, and for each, if has multi-terms, update this.hasMultiTermRptsprivate java.util.LinkedHashMap<Term,java.lang.Integer>
repeatingTerms()
find repeating terms and assign them ordinal valuesvoid
reset()
Called afterPhraseMatcher.approximation()
has been advanced(package private) float
sloppyWeight()
The slop-adjusted weight of the current match The sum of the slop-adjusted weights is used as the freq for scoringprivate void
sortRptGroups(java.util.ArrayList<java.util.ArrayList<PhrasePositions>> rgs)
sort each repetition group by (query) offset.int
startOffset()
The start offset of the current matchint
startPosition()
The start position of the current matchprivate java.util.HashMap<Term,java.lang.Integer>
termGroups(java.util.LinkedHashMap<Term,java.lang.Integer> tord, java.util.ArrayList<FixedBitSet> bb)
map each term to the single group that contains itprivate int
tpPos(PhrasePositions pp)
Actual position in doc of a PhrasePosition, relies on that position = tpPos - offset)private void
unionTermGroups(java.util.ArrayList<FixedBitSet> bb)
union (term group) bit-sets until they are disjoint (O(n^^2)), and each group have different terms-
Methods inherited from class org.apache.lucene.search.PhraseMatcher
getMatchCost
-
-
-
-
Field Detail
-
phrasePositions
private final PhrasePositions[] phrasePositions
-
slop
private final int slop
-
numPostings
private final int numPostings
-
pq
private final PhraseQueue pq
-
captureLeadMatch
private final boolean captureLeadMatch
-
approximation
private final DocIdSetIterator approximation
-
impactsApproximation
private final ImpactsDISI impactsApproximation
-
end
private int end
-
leadPosition
private int leadPosition
-
leadOffset
private int leadOffset
-
leadEndOffset
private int leadEndOffset
-
leadOrd
private int leadOrd
-
hasRpts
private boolean hasRpts
-
checkedRpts
private boolean checkedRpts
-
hasMultiTermRpts
private boolean hasMultiTermRpts
-
rptGroups
private PhrasePositions[][] rptGroups
-
rptStack
private PhrasePositions[] rptStack
-
positioned
private boolean positioned
-
matchLength
private int matchLength
-
-
Constructor Detail
-
SloppyPhraseMatcher
SloppyPhraseMatcher(PhraseQuery.PostingsAndFreq[] postings, int slop, ScoreMode scoreMode, Similarity.SimScorer scorer, float matchCost, boolean captureLeadMatch)
-
-
Method Detail
-
approximation
DocIdSetIterator approximation()
Description copied from class:PhraseMatcher
Approximation that only matches documents that have all terms.- Specified by:
approximation
in classPhraseMatcher
-
impactsApproximation
ImpactsDISI impactsApproximation()
Description copied from class:PhraseMatcher
Approximation that is aware of impacts.- Specified by:
impactsApproximation
in classPhraseMatcher
-
maxFreq
float maxFreq() throws java.io.IOException
Description copied from class:PhraseMatcher
An upper bound on the number of possible matches on this document- Specified by:
maxFreq
in classPhraseMatcher
- Throws:
java.io.IOException
-
reset
public void reset() throws java.io.IOException
Description copied from class:PhraseMatcher
Called afterPhraseMatcher.approximation()
has been advanced- Specified by:
reset
in classPhraseMatcher
- Throws:
java.io.IOException
-
sloppyWeight
float sloppyWeight()
Description copied from class:PhraseMatcher
The slop-adjusted weight of the current match The sum of the slop-adjusted weights is used as the freq for scoring- Specified by:
sloppyWeight
in classPhraseMatcher
-
nextMatch
public boolean nextMatch() throws java.io.IOException
Description copied from class:PhraseMatcher
Find the next match on the current document, returningfalse
if there are none.- Specified by:
nextMatch
in classPhraseMatcher
- Throws:
java.io.IOException
-
captureLead
private void captureLead(PhrasePositions pp) throws java.io.IOException
- Throws:
java.io.IOException
-
startPosition
public int startPosition()
Description copied from class:PhraseMatcher
The start position of the current match- Specified by:
startPosition
in classPhraseMatcher
-
endPosition
public int endPosition()
Description copied from class:PhraseMatcher
The end position of the current match- Specified by:
endPosition
in classPhraseMatcher
-
startOffset
public int startOffset() throws java.io.IOException
Description copied from class:PhraseMatcher
The start offset of the current match- Specified by:
startOffset
in classPhraseMatcher
- Throws:
java.io.IOException
-
endOffset
public int endOffset() throws java.io.IOException
Description copied from class:PhraseMatcher
The end offset of the current match- Specified by:
endOffset
in classPhraseMatcher
- Throws:
java.io.IOException
-
advancePP
private boolean advancePP(PhrasePositions pp) throws java.io.IOException
advance a PhrasePosition and update 'end', return false if exhausted- Throws:
java.io.IOException
-
advanceRpts
private boolean advanceRpts(PhrasePositions pp) throws java.io.IOException
pp was just advanced. If that caused a repeater collision, resolve by advancing the lesser of the two colliding pps. Note that there can only be one collision, as by the initialization there were no collisions before pp was advanced.- Throws:
java.io.IOException
-
lesser
private PhrasePositions lesser(PhrasePositions pp, PhrasePositions pp2)
compare two pps, but only by position and offset
-
collide
private int collide(PhrasePositions pp)
index of a pp2 colliding with pp, or -1 if none
-
initPhrasePositions
private boolean initPhrasePositions() throws java.io.IOException
Initialize PhrasePositions in place. A one time initialization for this scorer (on first doc matching all terms):- Check if there are repetitions
- If there are, find groups of repetitions.
- no repetitions: "ho my"~2
- repetitions: "ho my my"~2
- repetitions: "my ho my"~2
- Returns:
- false if PPs are exhausted (and so current doc will not be a match)
- Throws:
java.io.IOException
-
initSimple
private void initSimple() throws java.io.IOException
no repeats: simplest case, and most common. It is important to keep this piece of the code simple and efficient- Throws:
java.io.IOException
-
initComplex
private boolean initComplex() throws java.io.IOException
with repeats: not so simple.- Throws:
java.io.IOException
-
placeFirstPositions
private void placeFirstPositions() throws java.io.IOException
move all PPs to their first position- Throws:
java.io.IOException
-
fillQueue
private void fillQueue()
Fill the queue (all pps are already placed
-
advanceRepeatGroups
private boolean advanceRepeatGroups() throws java.io.IOException
At initialization (each doc), each repetition group is sorted by (query) offset. This provides the start condition: no collisions.Case 1: no multi-term repeats
It is sufficient to advance each pp in the group by one less than its group index. So lesser pp is not advanced, 2nd one advance once, 3rd one advanced twice, etc.Case 2: multi-term repeats
- Returns:
- false if PPs are exhausted.
- Throws:
java.io.IOException
-
initFirstTime
private boolean initFirstTime() throws java.io.IOException
initialize with checking for repeats. Heavy work, but done only for the first candidate doc.If there are repetitions, check if multi-term postings (MTP) are involved.
Without MTP, once PPs are placed in the first candidate doc, repeats (and groups) are visible.
With MTP, a more complex check is needed, up-front, as there may be "hidden collisions".
For example P1 has {A,B}, P1 has {B,C}, and the first doc is: "A C B". At start, P1 would point to "A", p2 to "C", and it will not be identified that P1 and P2 are repetitions of each other.The more complex initialization has two parts:
(1) identification of repetition groups.
(2) advancing repeat groups at the start of the doc.
For (1), a possible solution is to just create a single repetition group, made of all repeating pps. But this would slow down the check for collisions, as all pps would need to be checked. Instead, we compute "connected regions" on the bipartite graph of postings and terms.- Throws:
java.io.IOException
-
sortRptGroups
private void sortRptGroups(java.util.ArrayList<java.util.ArrayList<PhrasePositions>> rgs)
sort each repetition group by (query) offset. Done only once (at first doc) and allows to initialize faster for each doc.
-
gatherRptGroups
private java.util.ArrayList<java.util.ArrayList<PhrasePositions>> gatherRptGroups(java.util.LinkedHashMap<Term,java.lang.Integer> rptTerms) throws java.io.IOException
Detect repetition groups. Done once - for first doc- Throws:
java.io.IOException
-
tpPos
private final int tpPos(PhrasePositions pp)
Actual position in doc of a PhrasePosition, relies on that position = tpPos - offset)
-
repeatingTerms
private java.util.LinkedHashMap<Term,java.lang.Integer> repeatingTerms()
find repeating terms and assign them ordinal values
-
repeatingPPs
private PhrasePositions[] repeatingPPs(java.util.HashMap<Term,java.lang.Integer> rptTerms)
find repeating pps, and for each, if has multi-terms, update this.hasMultiTermRpts
-
ppTermsBitSets
private java.util.ArrayList<FixedBitSet> ppTermsBitSets(PhrasePositions[] rpp, java.util.HashMap<Term,java.lang.Integer> tord)
bit-sets - for each repeating pp, for each of its repeating terms, the term ordinal values is set
-
unionTermGroups
private void unionTermGroups(java.util.ArrayList<FixedBitSet> bb)
union (term group) bit-sets until they are disjoint (O(n^^2)), and each group have different terms
-
termGroups
private java.util.HashMap<Term,java.lang.Integer> termGroups(java.util.LinkedHashMap<Term,java.lang.Integer> tord, java.util.ArrayList<FixedBitSet> bb) throws java.io.IOException
map each term to the single group that contains it- Throws:
java.io.IOException
-
-