[1] T. Strohman, H. Turtle, and B. Croft. Optimization strategies for complex queries. In Proceedings of ACM SIGIR conference, 2005.
[2] K. Chakrabarti, S. Chaudhuri, V. Ganti. Interval-Based Pruning for Top-k Processing over Compressed Lists, in Proc. of ICDE, 2011.
[3] A. Z. Broder, D. Carmel, M. Herscovici, A. Soffer, J. Y. Zien. Efficient Query Evaluation using a Two-Level Retrieval Process, in Proc. of CIKM, 2003.
[4] S. Ding and T. Suel. Faster top-k document retrieval using block-max indexes. SIGIR, 2011.
How much faster? - AND query
How much faster? - OR query (1)
How much faster? - OR query (2)
How much faster? - Term query
Algorithms
Posting list retrieval and the challenge on disjunction
Query "search OR engine"
MaxScore
Introduced by H.R.Turtle and J.Flood in 1995
Interval-based pruning
MaxScore variant adopted to block compressed indexes [2]
WAND
Special operator proposed in [3]
"WAND" is the abbreviation for "Week AND" or "Weighted AND"
OR is being close to AND when a document contains a large enough subset of the query terms
Score of a document having a large subset of the query terms is higher than the ones of documents with a few of them
Sounds familiar?
Lucene already has similar concept :
"Minimum Should Match"
WAND
Query "the OR search OR engine OR lucene"
WAND
Steps
Assume current threshold (kth highest score) is 12.
Sort postings by current pointer.
Find "pivot" term and docid - here, that is "search" and id=486.
Calculate the partial score for doc 486 if it also contains "the" and "engine".
Block-max WAND
WAND variant working with block compressed indexes [4]
Finally come in Lucene!
Block-max WAND
Dive into Implementation
Disclaimer
This is about low-level, complex part of Lucene. Could include mistakes... 🙂
Lucene API can be rapidly changed. This is based on branch_8_0 branch.