ZF-7736: Re-design of Highlight Term gathering by use of a temporary in-memory Indexer (patch)

Description

Search result highlighting in ZEND Lucene is a problem.

The way Lucene gathers to be highlighted strings, by cascading through internal highlightMatches() methods in all the different Zend_Search_Lucene_Search_Query... classes, often re-tokenizing the entire to be highlighted text, is very slow, error prone, and usually duplicates code from the rewrite() methods in the same classes.

This method of doing things is also is a blocker for some nifty features I did with other patches, dealing with multi-token ability (among th cool features related to stem and synonym searches that are listed in the current code behind @todo: lines) since they require calls to the rewrite() method of newly created sub queries, which is no problem when coming from a rewrite() method, but does not work within _highlightMatches() due to the unavailability of an indexer.

No longer !

This patch re-designs how highlighting terms are gathered, simply using rewrite()->getQueryTerms().

To be able to rewrite in the first place, the to be highlighted text needs to be stored in an index. While in 99.99% of all cases those texts would be in the main search index, the class definition calls for the highlight function to be index independent.

Rightfully so :)

Therefore I wrote a "dummy index class", which implements the whole Zend_Search_Lucene_Interface, but does not write to a real Lucene index on disk, instead stores an index simulated by PHP assotiative arrays in memory.

class Zend_Search_Lucene_TempIndex

I gave Query->highlightMatches an optional 5th parameter to supply a different index to be used in rewrite(). Without that, the TempIndex is instantiated, the to be highlighted document added with ->addDocument, and then used for rewrite().

Then I removed all the _highlightMatches() functions from all Query Classes ;)

Regards

Corvus Corax

Comments

While reading ZF-2598 I just noticed that it might be necessary to use a different analyzer for indexing of the to-be highlighted text and for rewrite()-ing the search query to get the highlighting tokens.

Would it make sense to give $query->highlightMatches() an additional parameter to override the analyzer used to tokenize the html string?

What do you think? Would be an easy enough change to this patch.

Bulk change of all issues last updated before 1st January 2010 as "Won't Fix".

Feel free to re-open and provide a patch if you want to fix this issue.

Bug has been closed by default in "Bulk change". However patch is already attached to bug, what has been missing is a maintainer