ZF-7519: WARNING is raised in Zend_Search_Lucene_Document_Html when using $query->highlightMatches() with Highlighter_Default
After searching a big index with a xxx search term (xxx standing for any short and common phrase), and calling $query->highlightMatches() on all the strings stored in the result,
I sometimes got WARNINGs caused by class Zend_Search_Lucene_Document_Html line 400: $wordsToHighlight = call_user_func_array('array_merge', $wordsToHighlightList);
because $wordsToHighlightList was an empty array, so array_merge() was called with no arguments.
By writing a custom highlighter extending Zend_Search_Lucene_Search_Highlighter_Default, I was able to work around this by not passing the call on
if $words was an empty array to begin with.
I have not found out when and why Zend_Search_Lucene_Search_Highlighter::highlight() was called with an empty array as parameter when using the $query->highlightMatches() function.
I do think this indicates a problem deeper down.
However, Zend_Search_Lucene_Search_Highlighter_Default should probably catch that case, and/or Zend_Search_Lucene_Document_Html::highlightMatches() should.
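A minimal sketch of the kind of guard meant here. It is not the framework's actual code: mergeTokenLists() is a hypothetical helper name, and the token lists are plain arrays standing in for the analyzer output. PHP versions before 7.4 emit a warning when array_merge() receives no arguments, which is what an empty list leads to.

```php
<?php
// Hypothetical helper, not part of Zend Framework: merges a list of token
// arrays and returns an empty array when there is nothing to merge.
function mergeTokenLists(array $tokenLists)
{
    if (count($tokenLists) === 0) {
        // Avoid calling array_merge() with zero arguments.
        return array();
    }

    return call_user_func_array('array_merge', $tokenLists);
}

// Normal case: two token lists are flattened into one.
$merged = mergeTokenLists(array(array('foo'), array('bar', 'baz')));

// Edge case from this report: no words to highlight at all.
$empty = mergeTokenLists(array());
```

With a check like this in place, the caller can return the unmodified HTML early whenever the merged list is empty.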
I had this problem with 1.8.2. However, the code differences between 1.8.2 and 1.9 are superficial and mostly deal with handling binary access to index files on disk, so I believe all later versions are affected, too.
I also encountered extreme performance issues using $query->highlightMatches() (as in 30 seconds to highlight matches in around 100 words, using a 28k document index and roughly 100k terms in the index, doubling the search time.)
Result highlighting also seems to have issues if there are multibyte characters in the source (highlighted areas are offset by the number of multibyte characters prior to that position).
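One plausible mechanism for such a shift, shown only as an illustration and not confirmed as the cause inside Zend_Search_Lucene: byte offsets and character offsets diverge as soon as UTF-8 multibyte characters precede the match. The sample text and variable names below are made up for the demonstration.

```php
<?php
// "naïve café search" written with explicit UTF-8 byte escapes.
$text   = "na\xC3\xAFve caf\xC3\xA9 search";
$needle = 'search';

// strpos() counts bytes, not characters.
$byteOffset = strpos($text, $needle);

// Count the UTF-8 characters in front of that byte position.
$charOffset = preg_match_all('/./us', substr($text, 0, $byteOffset), $unused);

// Each two-byte character before the match adds one to the difference.
$shift = $byteOffset - $charOffset;
```

Here the two offsets differ by two, one for each of the two-byte characters before the match. Mixing the two kinds of offset would move a highlight by exactly that amount.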
I finally fixed the issue for myself by not using $query->highlightMatches(), but writing my own highlighting routine based on $query->rewrite()->optimize()->getQueryTerms() and comparing them to tokens from $analyzer->tokenize("to be highlighted result text") ... ->getTermText(), which I can then trace to source string positions.
It's far more performant than $query->highlightMatches(), produces more accurate results, doesn't cause the above bug, and works with non-HTML source.
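A rough, self-contained sketch of that approach, under stated assumptions: tokenizeWithOffsets() is a hypothetical stand-in for $analyzer->tokenize() that returns term text plus byte offsets, and $queryTerms stands in for the term texts taken from getQueryTerms(). It is not the routine actually used, only an illustration of matching tokens against query terms and mapping them back to source positions.

```php
<?php
// Hypothetical stand-in for $analyzer->tokenize(): returns a list of
// array(termText, startByteOffset, endByteOffset) entries.
function tokenizeWithOffsets($text)
{
    $tokens = array();
    // PREG_OFFSET_CAPTURE reports byte offsets, which match substr() below.
    preg_match_all('/[\p{L}\p{N}]+/u', $text, $matches, PREG_OFFSET_CAPTURE);
    foreach ($matches[0] as $match) {
        list($word, $start) = $match;
        // strtolower() is used for simplicity; treat it as ASCII-only.
        $tokens[] = array(strtolower($word), $start, $start + strlen($word));
    }
    return $tokens;
}

// Wraps every token whose text equals one of the query terms.
function highlightTerms($text, array $queryTerms, $open = '<b>', $close = '</b>')
{
    $wanted = array_flip(array_map('strtolower', $queryTerms));
    $tokens = tokenizeWithOffsets($text);

    // Walk backwards so earlier offsets stay valid while markers are inserted.
    for ($i = count($tokens) - 1; $i >= 0; $i--) {
        list($term, $start, $end) = $tokens[$i];
        if (isset($wanted[$term])) {
            $text = substr($text, 0, $start) . $open
                  . substr($text, $start, $end - $start) . $close
                  . substr($text, $end);
        }
    }
    return $text;
}

// Plain-text input with a multibyte character before the match.
$highlighted = highlightTerms("Caf\xC3\xA9 search works", array('search'));
```

Because tokenizing and slicing both work on byte offsets, multibyte characters in front of a match do not shift the markers, and an empty term list simply returns the text unchanged.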
I had used an analyzer that stored multiple tokens for the same position in the source string (a synonym indexer). I assume the original bug was caused by more than one of the indexed tokens for the same word in the to-be-highlighted / originally indexed text matching the search pattern. But that's a wild guess.