Issues

ZF-7519: WARNING is raised in Zend_Search_Lucene_Document_Html when using $query->highlightMatches() with Highlighter_Default

Description

After searching a big index with an xxx search term (xxx standing for any short and common phrase) and calling $query->highlightMatches() on all the strings stored in the result,

I sometimes got WARNINGs caused by class Zend_Search_Lucene_Document_Html line 400: $wordsToHighlight = call_user_func_array('array_merge', $wordsToHighlightList);

because $wordsToHighlightList was an empty array.

By writing a custom highlighter extending Zend_Search_Lucene_Search_Highlighter_Default, I was able to work around this by not calling

$this->_doc->highlight($words, $color);

if $words is an empty array to begin with.
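
A minimal sketch of that workaround follows. The two stand-in classes (FakeHtmlDocument and BaseHighlighter) are hypothetical placeholders so the snippet runs without Zend Framework; in real code the subclass would extend Zend_Search_Lucene_Search_Highlighter_Default and $_doc would be a Zend_Search_Lucene_Document_Html. The color value is illustrative as well.

```php
<?php
// Hypothetical stand-in for Zend_Search_Lucene_Document_Html: it only records calls.
class FakeHtmlDocument
{
    public $calls = array();

    public function highlight($words, $color)
    {
        $this->calls[] = array($words, $color);
    }
}

// Hypothetical stand-in for Zend_Search_Lucene_Search_Highlighter_Default.
class BaseHighlighter
{
    protected $_doc;

    public function setDocument($document)
    {
        $this->_doc = $document;
    }

    public function highlight($words)
    {
        $this->_doc->highlight($words, '#66ffff');
    }
}

// The workaround: skip the call entirely when there is nothing to highlight.
class SafeHighlighter extends BaseHighlighter
{
    public function highlight($words)
    {
        if (empty($words)) {
            return; // the document's highlight() is never reached with an empty list
        }
        parent::highlight($words);
    }
}

$doc = new FakeHtmlDocument();
$highlighter = new SafeHighlighter();
$highlighter->setDocument($doc);

$highlighter->highlight(array());          // ignored
$highlighter->highlight(array('lucene'));  // forwarded to the document

echo count($doc->calls), "\n"; // prints 1
```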

I have not found out when and why Zend_Search_Lucene_Search_Highlighter::highlight() was called with an empty array as a parameter when using the $query->highlightMatches() function.

I do think this indicates a problem deeper down.

However, Zend_Search_Lucene_Search_Highlighter_Default should probably catch that case, and/or Zend_Search_Lucene_Document_Html::highlightMatches() should.

I had this problem with 1.8.2. However, the code differences between 1.8.2 and 1.9 are superficial and mostly deal with handling binary access to index files on disk, so I do believe all later versions are affected, too.

I also encountered extreme performance issues using $query->highlightMatches() (as in 30 seconds to highlight matches in around 100 words, using a 28k document index and roughly 100k terms in the index, doubling the search time).

Result highlighting also seems to have issues if there are multibyte characters in the source (highlighted areas are offset by the number of multibyte characters prior to that position).
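
One way such a shift can arise, shown here only as an illustration and not as a diagnosis of the Zend code, is mixing byte offsets with character offsets. In UTF-8, strpos() counts bytes while mb_strpos() counts characters, so the two disagree by the extra bytes of the multibyte characters that precede the match. The sample string is made up, and the snippet assumes the mbstring extension is available.

```php
<?php
// "Käse über Lucene" written with explicit UTF-8 bytes; 'ä' and 'ü' take two bytes each.
$text = "K\xC3\xA4se \xC3\xBCber Lucene";

$byteOffset = strpos($text, 'Lucene');                 // counts bytes
$charOffset = mb_strpos($text, 'Lucene', 0, 'UTF-8');  // counts characters

echo $byteOffset, ' ', $charOffset, "\n"; // prints "12 10"

// Using the byte offset as a character offset starts two characters too far right.
echo mb_substr($text, $byteOffset, 6, 'UTF-8'), "\n"; // prints "cene"
echo mb_substr($text, $charOffset, 6, 'UTF-8'), "\n"; // prints "Lucene"
```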

I finally fixed the issue for myself by not using $query->highlightMatches() at all, but writing my own highlighting routine based on $query->rewrite()->optimize()->getQueryTerms() and comparing those terms to tokens from $analyzer->tokenize("to be highlighted result text") ... ->getTermText(), which I can then trace back to source string positions.

It's way more performant than $query->highlightMatches(), produces more accurate results, doesn't cause the above bug, and works with non-HTML source.
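
The approach described above can be sketched roughly as follows. This is not the reporter's code: a plain regex tokenizer stands in for $analyzer->tokenize(), a hard-coded term list stands in for the result of getQueryTerms(), and the function name highlightTerms and the [[ ]] markers are made up for illustration. Offsets come from preg_match_all() with PREG_OFFSET_CAPTURE, which reports byte offsets, and the text is cut with substr(), which also works on bytes, so multibyte characters do not shift the highlighted ranges. The snippet assumes the mbstring extension is available.

```php
<?php
/**
 * Wrap every token of $text that matches one of $terms in $open/$close markers.
 * Hypothetical helper; $terms are expected in lower case.
 */
function highlightTerms($text, array $terms, $open = '[[', $close = ']]')
{
    if (empty($terms)) {
        return $text; // nothing to highlight
    }
    $lookup = array_flip($terms);

    // Tokenize on runs of letters/digits; each match carries its byte offset.
    if (!preg_match_all('/[\p{L}\p{N}]+/u', $text, $matches, PREG_OFFSET_CAPTURE)) {
        return $text; // no tokens found, or the input is not valid UTF-8
    }

    $result = '';
    $cursor = 0; // byte position up to which $text has been copied
    foreach ($matches[0] as $match) {
        list($token, $offset) = $match;
        if (!isset($lookup[mb_strtolower($token, 'UTF-8')])) {
            continue;
        }
        // Copy the untouched text before the token, then the marked token.
        $result .= substr($text, $cursor, $offset - $cursor);
        $result .= $open . $token . $close;
        $cursor  = $offset + strlen($token);
    }
    return $result . substr($text, $cursor);
}

echo highlightTerms("K\xC3\xA4se and Lucene search", array('lucene', 'search')), "\n";
```

Because the text is scanned once and each token is checked with a hash lookup, the cost grows with the length of the text rather than with the size of the index.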

Important info:

I had used an analyzer that stored multiple tokens for the same position in the source string (a synonym indexer). I assume the original bug was caused by more than one of the indexed tokens for the same word in the to-be-highlighted / originally indexed text matching the search pattern. But that's a wild guess.

Comments

The xxx is supposed to be *xxx* as in the Lucene query language. Unfortunately, this bug tracker software rendered that as a bold "xxx".

Patch ZF-7736 should fix this issue, since the highlighter will always be called with non-empty arguments.

However it might still be good practice to catch the case of an empty argument in the highlighter itself.

This patch should make the highlight function safe: it checks for emptiness before passing the list to call_user_func_array('array_merge')...

I wonder: what purpose does this specific construct have anyway? Why use call_user_func_array?

Could any code comments shed some light on that?
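
For what it is worth, call_user_func_array('array_merge', $list) is a common PHP idiom for flattening a list of arrays: each element of $list is passed to array_merge() as a separate argument. With an empty list, array_merge() receives no arguments at all, which PHP versions before 7.4 report as a warning. The sketch below shows the idiom with a guard; the helper name mergeWordLists and the sample data are made up for illustration and are not taken from the attached patch.

```php
<?php
// Hypothetical helper: flatten a list of word arrays into a single array.
function mergeWordLists(array $lists)
{
    if (count($lists) === 0) {
        return array(); // guard: never call array_merge() without arguments
    }
    // Equivalent to array_merge($lists[0], $lists[1], ...).
    // array_values() drops any string keys so they are not treated as
    // named arguments on PHP 8.
    return call_user_func_array('array_merge', array_values($lists));
}

print_r(mergeWordLists(array(array('foo', 'bar'), array('baz'))));
print_r(mergeWordLists(array()));
```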

Bulk change of all issues last updated before 1st January 2010 as "Won't Fix".

Feel free to re-open and provide a patch if you want to fix this issue.

The bug has been closed by default in the "Bulk change". However, a patch is already attached to the bug; what has been missing is a maintainer.