ZF-6899: Can not search chinese string by chinese UTF-8 word sequence


StandardAnalyzer in java lucene could search chinese string by chinese UTF-8 word sequence. For example: There are two document, one is "文件编辑器", another is "编码格式" . When we search string "编辑", the analyzer we use will cut the string into "编" and "辑". We will get both of these two document result in Zend, but in Java lucene, we could get the correct doc - the first document.

Could Zend search consider the relative position in two terms? Thanks!


Chinese is one of the languages that you do not leave space mark between words by a word unit.

Unfortunately , Zend_Search_Lucene seems to not support Chinese, and also incorporating languages, agglutinative languages now.

I think it would be 'postpone'.

Hi Satoru,

Thanks for your reply. I have investigated this issue by comparing Java lucene, and found the root cause is that Zend only uses Zend_Search_Lucene_Search_Query_MultiTerm for each clasue separated by Zend_Search_Lucene_Search_QueryParser(please see Zend\Search\Lucene\Search\QueryEntry\Term.php:line 182).

Is there any way to use Zend_Search_Lucene_Search_Query_Phrase for each sub clause?

I just wrote a patch that - among other things, does exactly that. Though in Zend\Search\Lucene\Search\Query/Preprocessing/Term.php


Previously if a Term search phrase was tokenized into several tokens by the analyzer, Zend\Search\Lucene\Search\Query/Preprocessing/Term.php created a Zend_Search_Lucene_Search_Query_MultiTerm query in its rewrite() function. I changed that into a Zend_Search_Lucene_Search_Query_Phrase

Theoretically that should fix the chinese phrase problem.

Ill run a test on the two supplied strings...

I am obviously missing the analyzer you mentioned that tokenizes each chinese word into a seperate token. Common_Utf8Num just tokenises all of 文件编辑器 into one token which can be matched with * 编辑 * (However with terrible performance due to the leading *)

I still think the patch should in theory fix that, but I need someone with access to that Analyzer to confirm.

Bulk change of all issues last updated before 1st January 2010 as "Won't Fix".

Feel free to re-open and provide a patch if you want to fix this issue.

Bug has been closed by default in "Bulk change". However patch is already attached to bug (in this case to ZF-7738), what has been missing is a maintainer