Issues

ZF-2982: Keyword fields are not searchable using the QueryParser if not lowercase

Description

I have to enter my keywords in lowercase to find them with the QueryParser; otherwise I get no results.

How to reproduce:

<?php
require 'Zend/Search/Lucene.php';

$index = Zend_Search_Lucene::create('test');
$doc = new Zend_Search_Lucene_Document();
$doc->addField(Zend_Search_Lucene_Field::Keyword('firstname', 'Patrick'));
$doc->addField(Zend_Search_Lucene_Field::Keyword('lastname', 'Allaert'));
$index->addDocument($doc);

// ------------> OOPS, expecting "1" in both cases
echo count($index->find(Zend_Search_Lucene_Search_QueryParser::parse('Patrick'))) . "\n"; // print "0"
echo count($index->find(Zend_Search_Lucene_Search_QueryParser::parse('patrick'))) . "\n"; // print "0"

$index = Zend_Search_Lucene::create('test2');
$doc = new Zend_Search_Lucene_Document();
$doc->addField(Zend_Search_Lucene_Field::Keyword('firstname', 'patrick'));
$doc->addField(Zend_Search_Lucene_Field::Keyword('lastname', 'allaert'));
$index->addDocument($doc);

// OK, here I correctly have "1" in both cases
echo count($index->find(Zend_Search_Lucene_Search_QueryParser::parse('Patrick'))) . "\n"; // print "1"
echo count($index->find(Zend_Search_Lucene_Search_QueryParser::parse('patrick'))) . "\n"; // print "1"
?>

Comments

Can you please assess and categorize as necessary?

Test reproducing this bug

I could not reproduce this bug exactly as the OP shows; I get a slightly different result:

These unit tests pass against trunk:


Index: tests/Zend/Search/Lucene/SearchTest.php
===================================================================
--- tests/Zend/Search/Lucene/SearchTest.php     (revision 24462)
+++ tests/Zend/Search/Lucene/SearchTest.php     (working copy)
@@ -486,4 +486,36 @@

         Zend_Search_Lucene::setResultSetLimit($storedResultSetLimit);
     }
+
+    /**
+     * @group ZF-2982
+     */
+    public function testKeywordFieldSearchableUsingQueryParserWhenTermNotLowercase()
+    {
+        $index = Zend_Search_Lucene::create('test');
+
+        $doc = new Zend_Search_Lucene_Document();
+        $doc->addField(Zend_Search_Lucene_Field::Keyword('firstname', 'Patrick'));
+
+        $index->addDocument($doc);
+
+        $this->assertTrue(count($index->find(Zend_Search_Lucene_Search_QueryParser::parse('Patrick'))) == 1);
+        $this->assertTrue(count($index->find(Zend_Search_Lucene_Search_QueryParser::parse('patrick'))) == 0);
+    }
+
+    /**
+     * @group ZF-2982
+     */
+    public function testKeywordFieldSearchableUsingQueryParserWhenTermIsLowercase()
+    {
+        $index = Zend_Search_Lucene::create('test');
+
+        $doc = new Zend_Search_Lucene_Document();
+        $doc->addField(Zend_Search_Lucene_Field::Keyword('firstname', 'patrick'));
+
+        $index->addDocument($doc);
+
+        $this->assertTrue(count($index->find(Zend_Search_Lucene_Search_QueryParser::parse('Patrick'))) == 1);
+        $this->assertTrue(count($index->find(Zend_Search_Lucene_Search_QueryParser::parse('patrick'))) == 1);
+    }
 }

So, when the keyword contains a capital letter ('Patrick'), the search is exact ('Patrick' matches, 'patrick' doesn't). However, when the keyword is all lowercase ('patrick'), the search is not exact (both 'Patrick' and 'patrick' match).

To prove the capital-letter point, I added this unit test, and it passes:


/**
 * @group ZF-2982
 */
public function testKeywordFieldSearchableUsingQueryParserWhenTermNotLowercase2()
{
    $index = Zend_Search_Lucene::create('test');

    $doc = new Zend_Search_Lucene_Document();
    $doc->addField(Zend_Search_Lucene_Field::Keyword('firstname', 'paTrIcK'));        
    $index->addDocument($doc);

    $this->assertTrue(count($index->find(Zend_Search_Lucene_Search_QueryParser::parse('Patrick'))) == 0);
    $this->assertTrue(count($index->find(Zend_Search_Lucene_Search_QueryParser::parse('patrick'))) == 0);
}

'paTrIcK' matches neither 'Patrick' nor 'patrick'.

I believe the issue stems from the way case-insensitive searches are performed. The manual says:

"The standard analyzer may transform the query string to lower case for case-insensitivity, remove stop-words, and stem among other transformations.

The API method doesn't transform or filter input terms in any way. It's therefore more suitable for computer generated or untokenized fields."

(source: http://framework.zend.com/manual/en/…)
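The test results in this thread are consistent with a two-step lookup: the query word is first compared to the stored keyword as-is, and only then lowercased and compared again, while the keyword value itself is never lowercased. The sketch below models that guess in plain PHP. It is an assumption made to explain the observed counts, not the library's actual code, and `modelKeywordMatch` is a hypothetical name.

```php
<?php
// Hypothetical model of the observed behavior, NOT Zend_Search_Lucene code.

/**
 * Returns true when a query word would match a stored keyword value
 * under the assumed two-step lookup.
 */
function modelKeywordMatch(string $storedKeyword, string $queryWord): bool
{
    // Step 1 (assumed): the raw query word is compared to the stored value as-is.
    if ($queryWord === $storedKeyword) {
        return true;
    }

    // Step 2 (assumed): the query word is lowercased, as an analyzer would do,
    // and compared to the stored value, which was indexed without lowercasing.
    return strtolower($queryWord) === $storedKeyword;
}

// Print a 1/0 match count for every stored keyword and query word in the tests.
foreach (['Patrick', 'patrick', 'paTrIcK'] as $stored) {
    foreach (['Patrick', 'patrick'] as $query) {
        printf("%s / %s: %d\n", $stored, $query, modelKeywordMatch($stored, $query) ? 1 : 0);
    }
}
```

Under this model a stored 'Patrick' matches only the query 'Patrick', a stored 'patrick' matches both queries, and a stored 'paTrIcK' matches neither, which are the counts the unit tests assert.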

The simplest workaround is to make sure that all your keyword field values are lowercased (strtolower).
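A minimal sketch of that workaround, assuming the field values are collected in an array before indexing. `lowercaseKeywordValues` is a hypothetical helper, and the Zend call appears only in a comment so that the snippet runs without the framework.

```php
<?php
// Hypothetical helper: lowercases keyword values before they are indexed.

/**
 * @param array $fields field name => raw value
 * @return array field name => lowercased value
 */
function lowercaseKeywordValues(array $fields): array
{
    // strtolower() works on single-byte characters; values with multibyte
    // letters would need mb_strtolower() (mbstring extension) instead.
    // array_map() with a single array preserves the string keys.
    return array_map('strtolower', $fields);
}

$normalized = lowercaseKeywordValues([
    'firstname' => 'Patrick',
    'lastname'  => 'Allaert',
]);

// Each pair would then be indexed as in the original report, for example:
//   $doc->addField(Zend_Search_Lucene_Field::Keyword($name, $value));
foreach ($normalized as $name => $value) {
    echo $name, ' => ', $value, "\n";
}
```

The original capitalization is lost in the indexed value, so anything that needs to display it would have to keep the raw value elsewhere.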

The current behavior appears to be that if the indexed keyword value is all lowercase, the match is case-insensitive, and if it contains any uppercase characters, the match is case-sensitive. That seems counter-intuitive, but is it a bug? If so, is it something that should be fixed in ZFv1 so late in its lifecycle? Or should we push the issue up to ZF2 for resolution? It's been around since at least 1.5, so chances are that anyone encountering it already has a workaround in place.

Thoughts?

I do indeed have a workaround, named "Solr". Zend Search Lucene is definitely not an option for anyone serious about using search engine technology. A workaround is difficult because you may not know whether your user, when entering a term, intends to search case-sensitively or not. So yes, I consider this a bug. But honestly, I don't care about Zend_Search anymore :)