Zoe Slattery's slides from PHPNW08:
The ability to store large quantities of local data means that many applications require some form of text search and retrieval facility. From the point of view of the application developer there are a number of choices to make, the first is whether to use a complete packaged solution or whether to use one of the available information libraries to build a custom information retrieval (IR) solution. In this talk I’ll look at the options for PHP programmers who choose to embed IR facilities within their applications.
For Java programmers there is clearly a good range of options for text retrieval libraries, but options for PHP programmers are more limited. At first sight for a PHP programmer wishing to embed indexing and search facilities in their application, the choice seems obvious - the PHP implementation of Lucene (Zend Search Lucene). There is no requirement to support another language, the code is PHP therefore easy for PHP programmers to work with and the license is commercially friendly. However, whilst ease of integration and support are key factors in choice of technology, performance can also be important; the performance of the PHP implementation of Lucene is poor compared to the Java implementation.
In this talk I’ll explain the differences in performance between PHP implementation of Lucene and the Java implementation and examine the other options available to PHP programmers for whom performance is a critical factor.
IAC 2024 - IA Fast Track to Search Focused AI Solutions
Search Lucene
1. Can you be dynamic and fast? “ Miss Marple and the case of the Missing MIPS” Zoë Slattery
2.
3.
4.
5. Lucene [2] 2. Gospodnetic, O., Hatcher, E. Lucene in Action. Manning Publications Co., Greenwich. 2005. DB Web File system Get user query Present search results Index Index documents Search index Gather data Lucene Application User
6. Lucene indexing start 3. Inverted index 1. Documents Analysis Index creation Optimise 4. Optimised inverted index . Oh for a muse of fire that would acsend the brightest heaven of invention..... fire ascend ... Henry V, Scouting for boys... Aerospace, Henry V... Terms Documents end [fire] [ascend] [bright] [heaven] 2. Token stream
19. look at the normalize() code public function normalize(Token $srcToken ) { $newToken = new Token( strtolower( $srcToken->getTermText() ), $srcToken->getStartOffset(), $srcToken->getEndOffset()); $newToken->setPositionIncrement($srcToken->getPositionIncrement()); return $newToken; }
29. look at the call to normalize() $token = $this->normalize( new Zend_Search_Lucene_Analysis_Token($str, $pos, $endpos)); public function normalize(Token $srcToken ) { $newToken = new Token(strtolower( $srcToken->getTermText() ), $srcToken->getStartOffset(), $srcToken->getEndOffset()); $newToken->setPositionIncrement($srcToken->getPositionIncrement()); return $newToken; }
30. look at the call to normalize() normalize() recoded.... $token = $this->normalize( new Zend_Search_Lucene_Analysis_Token($str, $pos, $endpos)); public function normalize (Token $srcToken) { $ srcToken->setTermText(strtolower($srcToken->getTermtext())); return $srcToken; }
32. Performance improvement? PHP + fix PHP 151 167 Time to index /seconds 43 43 Time to optimise /seconds Java 32 3 35 194 210 Total time 9.5 % improvement Java + JIT 4 0.3 4.3
33.
34.
35. Options for PHP Y Y Y N N N N Y 5. http://pecl.php.net/package/clucene Do you care about speed? Use Zend Search Lucene Only need basic features? Can support Java environment? Use a Web Service? Use Lucene via a Java bridge No Lucene solution today [5] Use SOLR as web service