Hacking Lucene and Solr for Fun and Profit
2. Hacking Lucene and Solr for Fun and Profit
Grant Ingersoll
CTO, LucidWorks,
grant@lucidworks.com, @gsingers
3. Keyword Search is so yesterday
• Search is a system building block
– text is only a part of the story
• If the algorithms fit, use them!
• Embrace fuzziness!
• Scoring features are everywhere
4. Lucene and Solr can do…
• Classic: Fast, fuzzy text matching across a large document collection
• Data Quality and Analysis
– Faceting, slicing and dicing of numerical/enumerated data
– Spatial
– Spell checking, record linkage, highlighting
– Stats, Missing fields, etc.
• Top N problems
5. Topics
• Search Hacks
• “Trust me, I’m a mathematician”
• “I wish I had thought of that” Hack
10. Analysis
• Split into sentences
– Buffer tokens – see com.tamingtext.texttamer.solr.SentenceTokenizer
• Identify Names using OpenNLP
• Add Entity marker tokens at the same position as original token
– Could also be done with Payloads
• Index
• https://github.com/tamingtext/book/tree/master/src/main/java/com/tamingtext/texttamer/solr
• https://github.com/tamingtext/book/blob/master/apache-solr/solrqa/conf/schema.xml
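The "marker token at the same position" idea can be sketched outside of Lucene. This is only an illustration, not the SentenceTokenizer code from the repository: the marker name NE_PERSON and the name_indexes argument (standing in for the output of a name finder such as OpenNLP) are assumptions made for the example.

```python
from typing import List, NamedTuple, Set


class Token(NamedTuple):
    term: str
    position: int  # token position within the sentence


def add_entity_markers(words: List[str], name_indexes: Set[int],
                       marker: str = "NE_PERSON") -> List[Token]:
    """Emit every word, plus a marker token stacked at the same position
    whenever the word was tagged as part of a name.

    name_indexes is a hypothetical stand-in for a name finder's output.
    """
    tokens = []
    for pos, word in enumerate(words):
        tokens.append(Token(word.lower(), pos))
        if pos in name_indexes:
            # Same position as the original token (position increment of 0).
            tokens.append(Token(marker, pos))
    return tokens


words = "Louis XIV ruled France".split()
tokens = add_entity_markers(words, name_indexes={0, 1})
```

Because the marker shares a position with the word it annotates, a positional query can ask for "a person name near these keywords" without a separate field.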
11. Search Side
• Custom Query Parser takes in user’s natural language query, classifies it to find the Answer Type and generates Solr query
• Retrieve candidate passages that match keywords and expected answer type
• Unlike keyword search, we need to know exactly where matches occur
• https://github.com/tamingtext/book/tree/master/src/main/java/com/tamingtext/qa
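Why match positions matter can be shown with a small positional lookup. This is a rough sketch, not the query parser from the repository: the function names, the NE_PERSON marker and the slop parameter are all hypothetical, and the input is assumed to be (term, position) pairs with entity markers stacked on the tokens they annotate.

```python
from typing import Dict, List, Tuple


def build_positions(tokens: List[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Map each term to the list of positions where it occurs."""
    index: Dict[str, List[int]] = {}
    for term, pos in tokens:
        index.setdefault(term, []).append(pos)
    return index


def candidate_windows(index: Dict[str, List[int]], keywords: List[str],
                      answer_marker: str, slop: int = 3):
    """For each occurrence of the answer-type marker, keep it only if every
    keyword occurs within `slop` positions; return (marker_pos, keyword_pos)."""
    hits = []
    for mpos in index.get(answer_marker, []):
        nearest = []
        for kw in keywords:
            close = [p for p in index.get(kw, []) if abs(p - mpos) <= slop]
            if not close:
                break  # this keyword is too far away from the marker
            nearest.append(min(close, key=lambda p: abs(p - mpos)))
        else:
            hits.append((mpos, nearest))
    return hits


tokens = [("louis", 0), ("NE_PERSON", 0), ("xiv", 1), ("NE_PERSON", 1),
          ("ruled", 2), ("france", 3)]
index = build_positions(tokens)
hits = candidate_windows(index, ["ruled", "france"], "NE_PERSON", slop=3)
```

Each hit carries the position of the candidate answer, which a plain keyword match (document matched or not) would not provide.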
12. Answer Type Classification
• Answer Type examples:
– Person (P), Location (L), Organization (O), Time Point (T), Duration (R), Money (M)
– See page 248 for more
• Train an OpenNLP classifier off of a set of previously annotated questions, e.g.:
– P Which French monarch reinstated the divine right of the monarchy to France and was known as "The Sun King" because of the splendour of his reign?
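The training format above (a label code followed by the question text) can be tried with a toy classifier. Note the swap: this sketch uses a small multinomial Naive Bayes with add-one smoothing rather than OpenNLP, and the training questions are invented for illustration, not taken from the annotated set.

```python
import math
from collections import Counter, defaultdict


def train(lines):
    """Each line is '<label> <question text>', e.g. 'P Who wrote Hamlet'."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    for line in lines:
        label, text = line.split(" ", 1)
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return label_counts, word_counts


def classify(question, model):
    """Return the label with the highest smoothed log-probability."""
    label_counts, word_counts = model
    vocab = {w for counts in word_counts.values() for w in counts}
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in label_counts.items():
        denom = sum(word_counts[label].values()) + len(vocab)
        score = math.log(n / total)
        for word in question.lower().split():
            score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training questions in the "<label> <text>" format.
model = train([
    "P Who wrote the play Hamlet",
    "P Which president signed the treaty",
    "L Where is the Eiffel Tower",
    "L Which city hosts the festival",
    "T When did the war end",
    "T What year was the bridge built",
])
answer_type = classify("Who signed the declaration", model)
```

The predicted code (P, L, T, ...) is what the query side would then use to require a matching entity marker in candidate passages.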
15. kNN and TF/IDF Classification w/ Lucene
https://github.com/tamingtext/book/tree/master/src/main/java/com/tamingtext/classifier/mlt
16. Lucene Classification Module
• Builds classifier off of index information
• See the org.apache.lucene.classification package
• Naïve Bayes Classifier
• kNN Classifier
• Perceptron Classifier
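The kNN-over-TF/IDF idea can be sketched in a few lines without Lucene. This does not use the org.apache.lucene.classification API; it only mimics the principle (score labeled documents against the input, then vote among the top k). The sample documents and all function names are made up for the example.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists. Returns (one weight dict per doc, idf table)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log(n / df[term]) + 1.0 for term in df}
    vectors = [{t: c * idf[t] for t, c in Counter(doc).items()} for doc in docs]
    return vectors, idf


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_classify(text, labeled_docs, k=3):
    """Vote among the k labeled documents most similar to `text`."""
    docs = [doc.lower().split() for _, doc in labeled_docs]
    vectors, idf = tfidf_vectors(docs)
    # Terms never seen in the labeled documents get zero weight.
    query = {t: c * idf.get(t, 0.0)
             for t, c in Counter(text.lower().split()).items()}
    scored = sorted(((cosine(query, vec), label)
                     for vec, (label, _) in zip(vectors, labeled_docs)),
                    reverse=True)
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]


# Hypothetical labeled documents.
training = [
    ("sports", "the team won the match"),
    ("sports", "the striker scored a late goal"),
    ("sports", "coach praised the team defense"),
    ("tech", "the new phone has a fast chip"),
    ("tech", "search engine indexes the web"),
    ("tech", "the chip powers the search server"),
]
label = knn_classify("team scored a goal", training)
```

In a search engine the scoring step is just a query: the input text becomes the query, the top k hits are the neighbors, and their category field supplies the votes.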
17. Recommenders
• Cross recommendation as search
– with search used to build cross recommendation!
• Recommend content to people who exhibit certain behaviors (clicks, query terms, other)
• (Ab)use of a search engine
– but not as a search engine for content
– more like a search engine for behavior
• See Ted Dunning’s talk from Berlin Buzzwords on Multi-modal Recommendation Algorithms
– http://berlinbuzzwords.com/sessions/multi-modal-recommendation-algorithms
• Go get Mahout/Myrrix or just do it in y(our) search engine
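A "search engine for behavior" can be sketched as follows: each item becomes a document whose indicator field lists the items that co-occur with it in user histories, and a user's recent behavior is the query. This is a simplified illustration only. A raw co-occurrence threshold (min_cooccur) stands in for whatever statistical test a real system would use to pick indicators, and the histories are invented.

```python
from collections import Counter, defaultdict
from itertools import permutations


def build_indicator_index(histories, min_cooccur=2):
    """Item -> set of 'indicator' items that co-occur with it often enough."""
    pair_counts = Counter()
    for items in histories.values():
        for a, b in permutations(set(items), 2):
            pair_counts[(a, b)] += 1
    index = defaultdict(set)
    for (a, b), count in pair_counts.items():
        if count >= min_cooccur:
            index[a].add(b)
    return index


def recommend(index, recent_items, top_n=3):
    """Treat recent behavior as a query against the indicator field."""
    query = set(recent_items)
    scores = {item: len(indicators & query)
              for item, indicators in index.items() if item not in query}
    ranked = sorted((i for i, s in scores.items() if s > 0),
                    key=lambda i: (-scores[i], i))
    return ranked[:top_n]


# Hypothetical user histories (user -> items interacted with).
histories = {
    "u1": ["lucene", "solr", "mahout"],
    "u2": ["lucene", "solr"],
    "u3": ["solr", "mahout"],
    "u4": ["lucene", "hadoop"],
}
index = build_indicator_index(histories)
recs = recommend(index, ["lucene"])
```

In a real deployment the indicator sets would be indexed as a multi-valued field and the overlap count replaced by the engine's own relevance scoring.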
22. Time Space Continuum
• Leverage Solr’s new spatial capabilities to index non-spatial data, such as time ranges
– Useful for Open Hours, Shifts, etc.
• Key: multi-valued range data
• Query using rectangle intersections
– q = shift:"Intersects(0 19 23 365)"
• Credits to David Smiley and Hoss…
– https://people.apache.org/~hossman/spatial-for-non-spatial-meetup-20130117/
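The geometry behind the rectangle query can be checked in plain Python. The sketch assumes each time range is indexed as a 2D point (x = start, y = end) and that the four numbers in the query are read as minX minY maxX maxY; both are assumptions made for this illustration, as are the names and the sample shifts.

```python
from typing import List, Tuple

Range = Tuple[int, int]  # (start, end) of one shift, indexed as a 2D point


def intersects(points: List[Range], min_x: int, min_y: int,
               max_x: int, max_y: int) -> bool:
    """True if any (start, end) point of a multi-valued field lies in the rectangle."""
    return any(min_x <= start <= max_x and min_y <= end <= max_y
               for start, end in points)


def overlap_rectangle(q_start: int, q_end: int, lo: int = 0, hi: int = 365):
    """Rectangle matching every range that overlaps [q_start, q_end]:
    start <= q_end (x axis) and end >= q_start (y axis)."""
    return (lo, q_start, q_end, hi)


# Hypothetical multi-valued shift data per person.
shifts = {
    "alice": [(8, 12), (20, 30)],
    "bob": [(1, 5), (40, 50)],
}
rect = overlap_rectangle(19, 23)  # same numbers as the query: 0 19 23 365
matches = [name for name, pts in shifts.items() if intersects(pts, *rect)]
```

Under this reading, the query on the slide finds every record with at least one range that overlaps the interval from 19 to 23, which is why multi-valued point data is the key ingredient.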