A talk given to DJUGL on the 26th July 2010, describing and introducing Solr, and discussing how we use it at Timetric to drive navigation across over a million dataseries.
6. Key-Value Store
Filesystem, Berkeley DB: unstructured
MySQL: structured
7. Foreign Key (RDBMS)
SQLite, MySQL, Postgres, Oracle, ...
Related content through JOINs over structured data
8. Search Engines
Solr (Lucene), Xapian, (Whoosh)
Denormalized, inverted index over unstructured/semi-structured data
9. Other routes to full-text search
http://www.postgresql.org/docs/8.4/static/textsearch.html
http://code.google.com/p/djangosearch/
http://www.sphinxsearch.com/
10. Solr: HTTP interface to Lucene
Lucene written by Doug Cutting (HADOOP),
first release 2001.
Solr in-house CNET project, open-sourced in 2006
Solr 1.4, Lucene 3.0 released November 2009
Solr + Lucene merged in March 2010
Next version - 1.5/3.1/4.0 - not for production use yet.
11. Solr vs. RDBMS
Solr: an Index is composed of Documents
RDBMS: a Table is composed of Rows
ALL DOCUMENTS HAVE THE SAME STRUCTURE
12. • Optional columns
• Denormalized data
Source entities:
Book: Title, ISBN, Author (FK Person), Editor (FK Person), Contributor (M2M Person)
Magazine: Title, ISSN, Publication Frequency
Person: First name, Last name
Document fields and field options:
Entity type: required
Title: required
Identifier: uniqueKey
Pub. Frequency
Associated name: multiValued
Default Search: multiValued, default
copyField: Title -> Default Search
copyField: Associated Name -> Default Search
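The denormalization on this slide can be sketched in Python. This is only an illustration: the field names (entity_type, identifier, associated_name), the helper book_to_document and the sample data are assumptions, not the schema used at Timetric.

```python
def full_name(person):
    """Join a Person's name parts into one searchable string."""
    return "{} {}".format(person["first_name"], person["last_name"])


def book_to_document(book):
    """Flatten a Book and its related Person rows into one flat document."""
    # The FK and M2M relations collapse into a single multi-valued field.
    people = [book["author"], book["editor"]] + list(book["contributors"])
    return {
        "entity_type": "book",   # every document carries its entity type
        "title": book["title"],
        "identifier": book["isbn"],  # serves as the uniqueKey
        "associated_name": [full_name(p) for p in people],
    }


# Placeholder data, not taken from the talk.
book = {
    "title": "An Example Book",
    "isbn": "isbn-placeholder",
    "author": {"first_name": "Ann", "last_name": "Author"},
    "editor": {"first_name": "Ed", "last_name": "Editor"},
    "contributors": [{"first_name": "Con", "last_name": "Tributor"}],
}
document = book_to_document(book)
```

A Magazine or Person would be flattened into the same set of fields, leaving the ones that do not apply empty.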
13. There is no update, only overwrite!!!
Original document: Book; Solar Enterprise Search Server; Identifier; Pub. Freq.; David Smiley, Eric Pugh
Overwritten with: Book; Solr 1.4 Enterprise Search Server; Identifier; Pub. Freq.; David Smiley, Eric Pugh
Solr can't overwrite without a uniqueKey
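A toy model of the overwrite behaviour, assuming an in-memory dictionary stands in for the index. ToyIndex is a hypothetical class for illustration only and says nothing about how Solr stores documents internally.

```python
class ToyIndex:
    """In-memory stand-in for an index keyed on a uniqueKey field."""

    def __init__(self, unique_key="identifier"):
        self.unique_key = unique_key
        self.docs = {}

    def add(self, doc):
        # Adding a document whose key already exists replaces the whole
        # document; there is no merging of individual fields.
        self.docs[doc[self.unique_key]] = dict(doc)


index = ToyIndex()
index.add({
    "identifier": "book-1",
    "title": "Solar Enterprise Search Server",
    "associated_name": ["David Smiley", "Eric Pugh"],
})

# Fixing the title means resending every field, not just the changed one.
index.add({
    "identifier": "book-1",
    "title": "Solr 1.4 Enterprise Search Server",
    "associated_name": ["David Smiley", "Eric Pugh"],
})
```

Sending only the corrected title would leave the stored document without its associated names, which is why the full document has to be rebuilt on every change.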
14. Schema design
<field name="title" type="text" indexed="true" stored="true" required="true" multiValued="false" />
Field types: text, int, long, float, double, date
query
What do you want to search on?
What do you want to do with results?
16. GET http://localhost:8983/solr/select/?q=searchterm
GET http://localhost:8983/solr/current/select/?fq=private%3Afalse&rows=20&facet.field=tags&f.tags.facet.limit=20&f.tags.facet.mincount=1&facet=true&start=0&q=%28tags%3A%22ons%3Adataseries-fullid%3DYBUKQA%22+AND+tags%3A%22united+kingdom%22+AND+NOT+is_mapreduce%3Atrue%29+OR+%28%28tags%3A%22ons%3Adataseries-fullid%3DYBUKQA%22+AND+tags%3A%22united+kingdom%22+AND+is_index%3Atrue%5E100%29
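Query strings like the one above are easier to assemble in code than by hand. A minimal sketch using only the Python standard library follows; build_select_url is a hypothetical helper, and the query is a shortened variant of the one on the slide.

```python
from urllib.parse import urlencode


def build_select_url(base, query, filters=(), facet_field=None, start=0, rows=20):
    """Assemble a Solr select URL with percent-encoded parameters."""
    params = [("q", query), ("start", start), ("rows", rows)]
    params += [("fq", f) for f in filters]
    if facet_field:
        params += [
            ("facet", "true"),
            ("facet.field", facet_field),
            # Per-field override: skip facet values with no matches.
            ("f.%s.facet.mincount" % facet_field, 1),
        ]
    return base + "?" + urlencode(params)


query = 'tags:"united kingdom" AND NOT is_mapreduce:true'
url = build_select_url(
    "http://localhost:8983/solr/select/",
    query,
    filters=["private:false"],
    facet_field="tags",
)
```

urlencode takes care of the escaping seen on the slide: colons become %3A, quotes become %22 and spaces become +.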
17. Need ORM equivalent (OIM?)
Sunburnt:
http://timetric.com/about/opensource/#sunburnt
http://github.com/tow/sunburnt
http://haystacksearch.org/
(cleaves close to Django, not schema-driven)
21. Scaling to a million pages ...
- talk to the Guardian (Content API)
Decouple read/write
Re-indexing/optimizing strategies
FieldType/Analyzer/Tokenizer tweaks
22. Decouple read/write
Separate processes - many readers, single write pipeline.
Beware multiple writers!
Remember standard DB practice -
write to master, read from slave.
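One way to arrange a single write pipeline is to funnel every update through one queue that is drained by one writer thread. The sketch below assumes this design; SingleWriter and its send callback are hypothetical names, and the callback stands in for whatever actually posts a batch to Solr.

```python
import queue
import threading


class SingleWriter:
    """Accepts documents from many producers; one thread sends them in batches."""

    def __init__(self, send, batch_size=100):
        self._send = send              # callable that receives a list of documents
        self._batch_size = batch_size
        self._queue = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def submit(self, doc):
        """Safe to call from any thread or request handler."""
        self._queue.put(doc)

    def close(self):
        """Flush whatever is still queued and stop the writer thread."""
        self._queue.put(None)          # sentinel marks the end of the stream
        self._thread.join()

    def _run(self):
        batch = []
        while True:
            doc = self._queue.get()
            if doc is None:
                break
            batch.append(doc)
            if len(batch) >= self._batch_size:
                self._send(batch)
                batch = []
        if batch:
            self._send(batch)


sent = []                              # stand-in for the real index
writer = SingleWriter(sent.append, batch_size=2)
for i in range(5):
    writer.submit({"identifier": i})
writer.close()
```

Readers never touch this object; they query the index directly, so read load and write load can be scaled separately.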
23. Add documents -> Fast commit -> Warm up facet cache -> Optimize
(each step acts on the Index)
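The add, commit and optimize steps correspond to XML messages posted to Solr's update handler. The sketch below only builds those messages; posting them is left out. The helper names and the batch size are assumptions, and cache warming is not shown because it is typically configured on the Solr side.

```python
import xml.etree.ElementTree as ET


def add_message(docs):
    """Build an <add> message; list values become repeated (multiValued) fields."""
    root = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(root, "doc")
        for name, value in doc.items():
            values = value if isinstance(value, (list, tuple)) else [value]
            for v in values:
                field = ET.SubElement(doc_el, "field", name=name)
                field.text = str(v)
    return ET.tostring(root, encoding="unicode")


COMMIT = "<commit/>"
OPTIMIZE = "<optimize/>"


def reindex_messages(docs, batch_size=1000):
    """Yield update messages in order: batched adds, one commit, one optimize."""
    for i in range(0, len(docs), batch_size):
        yield add_message(docs[i:i + batch_size])
    yield COMMIT
    yield OPTIMIZE


messages = list(reindex_messages(
    [{"identifier": "a", "tags": ["x", "y"]}, {"identifier": "b"}],
    batch_size=1,
))
```

Committing once after many adds, rather than after each one, keeps indexing fast; optimizing is the expensive step and is best left to quiet periods.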
26. "UK crime: Betting, gaming and lotteries (year ending 5th April)"
-> Tokenizer -> Betting -> Analyzer (Porter stemmer) -> bet
Belgium, Unemployment rate by gender, Total (BE,T)
-> Tokenizer (whitespace) -> BE,T -> Tokenizer (character filter) -> bet
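The collision on this slide can be reproduced with a toy analysis chain. Note that toy_stem is a crude stand-in written for this illustration, not the Porter stemmer, and that the chain as a whole is an assumption about how such a pipeline might be configured.

```python
import re


def toy_stem(token):
    """Crude stand-in for a stemmer: strip '-ing', then undouble the last letter."""
    if token.endswith("ing") and len(token) > 5:
        token = token[:-3]
        if len(token) >= 2 and token[-1] == token[-2]:
            token = token[:-1]
    return token


def analyze(text):
    """Whitespace tokenizer, then character filter, lowercasing and stemming."""
    tokens = text.split()
    tokens = [re.sub(r"[^A-Za-z0-9]", "", t) for t in tokens]
    tokens = [t.lower() for t in tokens if t]
    return [toy_stem(t) for t in tokens]


crime = analyze("UK crime: Betting, gaming and lotteries (year ending 5th April)")
belgium = analyze("Belgium, Unemployment rate by gender, Total (BE,T)")
```

Both token lists contain "bet", so a search for betting would also match the Belgian unemployment series. Avoiding that kind of false match is the reason for tweaking field types, analyzers and tokenizers per field.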
27. In the small
Understand Solr schemas - build one for your data.
how do you want to query?
how do you want to show results?
In the large
Understand Solr architecture - build around your data-flow.
how/when do you want to read/write?
what shape/characteristics does your corpus have?