The document provides an introduction to Lucene, an open-source text search engine library written in Java. It discusses Lucene's history and architecture at a high level, how it parses query terms and fields, and supports modifiers and Boolean operators to connect terms. The summary also lists some common sub-projects built with Lucene like Solr.
3. Lucene简介
基于 Java 的全文信息检索工具包
Lucene不是一个完整的全文索引应擎 ,而是一个用Java
写的全文索引引擎工具包,它可以方便的嵌入到各种应用
中实现针对应用的全文索引/检索功能。
Lucene历史
Doug Cutting Created in 1999
g g
Donated to Apache in 2001
•Company Logo
www.nfschina.com
4. Powered by Lucene
Technorati
T h ti
Wikipedia
Internet Archive
LinkedIn
monster.com
monster com
Eclipse的帮助搜索
Jira
SouFun
IBM Omnifind Y! Edition
…
http://wiki.apache.org/lucene-java/PoweredBy
5. Lucene Sub-projects
Nutch
Web crawler with document parsing
Hadoop
Distributed data processor
p
Implements MapReduce
Solr
19. What Is Solr
Solr是一个基于Lucene的Java搜索引擎服务器。Solr 提供了层面搜
索、命中醒目显示并且支持多种输出格式(包括 XML/XSLT 和
JSON 格式) 它易于安装和配置 而且附带了 个基于 HTTP 的管
格式)。它易于安装和配置,而且附带了一个基于
理界面。
Solr历史
Yonik Seeley Developed at CNET
Donated to Apache in 2006
19
20. Solr 特点
Features
Servlet
Web Administration Interface
XML/HTTP, JSON Interfaces
Faceting
F ti
Schema to define types and fields
Hi hli hti
Highlighting
Caching
Index R li ti
I d Replication (Master / Slaves)
(M t Sl )
Pluggable
Java 5
21. Powered by Solr
Netflix
CNET
AOL:sports and music
AOL t d i
Shopper.com
Drupal module
GameSpot
Reddit
Instructables
http://news.com
…
http://wiki.apache.org/solr/PublicServers
31. 管 功能
Solr管理界面功能
Show Config, Schema, Distribution info
Sh C fi S h Di t ib ti i f
Query Interface
Statistics
Caches: lookups, hits, hitratio, inserts, evictions,
size
i
RequestHandlers: requests, errors
UpdateHandler: adds, deletes, commits, optimizes
adds deletes commits
IndexReader, open-time, index-version, numDocs,
maxDocs,
Analysis Debugger
Shows tokens after each Analyzer stage
Shows token matches for query vs index
31
33. Schema
Lucene has no notion of a schema
Sorting - string vs. numeric
Ranges - val:42 included in val:[1 TO 5] ?
Lucene QueryParser has date-range support, but must
guess.
Defines fields, their types, p p
yp properties
Defines unique key field, default search field,
Similarity implementation
33
38. Solr-Add操作
HTTP POST to http://localhost:8080/solr/update/
<add>
<doc>
<field name=quot;employeeIdquot;>05991</field>
<field name=quot;officequot;>Bridgewater</field>
<field name=quot;skillsquot;>Perl</field>
<field name=quot;skillsquot;>Java</field>
</doc>
[<doc> ... </doc>[<doc> ... </doc>]]
</add>
Documents or fields can have boosts attached
39. Solr-Update / Delete操作
Inserting a document with already present
uniqueKey will erase the original
Deleting
By uniqueKey field
<delete><id>05991</id></delete>
By query
<delete><query>name:Anthony</query></delete>
<Commit/>
<Optimize/>
41. Default Parameters
Query Arguments for HTTP GET/POST to /select
param default description
q The query
start 0 Offset into the list of matches
rows 10 Number f documents t return
N b of d t to t
fl * Stored fields to return
qt standard Query type; maps to query
handler
df (schema) Default field to search
41
47. Caching
IndexSearcher’s view of an index is fixed
Aggressive caching possible
Consistency for multi-query requests
filterCache – unordered set of document ids matching a query
resultCache – ordered subset of document ids matching a query
documentCache – the stored fields of documents
userCaches – application specific, custom query handlers
47
48. Warming for Speed
Lucene IndexReader warming
field norms, FieldCache, tii – the term index
Static Cache warming
Configurable static requests to warm new Searchers
g q
Smart Cache Warming (autowarming)
Using MRU items in the current cache to pre populate the
pre-populate
new cache
Warming in parallel with live requests
48
51. Configuring Relevancy
<fieldtype name=quot;textquot; class=quot;solr.TextFieldquot;>
<analyzer>
<tokenizer class=quot;solr.WhitespaceTokenizerFactoryquot;/>
<filter class=quot;solr.LowerCaseFilterFactoryquot;/>
f C /
<filter class=quot;solr.SynonymFilterFactoryquot;
synonyms=quot;synonyms.txt“/>
<filter class=quot;solr.StopFilterFactory“
class solr.StopFilterFactory
words=“stopwords.txt”/>
<filter class=quot;solr.EnglishPorterFilterFactoryquot;
protected=quot;protwords.txtquot;/>
</analyzer>
/ l
</fieldtype>
51
52. copyField
Copies one field to another at index time
Usecase: Analyze same field different ways
copy into a field with a different analyzer
boost exact-case, exact-punctuation matches
language translations, thesaurus, soundex
<field name=“title” type=“text”/>
<field name=“title exact” type=“text exact” stored=“false”/>
field name title_exact type text_exact stored false /
<copyField source=“title” dest=“title_exact”/>
Usecase: I d multiple fields into single searchable
U Index lti l fi ld i t i l h bl
field
52
53. High Availability •Dynamic
HTML
•Appservers Generation
•HTTP search
•Load Balancer
requests
•Solr Searchers
•Index Replication
Index
•admin queries
•updates
•DB
•updates •Updater
•admin terminal •Solr Master
53
54. Replication
•Master •Searcher
•solr/data/index •solr/data/index
•after
mv
•new segment
•Lucene index segments
Lucene
•1. hard links •2. hard links •4. mv dir
•after
•3. rsync rsync
•solr/data/snapshot-2006062950000 •solr/data/snapshot-2006062950000-WIP
54
55. Resources
WWW
http://wiki.apache.org/solr/
http://www.ibm.com/developerworks/cn/java/j-solr1/
http://www.ibm.com/developerworks/cn/java/j-solr2/
http://www.xml.com/pub/a/2006/08/09/solr indexing xml with lucene
http://www.xml.com/pub/a/2006/08/09/solr-indexing-xml-with-lucene-
andrest.html?page=1
http://lucene.apache.org/java/docs/queryparsersyntax.html
http://www.blogjava.net/RongHao/archive/2007/11/06/158621.html
Mailing Lists
solr-user-subscribe@lucene.apache.org
solr-dev-subscribe@lucene.apache.org
55