2. Why does search matter? Then: most of the data encountered was created for the web, and heavy use of a site's search function was considered a failure in navigation. Now: navigation is not always relevant, users have less patience to browse, and users are used to "navigation by search box".
3. What is Solr? An open source enterprise search platform based on the Apache Lucene project. REST-like HTTP/XML and JSON APIs. Powerful full-text search, hit highlighting, and faceted search. Database integration and rich document (e.g., Word, PDF) handling. Dynamic clustering, distributed search, and index replication. A loose schema to define types and fields. Written in Java 5, deployable as a WAR.
4. Public Websites using Solr A mature product powering search for public sites like Digg, CNET, Zappos, and Netflix. See here for more information: http://wiki.apache.org/solr/PublicServers
5. Architecture [Diagram: Solr architecture. An Admin Interface, HTTP Request Servlet, and Update Servlet sit in front of the request handlers (Standard, Disjunction Max, Custom), the XML Update Interface, and the XML Response Writer. The Solr Core contains the Update Handler, Caching, Config, Schema, Analysis, and Concurrency components on top of Lucene, plus Replication.]
6. Starting Solr We need to set these settings for Solr: solr.solr.home, the Solr home folder containing conf/solrconfig.xml, and solr.data.dir, the folder that contains the index folder. Alternatively, configure a JNDI lookup of java:comp/env/solr/home to point to the Solr directory (see the sketch below). For example, with Jetty: java -Dsolr.solr.home=./solr -Dsolr.data.dir=./solr/data -jar start.jar For other web servers, set these values as Java system properties.
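One common way to provide the JNDI lookup on Tomcat is a context fragment like the following; the paths are placeholders for your own installation:

    <!-- e.g. $CATALINA_HOME/conf/Catalina/localhost/solr.xml -->
    <Context docBase="/opt/solr/solr.war" debug="0" crossContext="true">
      <Environment name="solr/home" type="java.lang.String" value="/opt/solr/home" override="true"/>
    </Context>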
9. How Solr Sees the World An index is built of one or more Documents. A Document consists of one or more Fields. A Field consists of a name, content, and metadata telling Solr how to handle the content. You tell Solr what kind of data a field contains by specifying its field type.
10. Field Analysis Field analyzers are used both during ingestion, when a document is indexed, and at query time. An analyzer examines the text of fields and generates a token stream. Analyzers may be a single class, or they may be composed of a series of tokenizer and filter classes. Tokenizers break field data into lexical units, or tokens. Examples of analysis include setting all letters to lowercase, eliminating punctuation and accents, mapping words to their stems, and so on; with lowercasing, "ram", "Ram", and "RAM" would all match a query for "ram". A minimal field type with such an analyzer is sketched below.
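As a minimal sketch (the field type name text_general is illustrative), a schema.xml field type declares its analyzer as a tokenizer followed by filters:

    <fieldType name="text_general" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>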
11. Schema.xml The schema.xml file is located in ../solr/conf. The schema file starts with the <schema> tag. Solr supports one schema per deployment. The schema can be organized into three sections: types, fields, and other declarations (a skeleton is sketched below).
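A skeletal schema.xml showing the three sections might look like this; the type and field names are illustrative, reusing the url and title fields from the later slides:

    <schema name="example" version="1.1">
      <types>
        <fieldType name="string" class="solr.StrField"/>
        <fieldType name="text" class="solr.TextField"/>
      </types>
      <fields>
        <field name="url" type="string" indexed="true" stored="true"/>
        <field name="title" type="text" indexed="true" stored="true"/>
      </fields>
      <uniqueKey>url</uniqueKey>
      <defaultSearchField>title</defaultSearchField>
    </schema>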
13. Filter explanation StopFilterFactory: after tokenizing on whitespace, removes any common ("stop") words. WordDelimiterFilterFactory: handles special cases with dashes, case transitions, etc. LowerCaseFilterFactory: lowercases all terms. EnglishPorterFilterFactory: stems using the Porter stemming algorithm, e.g. "runs", "running", and "ran" are all reduced to the elemental root "run". RemoveDuplicatesTokenFilterFactory: removes any duplicate tokens. These filters are chained inside a field type, as sketched below.
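This chain roughly matches the "text" field type shipped with older Solr example schemas; the attribute values here are illustrative:

    <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>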
14. Field Attributes Indexed: indexed Fields are searchable and sortable. You can also run Solr's analysis process on indexed Fields, which can alter the content to improve or change results. Stored: the contents of a stored Field are saved in the index. This is useful for retrieving and highlighting the contents for display but is not necessary for the actual search. For example, many applications store pointers to the location of contents rather than the actual contents of a file (see the field declarations below).
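For instance, a field can be indexed but not stored, or stored but not indexed; the file_path field in this sketch is hypothetical:

    <field name="title" type="text" indexed="true" stored="true"/>        <!-- searchable and returned -->
    <field name="body" type="text" indexed="true" stored="false"/>        <!-- searchable, not returned -->
    <field name="file_path" type="string" indexed="false" stored="true"/> <!-- returned, not searchable -->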
15. Field Definitions Field attributes: name, type, indexed, stored, multiValued, omitNorms. Dynamic Fields, in the spirit of Lucene!
<dynamicField name="*_i" type="sint" indexed="true" stored="true"/>
<dynamicField name="*_s" type="string" indexed="true" stored="true"/>
<dynamicField name="*_t" type="text" indexed="true" stored="true"/>
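With these declarations, any document field whose name matches one of the patterns, for example a hypothetical weight_i field sent as <field name="weight_i">42</field>, is indexed and stored as an sint without an explicit <field> entry in schema.xml.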
16. Other declarations <uniqueKey>url</uniqueKey>: the url field is the unique identifier used to determine whether a document is being added or updated. defaultSearchField: the field Solr uses in queries when no field is prefixed to a query term. For example, q=title:Solr names a field explicitly; if you entered q=Solr instead, the default search field would apply.
17. Indexing data Using curl to interact with Solr: http://curl.haxx.se/download.html Here are the different data formats: Solr's native XML; CSV (Comma Separated Values); rich documents through Solr Cell; JSON; direct database and XML import through Solr's DataImportHandler.
18. Add / Update documents HTTP POST to add / update:
<add>
  <doc boost="2">
    <field name="article">05991</field>
    <field name="title">Apache Solr</field>
    <field name="subject">An intro...</field>
    <field name="category">search</field>
    <field name="category">lucene</field>
    <field name="body">Solr is a full...</field>
  </doc>
</add>
19. Delete documents Delete by Id <delete><id>05591</id></delete> Delete by Query (multiple documents) <delete><query>manufacturer:microsoft</query></delete>
20. Commit / Optimize <commit/> tells Solr that all changes made since the last commit should be made available for searching. <optimize/> does everything <commit/> does and additionally merges all index segments, restructuring Lucene's files to improve search performance. Optimization is generally good to do when indexing has completed. If there are frequent updates, you should schedule optimization for low-usage times. An index does not need to be optimized to work properly, and optimization can be a time-consuming process.
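Both commands can be sent over HTTP like any other update message; the port here assumes the default Jetty example setup:

    curl http://localhost:8983/solr/update -H "Content-type:text/xml; charset=utf-8" --data-binary "<commit/>"
    curl http://localhost:8983/solr/update -H "Content-type:text/xml; charset=utf-8" --data-binary "<optimize/>"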
21. Index XML documents Use the command line tool (post.jar) for POSTing raw XML to a Solr server. Options and their defaults: -Ddata=[files|args|stdin] (default: files), -Durl=http://localhost:8983/solr/update, -Dcommit=yes. Examples:
java -jar post.jar *.xml
java -Ddata=args -jar post.jar "<delete><id>42</id></delete>"
java -Ddata=stdin -jar post.jar
java -Dcommit=no -Ddata=args -jar post.jar "<delete><query>*:*</query></delete>"
22. Index CSV file using HTTP POST The curl command does this with --data-binary and a content-type header reflecting the data being sent. Example: using HTTP POST to send CSV data over the network to Solr's CSV update handler (the file name here is illustrative):
curl http://localhost:9090/solr/update/csv -H "Content-type:text/plain; charset=utf-8" --data-binary @books.csv
24. Index rich documents with Solr Cell Solr uses Apache Tika, a framework wrapping many different format parsers such as PDFBox, POI, and others. Examples (index HTML):
curl "http://localhost:9090/solr/update/extract?literal.id=doc1&commit=true" -F "myfile=@tutorial.html"
curl "http://localhost:9090/solr/update/extract?literal.id=doc1&uprefix=attr_&fmap.content=attr_content&commit=true" -F myfile=@tutorial.html
Capture <div> tags separately, and then map that field to a dynamic field named foo_t (index PDF):
curl "http://localhost:9090/solr/update/extract?literal.id=doc2&captureAttr=true&defaultField=text&fmap.div=foo_t&capture=div" -F tutorial=@tutorial.pdf
25. Updating a Solr Index with JSON The JSON request handler needs to be configured in solrconfig.xml <requestHandler name="/update/json" class="solr.JsonUpdateRequestHandler"/> Example: curl "http://localhost:8983/solr/update/json?commit=true" --data-binary @books.json -H "Content-type:application/json"
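The body of such a request can be a JSON array of documents. This body is illustrative, reusing the fields from the earlier XML example, and is not necessarily the content of books.json:

    [
      {"article": "05991", "title": "Apache Solr", "category": ["search", "lucene"]}
    ]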
30. Query response writers Query responses are written by the registered writer whose name matches the 'wt' request parameter. The writer registered as the default is used if 'wt' is not specified in the request. E.g.: http://localhost:8983/solr/select?q=title:monsters&wt=json&indent=true
31. Caching An IndexSearcher's view of an index is fixed, which makes aggressive caching possible and keeps results consistent across multi-query requests. filterCache – unordered set of document ids matching a query. resultCache – ordered subset of document ids matching a query. documentCache – the stored fields of documents. userCaches – application specific, for custom query handlers. These caches are configured in solrconfig.xml, as sketched below.
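The caches are declared in the <query> section of solrconfig.xml; the cache referred to above as resultCache appears there as queryResultCache, and the sizes below are illustrative:

    <query>
      <filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="128"/>
      <queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="128"/>
      <documentCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>
    </query>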
40. Distributed and replicated Solr architecture (cont.) At this time, applications must still handle the process of sending documents to individual shards for indexing. The size of index that a machine can hold depends on the machine's configuration (RAM, CPU, disk, and so on), the query and indexing volume, document size, and the search patterns. Typically the number of documents a single machine can hold is in the range of several million up to around 100 million.
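At query time, a distributed search can be issued by listing the participating shards in the shards request parameter; the host names here are hypothetical:

    http://localhost:8983/solr/select?shards=solr1:8983/solr,solr2:8983/solr&q=title:solr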
41. Advanced Functionality Index structured data-store data with the DataImportHandler (JDBC, HTTP, File, URL). Support for other programming languages (.NET, PHP, Ruby, Perl, Python, ...). Support for NoSQL databases like MongoDB and Cassandra?
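As a sketch of a DataImportHandler setup (the table, column, and connection details are hypothetical), the handler is registered in solrconfig.xml and reads its mapping from a data-config.xml:

    <requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
      <lst name="defaults"><str name="config">data-config.xml</str></lst>
    </requestHandler>

    <dataConfig>
      <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
                  url="jdbc:mysql://localhost/articles_db" user="db_user" password="db_pass"/>
      <document>
        <entity name="article" query="SELECT id, title, body FROM articles">
          <field column="id" name="article"/>
          <field column="title" name="title"/>
          <field column="body" name="body"/>
        </entity>
      </document>
    </dataConfig>

A full import can then be triggered with a request such as http://localhost:8983/solr/dataimport?command=full-import.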