2. Why does search matter? Then: most of the data encountered was created for the web, and heavy use of a site's search function was considered a failure in navigation. Now: navigation is not always relevant, users have less patience to browse, and users are used to "navigation by search box".
3. What is Solr? An open source enterprise search platform based on the Apache Lucene project. REST-like HTTP/XML and JSON APIs. Powerful full-text search, hit highlighting, and faceted search. Database integration and rich document (e.g., Word, PDF) handling. Dynamic clustering, distributed search, and index replication. A loose schema to define types and fields. Written in Java 5, deployable as a WAR.
4. Public Websites using Solr A mature product powering search for public sites like Digg, CNET, Zappos, and Netflix. See here for more information: http://wiki.apache.org/solr/PublicServers
5. Architecture [Diagram: Solr architecture. An Admin Interface, HTTP Request Servlet, and Update Servlet sit in front of the request handlers (Standard, Disjunction Max, Custom), the XML Update Interface, and the XML Response Writer. The Solr Core contains the Update Handler, Caching, Config, Schema, Analysis, and Concurrency components on top of Lucene, plus Replication.]
6. Starting Solr We need to set these settings for Solr: solr.solr.home, the Solr home folder containing conf/solrconfig.xml, and solr.data.dir, the folder that contains the index folder. Alternatively, configure a JNDI lookup of java:comp/env/solr/home to point to the Solr directory (see the sketch below). For example, with Jetty: java -Dsolr.solr.home=./solr -Dsolr.data.dir=./solr/data -jar start.jar For other web servers, set these values as Java system properties.
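One common way to provide the JNDI lookup on Tomcat is a context fragment like the following; the paths are placeholders for your own installation:

    <!-- e.g. $CATALINA_HOME/conf/Catalina/localhost/solr.xml -->
    <Context docBase="/opt/solr/solr.war" debug="0" crossContext="true">
      <Environment name="solr/home" type="java.lang.String" value="/opt/solr/home" override="true"/>
    </Context>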
9. How Solr Sees the World An index is built of one or more Documents. A Document consists of one or more Fields. A Field consists of a name, content, and metadata telling Solr how to handle the content. You tell Solr what kind of data a field contains by specifying its field type.
10. Field Analysis Field analyzers are used both during ingestion, when a document is indexed, and at query time. An analyzer examines the text of fields and generates a token stream. Analyzers may be a single class, or they may be composed of a series of tokenizer and filter classes. Tokenizers break field data into lexical units, or tokens. Examples of analysis include setting all letters to lowercase, eliminating punctuation and accents, mapping words to their stems, and so on; with lowercasing, "ram", "Ram", and "RAM" would all match a query for "ram". A minimal field type with such an analyzer is sketched below.
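As a minimal sketch (the field type name text_general is illustrative), a schema.xml field type declares its analyzer as a tokenizer followed by filters:

    <fieldType name="text_general" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>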
11. Schema.xml The schema.xml file is located in ../solr/conf. The schema file starts with the <schema> tag. Solr supports one schema per deployment. The schema can be organized into three sections: types, fields, and other declarations (a skeleton is sketched below).
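A skeletal schema.xml showing the three sections might look like this; the type and field names are illustrative, reusing the url and title fields from the later slides:

    <schema name="example" version="1.1">
      <types>
        <fieldType name="string" class="solr.StrField"/>
        <fieldType name="text" class="solr.TextField"/>
      </types>
      <fields>
        <field name="url" type="string" indexed="true" stored="true"/>
        <field name="title" type="text" indexed="true" stored="true"/>
      </fields>
      <uniqueKey>url</uniqueKey>
      <defaultSearchField>title</defaultSearchField>
    </schema>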
13. Filter explanation StopFilterFactory: after tokenizing on whitespace, removes any common ("stop") words. WordDelimiterFilterFactory: handles special cases with dashes, case transitions, etc. LowerCaseFilterFactory: lowercases all terms. EnglishPorterFilterFactory: stems using the Porter stemming algorithm, e.g. "runs", "running", and "ran" are all reduced to the elemental root "run". RemoveDuplicatesTokenFilterFactory: removes any duplicate tokens. These filters are chained inside a field type, as sketched below.
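This chain roughly matches the "text" field type shipped with older Solr example schemas; the attribute values here are illustrative:

    <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>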
14. Field Attributes Indexed: indexed Fields are searchable and sortable. You can also run Solr's analysis process on indexed Fields, which can alter the content to improve or change results. Stored: the contents of a stored Field are saved in the index. This is useful for retrieving and highlighting the contents for display but is not necessary for the actual search. For example, many applications store pointers to the location of contents rather than the actual contents of a file (see the field declarations below).
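For instance, a field can be indexed but not stored, or stored but not indexed; the file_path field in this sketch is hypothetical:

    <field name="title" type="text" indexed="true" stored="true"/>        <!-- searchable and returned -->
    <field name="body" type="text" indexed="true" stored="false"/>        <!-- searchable, not returned -->
    <field name="file_path" type="string" indexed="false" stored="true"/> <!-- returned, not searchable -->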
15. Field Definitions Field attributes: name, type, indexed, stored, multiValued, omitNorms. Dynamic Fields, in the spirit of Lucene!
<dynamicField name="*_i" type="sint" indexed="true" stored="true"/>
<dynamicField name="*_s" type="string" indexed="true" stored="true"/>
<dynamicField name="*_t" type="text" indexed="true" stored="true"/>
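With these declarations, any document field whose name matches one of the patterns, for example a hypothetical weight_i field sent as <field name="weight_i">42</field>, is indexed and stored as an sint without an explicit <field> entry in schema.xml.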
16. Other declarations <uniqueKey>url</uniqueKey>: the url field is the unique identifier used to determine whether a document is being added or updated. defaultSearchField: the field Solr uses in queries when no field is prefixed to a query term. For example, q=title:Solr names a field explicitly; if you entered q=Solr instead, the default search field would apply.
17. Indexing data Using curl to interact with Solr: http://curl.haxx.se/download.html Here are the different data formats: Solr's native XML; CSV (Comma Separated Values); rich documents through Solr Cell; JSON; direct database and XML import through Solr's DataImportHandler.
18. Add / Update documents HTTP POST to add / update:
<add>
  <doc boost="2">
    <field name="article">05991</field>
    <field name="title">Apache Solr</field>
    <field name="subject">An intro...</field>
    <field name="category">search</field>
    <field name="category">lucene</field>
    <field name="body">Solr is a full...</field>
  </doc>
</add>
19. Delete documents Delete by Id <delete><id>05591</id></delete> Delete by Query (multiple documents) <delete><query>manufacturer:microsoft</query></delete>
20. Commit / Optimize <commit/> tells Solr that all changes made since the last commit should be made available for searching. <optimize/> does everything <commit/> does and additionally merges all index segments, restructuring Lucene's files to improve search performance. Optimization is generally good to do when indexing has completed. If there are frequent updates, you should schedule optimization for low-usage times. An index does not need to be optimized to work properly, and optimization can be a time-consuming process.
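Both commands can be sent over HTTP like any other update message; the port here assumes the default Jetty example setup:

    curl http://localhost:8983/solr/update -H "Content-type:text/xml; charset=utf-8" --data-binary "<commit/>"
    curl http://localhost:8983/solr/update -H "Content-type:text/xml; charset=utf-8" --data-binary "<optimize/>"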
21. Index XML documents Use the command line tool (post.jar) for POSTing raw XML to a Solr server. Options and their defaults: -Ddata=[files|args|stdin] (default: files), -Durl=http://localhost:8983/solr/update, -Dcommit=yes. Examples:
java -jar post.jar *.xml
java -Ddata=args -jar post.jar "<delete><id>42</id></delete>"
java -Ddata=stdin -jar post.jar
java -Dcommit=no -Ddata=args -jar post.jar "<delete><query>*:*</query></delete>"
22. Index CSV file using HTTP POST The curl command does this with --data-binary and a content-type header reflecting the data being sent. Example: using HTTP POST to send CSV data over the network to Solr's CSV update handler (the file name here is illustrative):
curl http://localhost:9090/solr/update/csv -H "Content-type:text/plain; charset=utf-8" --data-binary @books.csv
24. Index rich documents with Solr Cell Solr uses Apache Tika, a framework wrapping many different format parsers such as PDFBox, POI, and others. Examples (index HTML):
curl "http://localhost:9090/solr/update/extract?literal.id=doc1&commit=true" -F "myfile=@tutorial.html"
curl "http://localhost:9090/solr/update/extract?literal.id=doc1&uprefix=attr_&fmap.content=attr_content&commit=true" -F myfile=@tutorial.html
Capture <div> tags separately, and then map that field to a dynamic field named foo_t (index PDF):
curl "http://localhost:9090/solr/update/extract?literal.id=doc2&captureAttr=true&defaultField=text&fmap.div=foo_t&capture=div" -F tutorial=@tutorial.pdf
25. Updating a Solr Index with JSON The JSON request handler needs to be configured in solrconfig.xml <requestHandler name="/update/json" class="solr.JsonUpdateRequestHandler"/> Example: curl "http://localhost:8983/solr/update/json?commit=true" --data-binary @books.json -H "Content-type:application/json"
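The body of such a request can be a JSON array of documents. This body is illustrative, reusing the fields from the earlier XML example, and is not necessarily the content of books.json:

    [
      {"article": "05991", "title": "Apache Solr", "category": ["search", "lucene"]}
    ]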
30. Query response writers Query responses are written by the registered writer whose name matches the 'wt' request parameter. The writer registered as the default is used if 'wt' is not specified in the request. E.g.: http://localhost:8983/solr/select?q=title:monsters&wt=json&indent=true
31. Caching An IndexSearcher's view of an index is fixed, which makes aggressive caching possible and keeps results consistent across multi-query requests. filterCache – unordered set of document ids matching a query. resultCache – ordered subset of document ids matching a query. documentCache – the stored fields of documents. userCaches – application specific, for custom query handlers. These caches are configured in solrconfig.xml, as sketched below.
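The caches are declared in the <query> section of solrconfig.xml; the cache referred to above as resultCache appears there as queryResultCache, and the sizes below are illustrative:

    <query>
      <filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="128"/>
      <queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="128"/>
      <documentCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>
    </query>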
40. Distributed and replicated Solr architecture (cont.) At this time, applications must still handle the process of sending documents to individual shards for indexing. The size of index that a machine can hold depends on the machine's configuration (RAM, CPU, disk, and so on), the query and indexing volume, document size, and the search patterns. Typically the number of documents a single machine can hold is in the range of several million up to around 100 million.
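At query time, a distributed search can be issued by listing the participating shards in the shards request parameter; the host names here are hypothetical:

    http://localhost:8983/solr/select?shards=solr1:8983/solr,solr2:8983/solr&q=title:solr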
41. Advanced Functionality Index structured data-store data with the DataImportHandler (JDBC, HTTP, File, URL). Support for other programming languages (.NET, PHP, Ruby, Perl, Python, ...). Support for NoSQL databases like MongoDB and Cassandra?
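As a sketch of a DataImportHandler setup (the table, column, and connection details are hypothetical), the handler is registered in solrconfig.xml and reads its mapping from a data-config.xml:

    <requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
      <lst name="defaults"><str name="config">data-config.xml</str></lst>
    </requestHandler>

    <dataConfig>
      <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
                  url="jdbc:mysql://localhost/articles_db" user="db_user" password="db_pass"/>
      <document>
        <entity name="article" query="SELECT id, title, body FROM articles">
          <field column="id" name="article"/>
          <field column="title" name="title"/>
          <field column="body" name="body"/>
        </entity>
      </document>
    </dataConfig>

A full import can then be triggered with a request such as http://localhost:8983/solr/dataimport?command=full-import.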