Enterprise search with Solr Minh Tran
Why does search matter?
  Then:
    Most of the data encountered was created for the web.
    Heavy use of a site's search function was considered a failure in navigation.
  Now:
    Navigation is not always relevant.
    Users have less patience to browse.
    Users are used to “navigation by search box”.
Confidential
What is Solr?
  Open source enterprise search platform based on the Apache Lucene project.
  REST-like HTTP/XML and JSON APIs.
  Powerful full-text search, hit highlighting, and faceted search.
  Database integration and rich document (e.g., Word, PDF) handling.
  Dynamic clustering, distributed search, and index replication.
  Loose schema to define types and fields.
  Written in Java 5, deployable as a WAR.
Public Websites using Solr
  Mature product powering search for public sites like Digg, CNet, Zappos, and Netflix.
  See here for more information: http://wiki.apache.org/solr/PublicServers
Architecture (diagram). Components shown: Admin Interface; HTTP Request Servlet; Update Servlet; Standard Request Handler; Disjunction Max Request Handler; Custom Request Handler; XML Update Interface; XML Response Writer; Solr Core; Update Handler; Caching; Config; Schema; Analysis; Concurrency; Lucene; Replication.
Starting Solr
  Solr needs these settings:
    solr.solr.home: the Solr home folder, which contains conf/solrconfig.xml
    solr.data.dir: the folder that contains the index folder
  Alternatively, configure a JNDI lookup of java:comp/env/solr/home to point to the Solr directory.
  For example, with Jetty:
    java -Dsolr.solr.home=./solr -Dsolr.data.dir=./solr/data -jar start.jar
  On other web servers, set these values as Java properties.
Web Admin Interface
How Solr Sees the World
  An index is built of one or more Documents.
  A Document consists of one or more Fields.
  A Field consists of a name, content, and metadata telling Solr how to handle the content.
  You can tell Solr what kind of data a field contains by specifying its field type.
Field Analysis
  Field analyzers are used both during ingestion, when a document is indexed, and at query time.
  An analyzer examines the text of fields and generates a token stream.
  Analyzers may be a single class, or they may be composed of a series of tokenizer and filter classes.
  Tokenizers break field data into lexical units, or tokens.
  Examples of analysis steps: setting all letters to lowercase, eliminating punctuation and accents, mapping words to their stems, and so on.
  Example: “ram”, “Ram” and “RAM” would all match a query for “ram”.
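The analysis chain described above can be modelled in a few lines of Python. This is only a toy sketch: real Solr analyzers are Java classes configured in schema.xml, and the function names here (tokenize, lowercase, strip_accents, analyze) are made up for illustration.

```python
import re
import unicodedata


def tokenize(text):
    """Tokenizer: break field data into word tokens, dropping punctuation."""
    return re.findall(r"\w+", text)


def lowercase(tokens):
    """Filter: set all letters to lowercase."""
    return [t.lower() for t in tokens]


def strip_accents(tokens):
    """Filter: remove accents by dropping Unicode combining marks."""
    out = []
    for t in tokens:
        decomposed = unicodedata.normalize("NFD", t)
        out.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return out


def analyze(text, filters=(lowercase, strip_accents)):
    """Run the tokenizer, then each filter in order, and return the tokens."""
    tokens = tokenize(text)
    for f in filters:
        tokens = f(tokens)
    return tokens


# The same chain runs at index time and at query time, so the terms line up.
indexed_terms = set(analyze("Ram, RAM and caf\u00e9"))
query_terms = analyze("ram")
matches = all(term in indexed_terms for term in query_terms)
```

Because both the indexed text and the query pass through the same filters, the three spellings of "ram" collapse to one term.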
schema.xml
  The schema.xml file is located in ../solr/conf.
  The schema file starts with the <schema> tag.
  Solr supports one schema per deployment.
  The schema can be organized into three sections:
    Types
    Fields
    Other declarations
Example for TextField type
Filter explanation
  StopFilterFactory: after the text is tokenized on whitespace, removes any common words.
  WordDelimiterFilterFactory: handles special cases with dashes, case transitions, etc.
  LowerCaseFilterFactory: lowercases all terms.
  EnglishPorterFilterFactory: stems using the Porter stemming algorithm, e.g. “runs” and “running” are reduced to the elemental root “run” (the algorithm strips suffixes, so an irregular form such as “ran” is left unchanged).
  RemoveDuplicatesTokenFilterFactory: removes any duplicate tokens.
Field Attributes
  Indexed: indexed Fields are searchable and sortable. You can also run Solr's analysis process on indexed Fields, which can alter the content to improve or change results.
  Stored: the contents of a stored Field are saved in the index. This is useful for retrieving and highlighting the contents for display but is not necessary for the actual search. For example, many applications store pointers to the location of contents rather than the actual contents of a file.
Field Definitions
  Field attributes: name, type, indexed, stored, multiValued, omitNorms
  Dynamic Fields, in the spirit of Lucene!
    <dynamicField name="*_i" type="sint" indexed="true" stored="true"/>
    <dynamicField name="*_s" type="string" indexed="true" stored="true"/>
    <dynamicField name="*_t" type="text" indexed="true" stored="true"/>
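To make the idea of dynamic fields concrete, the following sketch resolves a field name against the three patterns above using Python's glob-style matching. It is an illustration only, not Solr's actual resolution logic; the explicit field names are hypothetical, and the patterns are simply tried in declaration order.

```python
from fnmatch import fnmatchcase

# Explicitly declared fields (hypothetical names, for illustration only).
EXPLICIT_FIELDS = {"id": "string", "title": "text"}

# Dynamic field patterns and types, copied from the declarations above.
DYNAMIC_FIELDS = [("*_i", "sint"), ("*_s", "string"), ("*_t", "text")]


def resolve_field_type(name):
    """Return the field type for a field name, or None if nothing matches."""
    # Explicit declarations win over dynamic patterns in this sketch.
    if name in EXPLICIT_FIELDS:
        return EXPLICIT_FIELDS[name]
    # Otherwise try each dynamic pattern in declaration order.
    for pattern, field_type in DYNAMIC_FIELDS:
        if fnmatchcase(name, pattern):
            return field_type
    return None
```

With these declarations, a document can carry a field such as price_i or summary_t without that exact name appearing in the schema.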
Other declarations
  <uniqueKey>url</uniqueKey>: the url field is the unique identifier; it is used to determine whether a document is being added or updated.
  defaultSearchField: the Field Solr uses in queries when no field is prefixed to a query term. For example, q=title:Solr searches the title field; if you entered q=Solr instead, the default search field would apply.
Indexing data
  Use curl to interact with Solr: http://curl.haxx.se/download.html
  Supported data formats:
    Solr's native XML
    CSV (Character Separated Value)
    Rich documents through Solr Cell
    JSON format
    Direct database and XML import through Solr's DataImportHandler
Add / Update documents
  HTTP POST to add / update:
    <add>
      <doc boost="2">
        <field name="article">05991</field>
        <field name="title">Apache Solr</field>
        <field name="subject">An intro...</field>
        <field name="category">search</field>
        <field name="category">lucene</field>
        <field name="body">Solr is a full...</field>
      </doc>
    </add>
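An add message like the one above can be generated rather than written by hand, which takes care of XML escaping. The sketch below uses only Python's standard library; the helper name build_add_message is made up for illustration, and sending the result to Solr (for example with curl, as shown elsewhere in these slides) is left out.

```python
import xml.etree.ElementTree as ET


def build_add_message(fields, boost=None):
    """Build an <add><doc>...</doc></add> message from (name, value) pairs.

    A field name may appear more than once, which is how a multi-valued
    field such as "category" is expressed.
    """
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    if boost is not None:
        doc.set("boost", str(boost))
    for name, value in fields:
        field = ET.SubElement(doc, "field", name=name)
        field.text = str(value)  # special characters are escaped on output
    return ET.tostring(add, encoding="unicode")


message = build_add_message(
    [
        ("article", "05991"),
        ("title", "Apache Solr"),
        ("category", "search"),
        ("category", "lucene"),
    ],
    boost=2,
)
```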
Delete documents
  Delete by id:
    <delete><id>05591</id></delete>
  Delete by query (multiple documents):
    <delete><query>manufacturer:microsoft</query></delete>
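The two delete forms can be produced by small helpers that escape the id or query before embedding it in XML. This is a minimal sketch; the function names are hypothetical, and posting the message to the update URL is not shown.

```python
from xml.sax.saxutils import escape


def delete_by_id(doc_id):
    """Return a delete message for a single document id."""
    return "<delete><id>%s</id></delete>" % escape(str(doc_id))


def delete_by_query(query):
    """Return a delete message for every document matching the query."""
    return "<delete><query>%s</query></delete>" % escape(query)
```

Escaping matters for queries that contain characters such as & or <, which would otherwise produce malformed XML.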
Commit / Optimize
  <commit/> tells Solr that all changes made since the last commit should be made available for searching.
  <optimize/> does the same as commit and also merges all index segments, restructuring Lucene's files to improve search performance.
  Optimization is generally good to do when indexing has completed.
  If there are frequent updates, schedule optimization for low-usage times.
  An index does not need to be optimized to work properly.
  Optimization can be a time-consuming process.
Index XML documents
  Use the post.jar command line tool for POSTing raw XML to Solr.
  Other options:
    -Ddata=[files|args|stdin]
    -Durl=http://localhost:8983/solr/update
    -Dcommit=yes
  (Default values were shown in red on the original slide: data=files, the URL above, and commit=yes.)
  Examples:
    java -jar post.jar *.xml
    java -Ddata=args -jar post.jar "<delete><id>42</id></delete>"
    java -Ddata=stdin -jar post.jar
    java -Dcommit=no -Ddata=args -jar post.jar "<delete><query>*:*</query></delete>"
Index CSV file using HTTP POST
  curl does this with --data-binary and an appropriate Content-type header. In the example below, the header reflects that the posted file is XML; a CSV file is sent the same way to the /update/csv handler with a text/plain content type.
  Example: using HTTP POST to send the data over the network to the Solr server:
    curl http://localhost:9090/solr/update -H "Content-type:text/xml;charset=utf-8" --data-binary @ipod_other.xml
Index CSV using remote streaming
  Having Solr read a local CSV file directly can be more efficient than sending it over the network via HTTP.
  Remote streaming must be enabled for this method to work. In solrconfig.xml, change enableRemoteStreaming from "false" to "true" in this element:
    <requestParsers enableRemoteStreaming="false" multipartUploadLimitInKB="2048"/>
  Examples:
    java -Ddata=args -Durl=http://localhost:9090/solr/update -jar post.jar "<commit/>"
    curl http://localhost:9090/solr/update/csv -F "stream.file=d:/Study/Solr/apache-solr-1.4.1/example/exampledocs/books.csv" -F "commit=true" -F "optimize=true" -F "stream.contentType=text/plain;charset=utf-8"
    curl "http://localhost:9090/solr/update/csv?overwrite=false&stream.file=d:/Study/Solr/apache-solr-1.4.1/example/exampledocs/books.csv&commit=true&optimize=true"
Index rich documents with Solr Cell
  Solr uses Apache Tika, a framework for wrapping many different format parsers such as PDFBox, POI, and others.
  Examples:
    curl "http://localhost:9090/solr/update/extract?literal.id=doc1&commit=true" -F "myfile=@tutorial.html"
  Index HTML:
    curl "http://localhost:9090/solr/update/extract?literal.id=doc1&uprefix=attr_&fmap.content=attr_content&commit=true" -F myfile=@tutorial.html
  Capture <div> tags separately, and then map that field to a dynamic field named foo_t (index PDF):
    curl "http://localhost:9090/solr/update/extract?literal.id=doc2&captureAttr=true&defaultField=text&fmap.div=foo_t&capture=div" -F tutorial=@tutorial.pdf
Updating a Solr Index with JSON
  The JSON request handler needs to be configured in solrconfig.xml:
    <requestHandler name="/update/json" class="solr.JsonUpdateRequestHandler"/>
  Example:
    curl "http://localhost:8983/solr/update/json?commit=true" --data-binary @books.json -H "Content-type:application/json"
Searching
  Spellcheck
  Editorial results replacement
  Scaling index size with distributed search
Default Query Syntax
  Lucene Query Syntax [; sort specification]
    mission impossible; releaseDate desc
    +mission +impossible -actor:cruise
    "mission impossible" -actor:cruise
    title:spiderman^10 description:spiderman
    description:"spiderman movie"~10
    +HDTV +weight:[0 TO 100]
  Wildcard queries: te?t, te*t, test*
Default Parameters
  Query arguments for HTTP GET/POST to /select
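As a rough illustration of passing query arguments to /select, the sketch below assembles a request URL from the parameters that appear in these slides (q, start, rows, fl, wt). The helper name and the choice of defaults are assumptions made for this example; only URL construction is shown, not the HTTP call.

```python
from urllib.parse import urlencode


def build_select_url(base_url, q, start=0, rows=10, fl=None, wt=None):
    """Return a /select URL with properly encoded query arguments."""
    params = [("q", q), ("start", start), ("rows", rows)]
    if fl:
        # fl is a comma-separated list of fields to return.
        params.append(("fl", ",".join(fl)))
    if wt:
        # wt names the response writer, e.g. "json".
        params.append(("wt", wt))
    return base_url.rstrip("/") + "/select?" + urlencode(params)


url = build_select_url(
    "http://localhost:8983/solr", "video", rows=2, fl=["name", "price"]
)
```

Encoding the parameters this way keeps queries with spaces, quotes, or a leading + intact when they travel in a URL.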
Search Results
  http://localhost:8983/solr/select?q=video&start=0&rows=2&fl=name,price
    <response>
      <responseHeader>
        <status>0</status>
        <QTime>1</QTime>
      </responseHeader>
      <result numFound="16173" start="0">
        <doc>
          <str name="name">Apple 60 GB iPod with Video</str>
          <float name="price">399.0</float>
        </doc>
        <doc>
          <str name="name">ASUS Extreme N7800GTX/2DHTV</str>
          <float name="price">479.95</float>
        </doc>
      </result>
    </response>
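A client usually turns this XML into native data structures. The following sketch parses the sample response above with Python's standard library; the function name parse_results is made up, and multi-valued fields (arr elements) are deliberately not handled to keep the example short.

```python
import xml.etree.ElementTree as ET

SAMPLE_RESPONSE = """
<response>
  <responseHeader><status>0</status><QTime>1</QTime></responseHeader>
  <result numFound="16173" start="0">
    <doc>
      <str name="name">Apple 60 GB iPod with Video</str>
      <float name="price">399.0</float>
    </doc>
    <doc>
      <str name="name">ASUS Extreme N7800GTX/2DHTV</str>
      <float name="price">479.95</float>
    </doc>
  </result>
</response>
"""


def parse_results(xml_text):
    """Return (numFound, list of field dicts) from an XML search response."""
    root = ET.fromstring(xml_text)
    result = root.find("result")
    docs = []
    for doc in result.findall("doc"):
        fields = {}
        for child in doc:
            value = child.text
            # The element tag carries the value type.
            if child.tag in ("float", "double"):
                value = float(value)
            elif child.tag in ("int", "long"):
                value = int(value)
            fields[child.get("name")] = value
        docs.append(fields)
    return int(result.get("numFound")), docs


num_found, docs = parse_results(SAMPLE_RESPONSE)
```

Note that numFound is the total number of matches, while the number of doc elements is limited by the rows parameter.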
Query response writers
  Query responses are written by the registered writer whose name matches the 'wt' request parameter.
  The writer registered as the default is used if 'wt' is not specified in the request.
  E.g.: http://localhost:8983/solr/select?q=title:monsters&wt=json&indent=true
Caching
  IndexSearcher's view of an index is fixed, which gives:
    Aggressive caching possible
    Consistency for multi-query requests
  filterCache: unordered set of document ids matching a query
  resultCache: ordered subset of document ids matching a query
  documentCache: the stored fields of documents
  userCaches: application specific, custom query handlers
Configuring Relevancy
  <fieldtype name="text" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"/>
      <filter class="solr.StopFilterFactory" words="stopwords.txt"/>
      <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    </analyzer>
  </fieldtype>
Faceted Browsing Example
Faceted Browsing (diagram)
  Search(Query, Filter[], Sort, offset, n), with the example inputs computer_type:PC, memory:[1GB TO *], computer, and price asc, produces the query response:
    DocList: section of ordered results
    DocSet: unordered set of all results
  Facet counts are obtained with intersectionSize() against the DocSet:
    proc_manu:Intel = 594
    proc_manu:AMD = 382
    price:[0 TO 500] = 247
    price:[500 TO 1000] = 689
    manu:Dell = 104
    manu:HP = 92
    manu:Lenovo = 75
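The counting step in the diagram amounts to intersecting the set of all matching documents with the set of documents for each facet value and taking the size. Below is a toy sketch using plain Python sets; the document ids and set contents are invented for illustration and have no relation to the counts in the diagram.

```python
def facet_counts(base_docset, facet_docsets):
    """Count, per facet value, the documents that are also in the base set."""
    return {
        label: len(base_docset & docs)
        for label, docs in facet_docsets.items()
    }


# Unordered set of all results for the main query and filters (made-up ids).
base = {1, 2, 3, 4, 5, 6}

# Documents matching each facet value across the whole index (made-up ids).
facets = {
    "manu:Dell": {1, 2, 9},
    "manu:HP": {3, 10},
    "manu:Lenovo": {11},
}

counts = facet_counts(base, facets)
```

Because the per-value sets do not depend on the main query, they are natural candidates for caching and reuse across requests.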
Index optimization
High Availability (diagram). Elements shown: Dynamic HTML Generation; Appservers; HTTP search requests; Load Balancer; Solr Searchers; Index Replication; admin queries; DB; Updater; updates; admin terminal; Solr Master.
Distributed and replicated Solr architecture
Index by using SolrJ
Query with SolrJ
Distributed and replicated Solr architecture (cont.)
  At this time, applications must still handle the process of sending the documents to individual shards for indexing.
  The size of an index that a machine can hold depends on the machine's configuration (RAM, CPU, disk, and so on), the query and indexing volume, document size, and the search patterns.
  Typically, the number of documents a single machine can hold ranges from several million up to around 100 million.
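Since the application has to decide which shard receives each document, one common approach is to hash the unique key and take it modulo the number of shards. The sketch below shows that idea only; it is not something Solr provides, the shard URLs are hypothetical, and the scheme would need rethinking if the number of shards changes.

```python
import hashlib

# Hypothetical shard update URLs, for illustration only.
SHARDS = [
    "http://host1:8983/solr/update",
    "http://host2:8983/solr/update",
    "http://host3:8983/solr/update",
]


def shard_for(doc_id, shards):
    """Pick a shard for a document id using a stable hash of the id."""
    digest = hashlib.md5(doc_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(shards)
    return shards[index]


def group_by_shard(doc_ids, shards):
    """Group document ids into one batch per shard."""
    batches = {shard: [] for shard in shards}
    for doc_id in doc_ids:
        batches[shard_for(doc_id, shards)].append(doc_id)
    return batches


batches = group_by_shard(["doc-%d" % i for i in range(100)], SHARDS)
```

A stable hash means the same document id always lands on the same shard, so a later update or delete for that id reaches the shard that holds it.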
Advanced Functionality
  Structured data store data with the Data Import Handler (JDBC, HTTP, File, URL)
  Support for other programming languages (.NET, PHP, Ruby, Perl, Python, ...)
  Support for NoSQL databases like MongoDB and Cassandra?
Other open source search servers
  Sphinx
  Elasticsearch
Resources
http://wiki.apache.org/solr/ExtractingRequestHandler
http://lucene.apache.org/tika/
http://wiki.apache.org/solr/

Apache Solr

  • 1. Enterprise search with Solr Minh Tran
  • 2. Why does search matter? Then: Most of the data encountered created for the web Heavy use of a site ‘s search function considered a failure in navigation Now: Navigation not always relevant Less patience to browse Users are used to “navigation by search box” Confidential 2
  • 3. What is SOLR Open source enterprise search platform based on Apache Lucene project. REST-like HTTP/XML and JSONAPIs Powerful full-text search, hit highlighting, faceted search Database integration, and rich document (e.g., Word, PDF) handling Dynamic clustering, distributed search and index replication Loose Schema to define types and fields Written in Java5, deployable as a WAR Confidential 3
  • 4. Public Websites using Solr Mature product powering search for public sites like Digg, CNet, Zappos, and Netflix See here for more information: http://wiki.apache.org/solr/PublicServers Confidential 4
  • 5. Architecture 5 Admin Interface HTTP Request Servlet Update Servlet Standard Request Handler Disjunction Max Request Handler Custom Request Handler XML Update Interface XML Response Writer Solr Core Update Handler Caching Config Schema Analysis Concurrency Lucene Replication Confidential
  • 6. Starting Solr We need to set these settings for SOLR: solr.solr.home: SOLR home folder contains conf/solrconfig.xml solr.data.dir: folder contains index folder Or configure a JNDI lookup of java:comp/env/solr/home to point to the solr directory. For e.g: java -Dsolr.solr.home=./solr -Dsolr.data.dir=./solr/data -jar start.jar (Jetty) Other web server, set these values by setting Java properties Confidential 6
  • 7. Web Admin Interface Confidential 7
  • 9. How Solr Sees the World An index is built of one or more Documents A Document consists of one or more Fields Documents are composed of fields A Field consists of a name, content, and metadata telling Solr how to handle the content. You can tell Solr about the kind of data a field contains by specifying its field type Confidential 9
  • 10. Field Analysis Field analyzers are used both during ingestion, when a document is indexed, and at query time An analyzer examines the text of fields and generates a token stream. Analyzers may be a single class or they may be composed of a series of tokenizer and filter classes. Tokenizersbreak field data into lexical units, or tokens Example: Setting all letters to lowercase Eliminating punctuation and accents, mapping words to their stems, and so on “ram”, “Ram” and “RAM” would all match a query for “ram” Confidential 10
  • 11. Schema.xml schema.xml file located in ../solr/conf schema file starts with <schema> tag Solr supports one schema per deployment The schema can be organized into three sections: Types Fields Other declarations 11
  • 12. Example for TextField type Confidential 12
  • 13. Filter explanation StopFilterFactory: Tokenize on whitespace, then removed any common words WordDelimiterFilterFactory: Handle special cases with dashes, case transitions, etc. LowerCaseFilterFactory: lowercase all terms. EnglishPorterFilterFactory: Stem using the Porter Stemming algorithm. E.g: “runs, running, ran”  its elemental root "run" RemoveDuplicatesTokenFilterFactory: Remove any duplicates: Confidential 13
  • 14. Field Attributes Indexed: Indexed Fields are searchable and sortable. You also can run Solr 's analysis process on indexed Fields, which can alter the content to improve or change results. Stored: The contents of a stored Field are saved in the index. This is useful for retrieving and highlighting the contents for display but is not necessary for the actual search. For example, many applications store pointers to the location of contents rather than the actual contents of a file. Confidential 14
  • 15. Field Definitions
      • Field attributes: name, type, indexed, stored, multiValued, omitNorms
      • Dynamic Fields, in the spirit of Lucene!
          <dynamicField name="*_i" type="sint" indexed="true" stored="true"/>
          <dynamicField name="*_s" type="string" indexed="true" stored="true"/>
          <dynamicField name="*_t" type="text" indexed="true" stored="true"/>
  • 16. Other declarations
      • <uniqueKey>url</uniqueKey>: the url field is the unique identifier; Solr uses it to determine whether a document is being added or updated.
      • defaultSearchField: the Field Solr uses in queries when no field is prefixed to a query term. For example, q=title:Solr searches the title field; if you entered q=Solr instead, the default search field would apply.
  • 17. Indexing data
      • Using curl to interact with Solr: http://curl.haxx.se/download.html
      • Here are the different data formats:
          • Solr's native XML
          • CSV (Character Separated Value)
          • Rich documents through SolrCell
          • JSON format
          • Direct Database and XML Import through Solr's DataImportHandler
  • 18. Add / Update documents
      • HTTP POST to add / update:
          <add>
            <doc boost="2">
              <field name="article">05991</field>
              <field name="title">Apache Solr</field>
              <field name="subject">An intro...</field>
              <field name="category">search</field>
              <field name="category">lucene</field>
              <field name="body">Solr is a full...</field>
            </doc>
          </add>
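As a rough illustration of how a client might assemble this add message, the sketch below builds the XML with Python's standard library rather than by string concatenation, so that special characters are escaped. The helper name `build_add_xml` is hypothetical; the field names and values are taken from the slide. The resulting string would then be sent with an HTTP POST to the update URL, as in the curl examples elsewhere in this deck.

```python
import xml.etree.ElementTree as ET


def build_add_xml(fields, boost=None):
    """Build a Solr <add> message for one document.

    fields is a list of (name, value) pairs; a name may repeat,
    which is how a multi-valued field such as "category" is expressed.
    """
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    if boost is not None:
        doc.set("boost", str(boost))
    for name, value in fields:
        field = ET.SubElement(doc, "field", name=name)
        field.text = str(value)
    # encoding="unicode" returns a str without an XML declaration.
    return ET.tostring(add, encoding="unicode")


payload = build_add_xml(
    [
        ("article", "05991"),
        ("title", "Apache Solr"),
        ("category", "search"),
        ("category", "lucene"),
    ],
    boost=2,
)
print(payload)
```

Passing the fields as a list of pairs rather than a dict is a deliberate choice here: it keeps repeated names, which a dict would silently collapse.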
  • 19. Delete documents
      • Delete by Id:
          <delete><id>05591</id></delete>
      • Delete by Query (multiple documents):
          <delete><query>manufacturer:microsoft</query></delete>
  • 20. Commit / Optimize
      • <commit/> tells Solr that all changes made since the last commit should be made available for searching.
      • <optimize/> does the same as commit and also merges all index segments, restructuring Lucene's files to improve search performance.
      • Optimization is generally good to do when indexing has completed.
      • If there are frequent updates, schedule optimization for low-usage times.
      • An index does not need to be optimized to work properly, and optimization can be a time-consuming process.
  • 21. Index XML documents
      • Use the post.jar command line tool for POSTing raw XML to a Solr server.
      • Other options (defaults: -Ddata=files, and the -Durl and -Dcommit values shown):
          -Ddata=[files|args|stdin]
          -Durl=http://localhost:8983/solr/update
          -Dcommit=yes
      • Examples:
          java -jar post.jar *.xml
          java -Ddata=args -jar post.jar "<delete><id>42</id></delete>"
          java -Ddata=stdin -jar post.jar
          java -Dcommit=no -Ddata=args -jar post.jar "<delete><query>*:*</query></delete>"
  • 22. Index CSV file using HTTP POST
      • The curl command does this with --data-binary and a content-type header that matches the data being sent. The example below posts an XML file, so the header reflects that it's XML.
      • Example: using HTTP POST to send the data over the network to the Solr server:
          curl http://localhost:9090/solr/update -H "Content-type:text/xml;charset=utf-8" --data-binary @ipod_other.xml
  • 24. Index rich documents with Solr Cell
      • Solr uses Apache Tika, a framework for wrapping many different format parsers like PDFBox, POI, and others.
      • Example:
          curl "http://localhost:9090/solr/update/extract?literal.id=doc1&commit=true" -F "myfile=@tutorial.html"
      • Index an HTML file:
          curl "http://localhost:9090/solr/update/extract?literal.id=doc1&uprefix=attr_&fmap.content=attr_content&commit=true" -F myfile=@tutorial.html
      • Capture <div> tags separately, and then map that field to a dynamic field named foo_t (indexing a PDF file):
          curl "http://localhost:9090/solr/update/extract?literal.id=doc2&captureAttr=true&defaultField=text&fmap.div=foo_t&capture=div" -F tutorial=@tutorial.pdf
  • 25. Updating a Solr Index with JSON
      • The JSON request handler needs to be configured in solrconfig.xml:
          <requestHandler name="/update/json" class="solr.JsonUpdateRequestHandler"/>
      • Example:
          curl "http://localhost:8983/solr/update/json?commit=true" --data-binary @books.json -H "Content-type:application/json"
  • 26. Searching
      • Spellcheck
      • Editorial results replacement
      • Scaling index size with distributed search
  • 27. Default Query Syntax
      • Lucene Query Syntax [; sort specification]
      • Examples:
          mission impossible; releaseDate desc
          +mission +impossible -actor:cruise
          "mission impossible" -actor:cruise
          title:spiderman^10 description:spiderman
          description:"spiderman movie"~10
          +HDTV +weight:[0 TO 100]
      • Wildcard queries: te?t, te*t, test*
  • 28. Default Parameters: Query Arguments for HTTP GET/POST to /select
  • 29. Search Results
      • Request: http://localhost:8983/solr/select?q=video&start=0&rows=2&fl=name,price
      • Response:
          <response>
            <responseHeader>
              <status>0</status>
              <QTime>1</QTime>
            </responseHeader>
            <result numFound="16173" start="0">
              <doc>
                <str name="name">Apple 60 GB iPod with Video</str>
                <float name="price">399.0</float>
              </doc>
              <doc>
                <str name="name">ASUS Extreme N7800GTX/2DHTV</str>
                <float name="price">479.95</float>
              </doc>
            </result>
          </response>
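The request and response on this slide can be handled from client code roughly as follows. This is a sketch using only Python's standard library, with no network call: `build_select_url` and `parse_results` are hypothetical helper names, and the sample response is the one shown on the slide. Note that `urlencode` percent-encodes the comma in the `fl` value, which is equivalent to the literal comma in the slide's URL.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET


def build_select_url(base_url, query, start=0, rows=10, fields=None):
    """Build a /select URL; urlencode escapes spaces, quotes, colons, etc."""
    params = [("q", query), ("start", start), ("rows", rows)]
    if fields:
        params.append(("fl", ",".join(fields)))
    return f"{base_url}/select?{urlencode(params)}"


def parse_results(response_xml):
    """Return (numFound, list of {field name: text}) from an XML response."""
    root = ET.fromstring(response_xml.strip())
    result = root.find("result")
    docs = [
        {field.get("name"): field.text for field in doc}
        for doc in result.findall("doc")
    ]
    return int(result.get("numFound")), docs


# The response body from the slide, used here in place of a live server.
SAMPLE_RESPONSE = """
<response>
  <responseHeader><status>0</status><QTime>1</QTime></responseHeader>
  <result numFound="16173" start="0">
    <doc>
      <str name="name">Apple 60 GB iPod with Video</str>
      <float name="price">399.0</float>
    </doc>
    <doc>
      <str name="name">ASUS Extreme N7800GTX/2DHTV</str>
      <float name="price">479.95</float>
    </doc>
  </result>
</response>
"""

url = build_select_url(
    "http://localhost:8983/solr", "video", start=0, rows=2, fields=["name", "price"]
)
num_found, docs = parse_results(SAMPLE_RESPONSE)
print(url)
print(num_found, docs)
```

All values come back as strings in this sketch; a fuller client would use the element tag (str, float, and so on) to convert them.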
  • 30. Query response writers
      • Query responses are written by the registered writer whose name matches the 'wt' request parameter.
      • The "default" writer is used if 'wt' is not specified in the request.
      • E.g.: http://localhost:8983/solr/select?q=title:monsters&wt=json&indent=true
  • 31. Caching
      • An IndexSearcher's view of an index is fixed, which makes aggressive caching possible and gives consistency for multi-query requests.
      • filterCache: unordered set of document ids matching a query
      • resultCache: ordered subset of document ids matching a query
      • documentCache: the stored fields of documents
      • userCaches: application specific, for custom query handlers
  • 32. Configuring Relevancy
          <fieldtype name="text" class="solr.TextField">
            <analyzer>
              <tokenizer class="solr.WhitespaceTokenizerFactory"/>
              <filter class="solr.LowerCaseFilterFactory"/>
              <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"/>
              <filter class="solr.StopFilterFactory" words="stopwords.txt"/>
              <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
            </analyzer>
          </fieldtype>
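  • 34. Faceted Browsing (diagram)
      • Search(Query, Filter[], Sort, offset, n) is called with the query computer, the filters computer_type:PC and memory:[1GB TO *], and the sort price asc.
      • It yields a DocList (a section of the ordered results) and a DocSet (the unordered set of all results).
      • Facet counts come from intersectionSize() of the DocSet with each facet query: proc_manu:Intel = 594, proc_manu:AMD = 382, price:[0 TO 500] = 247, price:[500 TO 1000] = 689, manu:Dell = 104, manu:HP = 92, manu:Lenovo = 75.
      • The DocList and the facet counts go into the query response.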
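The facet counts on this slide are sizes of set intersections: the set of all documents matching the query and filters is intersected with the set of documents matching each facet value. A minimal sketch with Python sets, using made-up document ids (the data and the name `facet_counts` are illustrative assumptions, not Solr's API):

```python
def facet_counts(base_doc_set, facet_doc_sets):
    """Count, for each facet value, how many base results also match it."""
    return {
        label: len(base_doc_set & matching_docs)
        for label, matching_docs in facet_doc_sets.items()
    }


# Toy document id sets, invented for illustration only.
query_hits = {1, 2, 3, 4, 5, 6}       # documents matching the query
pc_filter = {1, 2, 3, 4, 5, 7}        # documents matching a filter
base = query_hits & pc_filter         # unordered set of all results

facets = {
    "manu:Dell": {1, 2, 9},
    "manu:HP": {3},
    "manu:Lenovo": {4, 5, 6},
}

counts = facet_counts(base, facets)
print(counts)
```

Since the per-facet document sets do not depend on the user's query, they are the kind of data a filter cache can hold and reuse across requests.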
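  • 36. High Availability (diagram)
      • Appservers doing dynamic HTML generation send HTTP search requests through a Load Balancer to the Solr Searchers.
      • An Updater reads from the DB and sends updates to the Solr Master.
      • Index Replication copies the index from the Solr Master to the Solr Searchers.
      • An admin terminal issues admin queries and updates.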
  • 37. Distributed and replicated Solr architecture
  • 38. Index by using SolrJ
  • 39. Query with SolrJ
  • 40. Distributed and replicated Solr architecture (cont.)
      • At this time, applications must still handle the process of sending the documents to individual shards for indexing.
      • The size of an index that a machine can hold depends on the machine's configuration (RAM, CPU, disk, and so on), the query and indexing volume, document size, and the search patterns.
      • Typically, a single machine can hold from several million up to around 100 million documents.
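Since the application has to decide which shard each document goes to, one common approach is to hash the document's unique key and take the result modulo the number of shards. The sketch below shows that idea only; it is an assumption about how an application might do it, not something Solr provides. The shard URLs and helper names are hypothetical.

```python
import hashlib

# Hypothetical shard update endpoints, for illustration only.
SHARD_URLS = [
    "http://shard1:8983/solr",
    "http://shard2:8983/solr",
    "http://shard3:8983/solr",
]


def shard_for(doc_id, shard_urls):
    """Pick a shard deterministically from the document's unique key."""
    digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
    return shard_urls[int(digest, 16) % len(shard_urls)]


def group_by_shard(doc_ids, shard_urls):
    """Group document ids by target shard so each shard gets one batch."""
    batches = {url: [] for url in shard_urls}
    for doc_id in doc_ids:
        batches[shard_for(doc_id, shard_urls)].append(doc_id)
    return batches


batches = group_by_shard(["doc1", "doc2", "doc3", "doc4"], SHARD_URLS)
for url, ids in batches.items():
    print(url, ids)
```

A stable hash (rather than Python's built-in `hash`, which varies between runs for strings) means an update or delete for a given id always reaches the shard that holds it. Changing the number of shards changes the mapping, so that would call for reindexing.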
  • 41. Advanced Functionality
      • Structured data store
      • Data with the Data Import Handler (JDBC, HTTP, File, URL)
      • Support for other programming languages (.Net, PHP, Ruby, Perl, Python, …)
      • Support for NoSQL databases like MongoDB, Cassandra?
  • 42. Other open source servers
      • Sphinx
      • Elasticsearch
  • 47. Solr 1.4 Enterprise Search Server
  • 51. ApacheCon Europe 2006 - Yonik Seeley