Scaling search to a million pages
 with Solr, Django and Python

      Toby White
      toby@timetric.com
      @tow21
1,079,446!!!
Data store



             Django


                      Big Bad Web
Key-Value Store

     Filesystem
     Berkeley DB   }   unstructured

     MySQL         -    structured
Foreign Key (RDBMS)

    SQLite, MySQL, Postgres, Oracle, ...

    Related content through JOINs over structured data
Search Engines

    Solr (Lucene), Xapian, (Whoosh)

    Denormalized, inverted index over unstructured/semi-structured data
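
To make "inverted index" concrete, here is a minimal Python sketch. It is not how Lucene stores its index; the function names and sample documents are made up for illustration. Each term maps to the set of document ids containing it, so a query becomes a lookup rather than a scan.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased term to the set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, *terms):
    """Return ids of documents containing every term (AND semantics)."""
    postings = [index.get(term.lower(), set()) for term in terms]
    return set.intersection(*postings) if postings else set()


# Hypothetical sample corpus.
docs = {
    1: "Solr enterprise search server",
    2: "Django web framework",
    3: "Search with Solr and Django",
}
index = build_inverted_index(docs)
print(search(index, "solr", "django"))
```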
Other routes to full-text search

  http://www.postgresql.org/docs/8.4/static/textsearch.html
  http://code.google.com/p/djangosearch/



  http://www.sphinxsearch.com/
Solr: HTTP interface to Lucene
  Lucene written by Doug Cutting (HADOOP),
          first release 2001.
  Solr in-house CNET project, open-sourced in 2006

  Solr 1.4, Lucene 3.0 released November 2009

  Solr + Lucene merged in March 2010

  Next version - 1.5/3.1/4.0 - not for production use yet.
Solr            RDBMS
Index           Table
composed of     composed of
Documents       Rows

ALL DOCUMENTS HAVE THE SAME STRUCTURE
• Optional columns
• Denormalized data

   Book            Magazine                    Person
   Title           Title                       First name
   ISBN            ISSN                        Last name
   Author          Publication Frequency
   (FK Person)     Editor (FK Person)
                   Contributor (M2M Person)

   Document           Field options
   Entity type        required
   Title              required
   Identifier         uniqueKey
   Pub. Frequency
   Associated name    multiValued
   Default Search     multiValued, default

   copyField: Title, Associated Name -> Default Search
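
A rough sketch of that flattening in Python, assuming plain dicts stand in for the Book, Magazine and Person rows. All key names and sample values are hypothetical. In Solr itself the copyField step happens server-side from the schema; it is done by hand here only to show the effect.

```python
def full_name(person):
    return f'{person["first_name"]} {person["last_name"]}'


def book_to_doc(book, people):
    """Flatten a Book row and its author (FK Person) into one document."""
    names = [full_name(people[book["author_id"]])]
    return {
        "entity_type": "book",
        "title": book["title"],
        "identifier": book["isbn"],
        # No pub_frequency: books simply leave that optional field out.
        "associated_name": names,
        "default_search": [book["title"]] + names,  # imitates copyField
    }


def magazine_to_doc(magazine, people):
    """Flatten a Magazine row, its editor and its contributors."""
    ids = [magazine["editor_id"]] + list(magazine["contributor_ids"])
    names = [full_name(people[i]) for i in ids]
    return {
        "entity_type": "magazine",
        "title": magazine["title"],
        "identifier": magazine["issn"],
        "pub_frequency": magazine["pub_frequency"],
        "associated_name": names,
        "default_search": [magazine["title"]] + names,  # imitates copyField
    }


# Hypothetical sample rows.
people = {
    1: {"first_name": "Ann", "last_name": "Author"},
    2: {"first_name": "Ed", "last_name": "Editor"},
}
book = {"title": "A Book", "isbn": "isbn-0001", "author_id": 1}
magazine = {"title": "A Magazine", "issn": "issn-0001",
            "pub_frequency": "monthly", "editor_id": 2,
            "contributor_ids": [1]}

book_doc = book_to_doc(book, people)
magazine_doc = magazine_to_doc(magazine, people)
```

Both documents share one structure; a field that does not apply to an entity is just absent.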
There is no update, only overwrite!!!

              Book             Book

             Solar           Solr 1.4
          Enterprise        Enterprise
         Search Server     Search Server

            Identifier        Identifier

           Pub. Freq.        Pub. Freq.

          David Smiley,    David Smiley,
           Eric Pugh        Eric Pugh



    Solr can't overwrite without a uniqueKey
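
The overwrite rule can be imitated with a toy in-memory index. This is a sketch of the semantics only, not Solr code; `TinyIndex` and the identifier `book-1` are invented for illustration.

```python
class TinyIndex:
    """Toy stand-in for an index: documents are replaced whole, keyed by uniqueKey."""

    def __init__(self, unique_key):
        self.unique_key = unique_key
        self.docs = {}

    def add(self, doc):
        # A document with an existing key replaces the old one; nothing is merged.
        self.docs[doc[self.unique_key]] = dict(doc)


index = TinyIndex(unique_key="identifier")
index.add({
    "identifier": "book-1",
    "title": "Solar Enterprise Search Server",  # typo to be fixed
    "associated_name": ["David Smiley", "Eric Pugh"],
})

# Fixing the title means re-sending the complete document.
index.add({
    "identifier": "book-1",
    "title": "Solr 1.4 Enterprise Search Server",
    "associated_name": ["David Smiley", "Eric Pugh"],
})
corrected = dict(index.docs["book-1"])

# Sending only the changed field would drop everything else.
index.add({"identifier": "book-1", "title": "Solr 1.4 Enterprise Search Server"})
partial = dict(index.docs["book-1"])
```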
Schema design
<field name="title"
       type="text"
       indexed="true"
       stored="true"
       required="true"
       multiValued="false"
/>

  Field types: text, int, long, float, double, date
                             query

  What do you want to search on?

  What do you want to do with results?
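
Those two questions map roughly onto the `indexed` and `stored` attributes. The sketch below reads them from a schema fragment with Python's standard library. The second field, `default_search`, is a hypothetical name, and treating a missing attribute as false is a simplification made here, not Solr's rule.

```python
import xml.etree.ElementTree as ET

SCHEMA_FRAGMENT = """
<fields>
  <field name="title" type="text" indexed="true" stored="true"
         required="true" multiValued="false"/>
  <field name="default_search" type="text" indexed="true" stored="false"
         multiValued="true"/>
</fields>
"""


def describe_fields(xml_text):
    """Return {field name: (searchable, returnable)} from indexed/stored."""
    result = {}
    for field in ET.fromstring(xml_text).iter("field"):
        searchable = field.get("indexed", "false") == "true"  # can be queried
        returnable = field.get("stored", "false") == "true"   # comes back in results
        result[field.get("name")] = (searchable, returnable)
    return result


field_summary = describe_fields(SCHEMA_FRAGMENT)
print(field_summary)
```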
   Ingest (HTTP):  <xml>, csv
   Output (HTTP):  <xml>, {json}, exec. python

   Query: URL-escaped Lucene query syntax (yuck)
GET http://localhost:8983/solr/select/?q=searchterm

GET http://localhost:8983/solr/current/select/?
      fq=private%3Afalse
      &rows=20
      &facet.field=tags
      &f.tags.facet.limit=20
      &f.tags.facet.mincount=1
      &facet=true
      &start=0
      &q=%28tags%3A%22ons%3Adataseries-fullid%3DYBUKQA%22+AND+tags%3A%22united+kingdom%22+AND+NOT+is_mapreduce%3Atrue%29+OR+%28%28tags%3A%22ons%3Adataseries-fullid%3DYBUKQA%22+AND+tags%3A%22united+kingdom%22+AND+is_index%3Atrue%5E100%29
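
To see what hides behind the escaping, the standard library can decode such a URL. The example below uses a shortened version of the query above (first clause only), chosen here for readability.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

url = (
    "http://localhost:8983/solr/current/select/?"
    "fq=private%3Afalse&rows=20&facet.field=tags&f.tags.facet.limit=20"
    "&f.tags.facet.mincount=1&facet=true&start=0"
    "&q=%28tags%3A%22ons%3Adataseries-fullid%3DYBUKQA%22"
    "+AND+tags%3A%22united+kingdom%22%29"
)

# parse_qs undoes both the %XX escapes and the '+' for space.
params = parse_qs(urlsplit(url).query)
lucene_query = params["q"][0]
print(lucene_query)

# Going the other way: urlencode produces the escaped form again.
escaped = urlencode({"q": lucene_query})
```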
Need ORM equivalent (OIM?)

Sunburnt:
http://timetric.com/about/opensource/#sunburnt
http://github.com/tow/sunburnt




http://haystacksearch.org/
(cleaves close to Django, not schema-driven)
GET http://localhost:8983/solr/current/select/?
      fq=private%3Afalse
      &rows=20
      &facet.field=tags
      &f.tags.facet.limit=20
      &f.tags.facet.mincount=1
      &facet=true
      &start=0
      &q=%28tags%3A%22ons%3Adataseries-fullid%3DYBUKQA%22+AND+tags%3A%22united+kingdom%22%29+OR+%28%28tags%3A%22ons%3Adataseries-fullid%3DYBUKQA%22+AND+tags%3A%22united+kingdom%22+AND+is_index%3Atrue%5E100%29



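   # `solr` is assumed to be a sunburnt SolrInterface for this index.
   # The outer parentheses let the chained calls span several lines.
   (solr.query(tags="ons:dataseries-fullid=YBUKQA")
        .query(tags="united kingdom")
        .filter(private=False)
        .boost_relevancy(100, is_index=True)
        .facet_by("tags", mincount=1, limit=20)
        .paginate(rows=20))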
Faceting
MoreLikeThis
Highlighting
Pagination
Sorting
   http://wiki.apache.org/solr/FrontPage


         http://packtpub.com/solr-1-4-enterprise-search-server
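
As a rough illustration of how these features surface as request parameters, the sketch below builds a select URL by hand. The helper `select_url` and the field names (`tags`, `title`) are hypothetical. The parameter names are standard Solr ones, but whether highlighting or MoreLikeThis respond on a given handler depends on the server configuration.

```python
from urllib.parse import urlencode


def select_url(base, q, page=1, per_page=20, facet_field=None,
               highlight_field=None, sort=None, mlt_field=None):
    """Build a Solr select URL; pagination is expressed as start/rows."""
    params = [("q", q),
              ("start", (page - 1) * per_page),
              ("rows", per_page)]
    if facet_field:
        params += [("facet", "true"), ("facet.field", facet_field)]
    if highlight_field:
        params += [("hl", "true"), ("hl.fl", highlight_field)]
    if sort:
        params.append(("sort", sort))
    if mlt_field:
        params += [("mlt", "true"), ("mlt.fl", mlt_field)]
    return base + "?" + urlencode(params)


url = select_url("http://localhost:8983/solr/select/", "united kingdom",
                 page=3, facet_field="tags", highlight_field="title",
                 sort="title asc", mlt_field="tags")
print(url)
```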
Scaling to a million pages ...
                 - talk to the Guardian (Content API)




   Decouple read/write
   Re-indexing/optimizing strategies
   FieldType/Analyzer/Tokenizer tweaks
Decouple read/write

  Separate processes - many readers, single write pipeline.
                 Beware multiple writers!



 Remember standard DB practice -
  write to master, read from slave.
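
One way to get a single write pipeline in Python is to funnel every document through a queue that exactly one writer drains. This is a sketch under assumptions: `send_batch` is a hypothetical callable that would post a batch to Solr, and here it only records what it was given.

```python
import queue
import threading


def writer_loop(inbox, send_batch, batch_size=100):
    """Drain documents from inbox and pass them on in batches.

    Run this in exactly one thread or process, so there is one writer.
    """
    batch = []
    while True:
        doc = inbox.get()
        if doc is None:  # sentinel: flush what is left and stop
            break
        batch.append(doc)
        if len(batch) >= batch_size:
            send_batch(batch)
            batch = []
    if batch:
        send_batch(batch)


sent = []                      # stands in for the search server
inbox = queue.Queue()
writer = threading.Thread(target=writer_loop, args=(inbox, sent.append, 3))
writer.start()

# Any number of producers may put documents on the queue.
for i in range(7):
    inbox.put({"id": i})
inbox.put(None)
writer.join()
```

Readers never touch the queue; they query the index directly.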
[Diagram: a cycle of Index states joined by the steps "Add documents", "Commit", "Warm up facet cache" and "Optimize"; one step is labelled "Fast"]
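
The steps in the diagram correspond to Solr's XML update messages. The sketch below only builds the messages and posts nothing. The ordering (adds, then commit, then optimize) and the warm-up query are assumptions made for illustration, and `tags` is a hypothetical field.

```python
import xml.etree.ElementTree as ET


def add_message(docs):
    """Build an <add> update message for a batch of documents."""
    add = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            # Multi-valued fields become one <field> element per value.
            values = value if isinstance(value, (list, tuple)) else [value]
            for item in values:
                field = ET.SubElement(doc_el, "field", name=name)
                field.text = str(item)
    return ET.tostring(add, encoding="unicode")


COMMIT = "<commit/>"
OPTIMIZE = "<optimize/>"
# Hypothetical facet query to issue after a commit so the cache is warm.
WARM_UP_QUERY = "q=*:*&rows=0&facet=true&facet.field=tags"


def index_cycle(batches):
    """Yield the update messages for one cycle, in the assumed order."""
    for batch in batches:
        yield add_message(batch)
    yield COMMIT
    yield OPTIMIZE


messages = list(index_cycle([[{"id": 1, "title": "Solr"}]]))
```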
"UK crime: Betting, gaming and lotteries (year ending 5th April)"
    -> Tokenizer -> Betting -> Analyzer (Porter stemmer) -> bet

"Belgium, Unemployment rate by gender, Total (BE,T)"
    -> Tokenizer (whitespace) -> BE,T -> Tokenizer (character filter) -> bet
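
The collision can be reproduced with a toy analysis chain in Python. This is a sketch only: `toy_stem` is a crude stand-in written for this example, not the Porter stemmer, and the chain is not Solr's.

```python
import re


def toy_stem(word):
    """Crude stand-in for a stemmer: strip 'ing', undouble the last letter."""
    if word.endswith("ing") and len(word) > 5:
        word = word[:-3]
        if len(word) > 2 and word[-1] == word[-2]:
            word = word[:-1]
    return word


def analyze(text, strip_punctuation=True):
    """Whitespace tokenizer, optional character filter, lowercase, stem."""
    terms = []
    for token in text.split():
        if strip_punctuation:
            token = re.sub(r"[^A-Za-z0-9]", "", token)
        token = token.lower()
        if token:
            terms.append(toy_stem(token))
    return terms


crime = "UK crime: Betting, gaming and lotteries (year ending 5th April)"
belgium = "Belgium, Unemployment rate by gender, Total (BE,T)"

# With the character filter, both titles produce the term "bet".
clash = "bet" in analyze(crime) and "bet" in analyze(belgium)
# Keeping punctuation leaves the series code intact, so it no longer matches.
no_clash = "bet" not in analyze(belgium, strip_punctuation=False)
```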
In the small
        Understand Solr schemas - build one for your data.
                             how do you want to query?
                        how do you want to show results?




In the large
  Understand Solr architecture - build around your data-flow.
                      how/when do you want to read/write?
           what shape/characteristics does your corpus have?
Thanks for listening!

       questions welcome ...

               toby@timetric.com
                        @tow21
