Searching The United States Code with Solr/Lucene
Paul Nelson / Ronald Matamoros, Search Technologies
pnelson@searchtechnologies.com, rmatamoros@searchtechnologies.com
5/25/2011
Searching the United States Code
§  Who are we:
  •  Paul Nelson, Chief Architect
  •  Ronald Matamoros, Lead Engineer
§  Our Mission: Replace Personal Librarian Search
  •  A 20-Year-Old Search Engine!
§  Key Challenges
  •  How to index this massive, complex, 85-year-old
     document?
  •  How to replicate 20-Year-Old search features?
§  Government Documents are Fun!

Search Technologies
§  The largest independent provider of enterprise
    search expertise and services
§  80 full-time dedicated search engine experts
§  200+ customers
§  Technology Neutral
   •  (yeah, we know Sphinx too)
§  Offices All Over
   •  DC, NY, CA, MD,
      OH, UK, CR…


A Quick Civics Lesson…
§  The United States Code
  •  The general & permanent laws of the U.S.
     Government – All in one place
  •  51 titles
     §  Agriculture, Armed Forces, Conservation, The President,
         Food and Drugs, Postal Service, Public Health…
  •  First Version: 1926
§  The Office of the Law Revision Counsel (OLRC)
  •  20 lawyers who author the U.S. Code
  •  They report to the Speaker of the House of
     Representatives
§  Bonus Question: Which Title is the largest?
Major Challenges
1.  Document Parsing
  •  A 50 Volume Table Of Contents!


2.  Query Parsing
  •  Custom Features (exact case, exact suffix,
     proximity, query templates, lemmatization, lots
     of fields…)


3.  Searching & Highlighting Fields
  •  Some fields are embedded in the document
  •  These fields must be highlighted in context

Part The First:
Document Processing



Document Processing / Indexing

USC Title → Parse & Granularize → Embed Refs → Construct XHTML → Store (Repository) → Xform & Index → Solr

Field Type 1: Extracted to Index
Extracted fields: Page Numbers, Heading, Title, Source Credit
<!-- documentid:14_1 usckey:140000000000100000000000000000000 currentthrough:20080108
documentPDFPage:3 -->
<!-- itempath:/140/PART I/CHAPTER 1/Sec. 1 -->
<!-- itemsortkey:140AAAD -->
<!-- expcite:TITLE 14-COAST GUARD!@!PART I-REGULAR COAST GUARD!@!CHAPTER 1-
ESTABLISHMENT AND DUTIES!@!Sec. 1 -->
<!-- field-start:head --><h3 class="section-head">&sect;1. Establishment of Coast Guard</h3>
<!-- field-end:head -->
<!-- field-start:statute -->
<p class="statutory-body">The Coast Guard as established January 28, 1915, shall be a military …
<!-- field-end:statute -->
<!-- field-start:sourcecredit -->
<p class="source-credit">(Aug. 4, 1949, ch. 393, 63 Stat. 496; Pub. L. 94&ndash;546, &sect;1(1),…
<!-- field-end:sourcecredit -->
<!-- field-start:notes -->
<!-- field-start:historicalandrevision-note -->
<h4 class="note-head">Historical and Revision Notes</h4>
<p class="note-body">Based on title 14, U.S.C., 1946 ed., &sect;1 (Jan. 28, 1915, ch. 20, &sect;1…
<!-- field-end:historicalandrevision-note -->
<!-- field-start:amendment-note -->
<h4 class="note-head">Amendments</h4>
<p class="note-body">2002&mdash;Pub. L. 107&ndash;296 substituted &ldquo;Department of …
<!-- field-end:amendment-note -->
<!-- field-start:effectivedate-amendment-note -->
<h4 class="note-head">Effective Date of 2002 Amendment</h4>
<p class="note-body">Amendment by Pub. L. 107&ndash;296 effective on the date of transfer of …
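
The header comments in the sample above carry key:value pairs that can be lifted straight into index fields. The slides do not show how the extraction is done; the following is a minimal Python sketch, assuming keys are single words followed by a colon and values run until the next key. The names `extract_metadata`, `COMMENT`, `PAIR`, and the abbreviated `SAMPLE` are hypothetical.

```python
import re

# Abbreviated copy of the header comments from the sample document.
SAMPLE = """<!-- documentid:14_1 usckey:140000000000100000000000000000000 currentthrough:20080108
documentPDFPage:3 -->
<!-- itempath:/140/PART I/CHAPTER 1/Sec. 1 -->
<!-- itemsortkey:140AAAD -->
<!-- field-start:head --><h3 class="section-head">&sect;1. Establishment of Coast Guard</h3>
<!-- field-end:head -->
"""

# Body of an XHTML comment, which may span several lines.
COMMENT = re.compile(r"<!--\s*(.*?)\s*-->", re.DOTALL)
# Assumption: a key is a word followed by ":"; its value runs until the
# next "key:" or the end of the comment, so values may contain spaces.
PAIR = re.compile(r"(\w+):(.*?)(?=\s+\w+:|\Z)", re.DOTALL)


def extract_metadata(xhtml):
    """Collect key:value pairs from header comments into a dict."""
    fields = {}
    for body in COMMENT.findall(xhtml):
        # field-start / field-end comments mark in-document fields and
        # are not metadata, so they are skipped here.
        if body.startswith(("field-start:", "field-end:")):
            continue
        for key, value in PAIR.findall(body):
            fields[key] = value.strip()
    return fields


if __name__ == "__main__":
    for key, value in extract_metadata(SAMPLE).items():
        print(key, "=", value)
```

Each resulting dict entry would map to one stored or indexed field; which fields are actually stored is not stated on the slide.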
Document Processing / Indexing

Title 14
   ch. 1    ch. 2    ch. 3    …
   pt. A    pt. B    pt. C    …
   sec. 1   sec. 2   sec. 3   …

Field Type 2: Embedded Refs
Embedded reference types: Statute at Large, Public Law, USC Refs, Other
<!-- documentid:14_1 usckey:140000000000100000000000000000000 currentthrough:20080108
documentPDFPage:3 -->
<!-- itempath:/140/PART I/CHAPTER 1/Sec. 1 -->
<!-- itemsortkey:140AAAD -->
<!-- expcite:TITLE 14-COAST GUARD!@!PART I-REGULAR COAST GUARD!@!CHAPTER 1-
ESTABLISHMENT AND DUTIES!@!Sec. 1 -->
<!-- field-start:head --><h3 class="section-head">&sect;1. Establishment of Coast Guard</h3>
<!-- field-end:head -->
<!-- field-start:statute -->
<p class="statutory-body">The Coast Guard as established January 28, 1915, shall be a military …
<!-- field-end:statute -->
<!-- field-start:sourcecredit -->
<p class="source-credit">(Aug. 4, 1949, ch. 393, 63 Stat. 496; Pub. L. 94&ndash;546, &sect;1(1),…
<!-- field-end:sourcecredit -->
<!-- field-start:notes -->
<!-- field-start:historicalandrevision-note -->
<h4 class="note-head">Historical and Revision Notes</h4>
<p class="note-body">Based on title 14, U.S.C., 1946 ed., &sect;1 (Jan. 28, 1915, ch. 20, &sect;1…
<!-- field-end:historicalandrevision-note -->
<!-- field-start:amendment-note -->
<h4 class="note-head">Amendments</h4>
<p class="note-body">2002&mdash;Pub. L. 107&ndash;296 substituted &ldquo;Department of …
<!-- field-end:amendment-note -->
<!-- field-start:effectivedate-amendment-note -->
<h4 class="note-head">Effective Date of 2002 Amendment</h4>
<p class="note-body">Amendment by Pub. L. 107&ndash;296 effective on the date of transfer of …
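
How the pipeline recognizes references is not described on the slide. As an illustration only, two of the reference types (Statute at Large and Public Law) can be spotted with regular expressions once the XHTML entities are decoded. The patterns and the name `find_refs` are assumptions, and they cover only the citation shapes visible in the sample.

```python
import html
import re

# The source-credit line from the sample document.
SAMPLE = '(Aug. 4, 1949, ch. 393, 63 Stat. 496; Pub. L. 94&ndash;546, &sect;1(1),'

# Assumed citation shapes, taken from the sample text only:
#   "63 Stat. 496"    volume, "Stat.", page
#   "Pub. L. 94–546"  Congress number, en dash, law number
STAT = re.compile(r"\b(\d+) Stat\. (\d+)")
PUB_L = re.compile(r"\bPub\. L\. (\d+)\u2013(\d+)")


def find_refs(xhtml_fragment):
    """Return (kind, text) pairs for citations, in document order."""
    # Decode entities such as &ndash; so the patterns see real characters.
    text = html.unescape(xhtml_fragment)
    found = [(m.start(), "statute-at-large", m.group(0)) for m in STAT.finditer(text)]
    found += [(m.start(), "public-law", m.group(0)) for m in PUB_L.finditer(text)]
    return [(kind, ref) for _, kind, ref in sorted(found)]


if __name__ == "__main__":
    for kind, ref in find_refs(SAMPLE):
        print(kind, ref)
```

USC references and the "Other" category would need their own patterns, which the slide does not specify.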
Document Processing / Indexing

Repository layout:
§  /US-Code
   §  /2010
      §  /title2
         §  /USC-title2-section1532.htm
         §  /USC-title2-node3-rule5.htm

Part The Second:
Token Processing



Token Processing 1
xhtml tag tokenizer

Input:
<!-- field-start:amendment-note -->
<h4 class="note-head">Amendments</h4>
<p class="note-body">2002&mdash;Pub. L. 107&ndash;296 substituted &ldquo;Department of …
<!-- field-end:amendment-note -->

Output tokens:
<!-- field-start:amendment-note -->
<h4 class="note-head">
Amendments
</h4>
<p class="note-body">
2002
Pub
L
107
296
Substituted
Department
of
<!-- field-end:amendment-note -->

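
The tokenizer's implementation is not shown in the slides. A minimal Python sketch of the same idea: emit every comment and tag as one token, and split the text between them into words. The name `tokenize` and the rule that a word is a run of ASCII letters or digits are assumptions.

```python
import html
import re

SAMPLE = (
    '<!-- field-start:amendment-note -->\n'
    '<h4 class="note-head">Amendments</h4>\n'
    '<p class="note-body">2002&mdash;Pub. L. 107&ndash;296 substituted &ldquo;Department of …\n'
    '<!-- field-end:amendment-note -->\n'
)

# Comments are tried before tags so "<!-- ... -->" stays one token.
MARKUP = re.compile(r"(<!--.*?-->|<[^>]+>)", re.DOTALL)
# Assumption: a word token is a run of ASCII letters or digits.
WORD = re.compile(r"[A-Za-z0-9]+")


def tokenize(xhtml):
    """Split XHTML into markup tokens and word tokens, in document order."""
    tokens = []
    # re.split with a capturing group keeps the markup pieces in the result.
    for piece in MARKUP.split(xhtml):
        if piece.startswith("<"):
            tokens.append(piece)
        else:
            # Decode entities first so "&mdash;" acts as a separator
            # instead of producing the word "mdash".
            tokens.extend(WORD.findall(html.unescape(piece)))
    return tokens


if __name__ == "__main__":
    for token in tokenize(SAMPLE):
        print(token)
```

Keeping the markup as tokens, rather than discarding it here, is what lets later stages turn the field comments into start and end markers.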
Field Type 3: Marked Within Doc
<!-- documentid:14_1 usckey:140000000000100000000000000000000 currentthrough:20080108
documentPDFPage:3 -->
<!-- itempath:/140/PART I/CHAPTER 1/Sec. 1 -->
<!-- itemsortkey:140AAAD -->
<!-- expcite:TITLE 14-COAST GUARD!@!PART I-REGULAR COAST GUARD!@!CHAPTER 1-
ESTABLISHMENT AND DUTIES!@!Sec. 1 -->
<!-- field-start:head --><h3 class="section-head">&sect;1. Establishment of Coast Guard</h3>
<!-- field-end:head -->
<!-- field-start:statute -->
<p class="statutory-body">The Coast Guard as established January 28, 1915, shall be a military …
<!-- field-end:statute -->
<!-- field-start:sourcecredit -->
<p class="source-credit">(Aug. 4, 1949, ch. 393, 63 Stat. 496; Pub. L. 94&ndash;546, &sect;1(1),…
<!-- field-end:sourcecredit -->
<!-- field-start:notes -->
<!-- field-start:historicalandrevision-note -->
<h4 class="note-head">Historical and Revision Notes</h4>
<p class="note-body">Based on title 14, U.S.C., 1946 ed., &sect;1 (Jan. 28, 1915, ch. 20, &sect;1…
<!-- field-end:historicalandrevision-note -->
<!-- field-start:amendment-note -->
<h4 class="note-head">Amendments</h4>
<p class="note-body">2002&mdash;Pub. L. 107&ndash;296 substituted &ldquo;Department of …
<!-- field-end:amendment-note -->
<!-- field-start:effectivedate-amendment-note -->
<h4 class="note-head">Effective Date of 2002 Amendment</h4>
<p class="note-body">Amendment by Pub. L. 107&ndash;296 effective on the date of transfer of …
Token Processing 2
Mark Start and End Tags
<!-- field-start:amendment-note -->   S/amendment
<h4 class="note-head">                <h4 class="note-head">
Amendments                            Amendments
</h4>                                 </h4>
<p class="note-body">                 <p class="note-body">
2002                                  2002
Pub                                   Pub
L                                     L
107                                   107
296                                   296
Substituted                           Substituted
Department                            Department
of                                    of
<!-- field-end:amendment-note -->     E/amendment


Token Processing 3
Remove XHTML Tags
S/amendment                  S/amendment
<h4 class="note-head">
Amendments                   Amendments
</h4>
<p class="note-body">
2002                         2002
Pub                          Pub
L                            L
107                          107
296                          296
Substituted                  Substituted
Department                   Department
of                           of
E/amendment                  E/amendment
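
Steps 2 and 3 can be sketched together in Python: field comments become S/ and E/ marker tokens, and every other markup token is dropped. This is an illustration, not the production filter. The slides show the short marker name `S/amendment` for the field `amendment-note`; the mapping between the two is not given, so this sketch keeps the full field name. `mark_and_strip` and `FIELD_MARK` are hypothetical names.

```python
import re

# Matches the field boundary comments used in the sample document.
FIELD_MARK = re.compile(r"<!--\s*field-(start|end):([\w-]+)\s*-->")


def mark_and_strip(tokens):
    """Turn field comments into S/ and E/ markers and drop other markup."""
    out = []
    for token in tokens:
        match = FIELD_MARK.fullmatch(token)
        if match:
            prefix = "S/" if match.group(1) == "start" else "E/"
            # Assumption: the marker reuses the field name unchanged.
            out.append(prefix + match.group(2))
        elif token.startswith("<"):
            continue  # ordinary XHTML tag or unrelated comment: removed
        else:
            out.append(token)
    return out


if __name__ == "__main__":
    tokens = [
        '<!-- field-start:amendment-note -->',
        '<h4 class="note-head">', 'Amendments', '</h4>',
        '<p class="note-body">', '2002', 'Pub', 'L',
        '<!-- field-end:amendment-note -->',
    ]
    print(mark_and_strip(tokens))
```

Because the markers are ordinary tokens, they end up in the index alongside the words and can be searched for later.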


Token Processing 4
Tag Original Case & Lower Case

S/amendment                S/amendment
Amendments                 O/Amendments    L/amendments
2002                       O/2002          L/2002
Pub                        O/Pub           L/pub
L                          O/L             L/l
107                        O/107           L/107
296                        O/296           L/296
Substituted                O/Substituted   L/substituted
Department                 O/Department    L/department
of                         O/of            L/of
E/amendment                E/amendment




Token Processing 5
 Lemmatize
         Uses a dictionary-based lemmatizer built on GCIDE and WordNet

S/amendment                         S/amendment
O/Amendments    L/amendments        O/Amendments    L/amendments    amendment
O/2002          L/2002              O/2002          L/2002          2002
O/Pub           L/pub               O/Pub           L/pub           pub
O/L             L/l                 O/L             L/l             l
O/107           L/107               O/107           L/107           107
O/296           L/296               O/296           L/296           296
O/Substituted   L/substituted       O/Substituted   L/substituted   substitute
O/Department    L/department        O/Department    L/department    department
O/of            L/of                O/of            L/of            of
E/amendment                         E/amendment
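
Steps 4 and 5 expand each word into three terms: original case (O/), lower case (L/), and the lemma. A minimal Python sketch, assuming a toy lookup table in place of the GCIDE/WordNet dictionary; `LEMMAS`, `expand`, and `expand_all` are hypothetical names, and whether the three terms share one index position is not stated on the slides.

```python
# Toy lemma dictionary standing in for the GCIDE/WordNet-based one.
LEMMAS = {"amendments": "amendment", "substituted": "substitute"}


def expand(token):
    """Return the terms produced for a single token."""
    if token.startswith(("S/", "E/")):
        return [token]  # field boundary markers pass through unchanged
    lower = token.lower()
    # Words missing from the dictionary fall back to their lowercase form.
    return ["O/" + token, "L/" + lower, LEMMAS.get(lower, lower)]


def expand_all(tokens):
    """Expand a token stream, keeping one list of terms per input token."""
    return [expand(token) for token in tokens]


if __name__ == "__main__":
    for terms in expand_all(["S/amendment", "Amendments", "Pub", "E/amendment"]):
        print(terms)
```

The three variants let a query choose its strictness: an exact-case search targets the O/ term, a case-insensitive literal search the L/ term, and a normal search the lemma.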




Part The Third:
Query Processing



Query Processing
(not all stages shown)

Query String → parse → mark exact: → mark phrases → lemmatize → query template → build lucene query → search


   §  Communicates via generic QNode Class
         •  Simpler to manipulate than Lucene operators
   §  Can produce FAST FQL as well
         •  (cue the derisive catcalls)
   §  But most importantly:
         •  It is a Query Processing Pipeline
            §  Mix and match query processing modules


Query Processing
exact:FOIA top secret amendment:RECORDS

Query String → parse → mark original → mark lowercase → lemmatize → query template → build lucene query → search

After parse:
and
   exact:
      |FOIA|
   phrase
      |top|  |secret|
   amendment:
      |RECORDS|

Query Processing
exact:FOIA top secret amendment:RECORDS

After mark original:
and
   O/FOIA
   phrase
      |top|  |secret|
   amendment:
      |RECORDS|

Query Processing
exact:FOIA top secret amendment:RECORDS

After mark lowercase:
and
   O/FOIA
   phrase
      |L/top|  |L/secret|
   amendment:
      |records|

Query Processing
exact:FOIA top secret amendment:RECORDS

After lemmatize:
and
   O/FOIA
   phrase
      |L/top|  |L/secret|
   amendment:
      |record|

Query Processing
exact:FOIA top secret amendment:RECORDS

After query template:
and
   O/FOIA
   phrase
      |L/top|  |L/secret|
   between
      S/amendment
      |record|
      E/amendment

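
The walkthrough above can be sketched as a small pipeline of tree rewrites in Python. This is not the authors' QNode class, whose API the slides do not show; the `QNode` fields, the stage functions, and the one-entry `LEMMAS` table are assumptions made for the example. Parsing is skipped and the tree is built by hand.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class QNode:
    op: str                                   # "and", "phrase", "term", "between", or "<field>:"
    children: List["QNode"] = field(default_factory=list)
    text: Optional[str] = None                # set only on "term" nodes


def term(text):
    return QNode("term", text=text)


def rewrite(node, stage):
    """Apply one stage bottom-up and return a new tree."""
    kids = [rewrite(child, stage) for child in node.children]
    return stage(QNode(node.op, kids, node.text))


LEMMAS = {"records": "record"}                # toy dictionary


def mark_original(n):
    # exact:X becomes a single original-case term O/X.
    return term("O/" + n.children[0].text) if n.op == "exact:" else n


def mark_lowercase(n):
    if n.op == "phrase":                      # literal words get the L/ prefix
        n.children = [term("L/" + c.text.lower()) for c in n.children]
    elif n.op.endswith(":"):                  # fielded words are lowercased for lemmatizing
        n.children = [term(c.text.lower()) for c in n.children]
    return n


def lemmatize(n):
    if n.op.endswith(":"):
        n.children = [term(LEMMAS.get(c.text, c.text)) for c in n.children]
    return n


def query_template(n):
    # field:X becomes between(S/field, X, E/field).
    if n.op.endswith(":"):
        name = n.op[:-1]
        return QNode("between", [term("S/" + name), *n.children, term("E/" + name)])
    return n


PIPELINE = [mark_original, mark_lowercase, lemmatize, query_template]


def run(tree, stages=PIPELINE):
    for stage in stages:
        tree = rewrite(tree, stage)
    return tree


def show(n):
    if n.op == "term":
        return n.text
    return n.op + "(" + ", ".join(show(c) for c in n.children) + ")"


# Hand-built result of parsing: exact:FOIA top secret amendment:RECORDS
parsed = QNode("and", [
    QNode("exact:", [term("FOIA")]),
    QNode("phrase", [term("top"), term("secret")]),
    QNode("amendment:", [term("RECORDS")]),
])

if __name__ == "__main__":
    print(show(run(parsed)))
```

Because every stage has the same tree-in, tree-out shape, stages can be reordered, dropped, or swapped per application, which is the mix-and-match property the pipeline slide describes.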
The between() Operator
§  between(start-tag, end-tag, pos-clause, neg-clause)

§  start-tag → Starting tag, e.g. S/amendment
§  end-tag → Ending tag, e.g. E/amendment

§  pos-clause → words which must occur between
    start and end
   •  Note: Requires a nested ScanAnd() operator
§  neg-clause → words which must not occur between
    start and end
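
The semantics of between() can be illustrated over a flat token list in Python. The real operator works on positions inside the Lucene index and its implementation is not shown in the slides; this sketch only mirrors the four parameters and assumes every pos-clause word must fall inside the same start/end span.

```python
def between(tokens, start_tag, end_tag, pos=(), neg=()):
    """True if one start_tag..end_tag span has every pos word and no neg word."""
    required, forbidden = set(pos), set(neg)
    span = None                       # words seen in the open span, or None if outside
    for token in tokens:
        if token == start_tag:
            span = set()              # open a fresh span
        elif token == end_tag and span is not None:
            if required <= span and not (forbidden & span):
                return True
            span = None               # close the span and keep scanning
        elif span is not None:
            span.add(token)
    return False


if __name__ == "__main__":
    doc = ["S/statute", "coast", "guard", "E/statute",
           "S/amendment", "substitute", "department", "E/amendment"]
    print(between(doc, "S/amendment", "E/amendment", pos=["substitute"]))   # inside the field
    print(between(doc, "S/amendment", "E/amendment", pos=["guard"]))        # only outside it
```

Checking spans one at a time is what keeps a match in one amendment note from being satisfied by words in a neighbouring field.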

Part The Fourth:
Hierarchical Navigation



Hierarchies: Requirements
§  Any number of levels
      §  Title, Sub-Title, Chapter, Sub-Chapter, Part, Sub-Part,
          Section
§  Levels vary across titles
      §  Title 1: 3 levels
      §  Title 26: 8 levels
§  Multiple views:
      §  Children
      §  Ancestors
      §  Ancestor's Siblings
§  Multiple search scopes:
      §  Only children, all descendants, everything

Hierarchies: Ancestor-Siblings
§  US-Code
  •  Title 1
  •  Title 2
     §  Chapter 1
     §  Chapter 2
         –  Part 1
         –  Part 2
              •  Section 2.1
              •  Section 2.2
         –  Part 3
         –  Part 4
     §  Chapter 3
     §  Chapter 4
  •  Title 3

Hierarchies: Fields
§  ancestors
   •  Searching
      §  USC USC-title2 USC-title2-chapter25 USC-title2-chapter25-subchapter2
§  encodedAncestors – for display only
   •  Where the node exists within the hierarchy
      §  id;heading;subjectTitle//id;heading;subjectTitle//...
      §  USC-title2-chapter25;Chapter 25;Unfunded Mandates Reform//USC-title2-chapter25-subchapter2;Subchapter II;Regulatory Accountability and Reform
§  parentId – ID of the parent node
      §  USC-title2-chapter25-subchapter2
§  treesort – Hierarchical sort field, e.g. 13/000/0/00882
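
The first three fields can all be derived from a node's ancestor path. A minimal Python sketch, assuming the path is available as (id, heading, subjectTitle) tuples ordered from the top down; `hierarchy_fields` is a hypothetical name, and the sample path holds only the two levels whose headings appear above.

```python
def hierarchy_fields(ancestor_path):
    """Build hierarchy fields from (id, heading, subjectTitle) tuples.

    The path is ordered top-down, so its last entry is the parent node.
    """
    ids = [node_id for node_id, _, _ in ancestor_path]
    return {
        # Space-separated ids, so each ancestor id is a searchable term.
        "ancestors": " ".join(ids),
        # Display-only breadcrumb: entries joined by "//", parts by ";".
        "encodedAncestors": "//".join(";".join(entry) for entry in ancestor_path),
        "parentId": ids[-1],
    }


# Truncated path: only the levels whose headings are given in the example.
PATH = [
    ("USC-title2-chapter25", "Chapter 25", "Unfunded Mandates Reform"),
    ("USC-title2-chapter25-subchapter2", "Subchapter II",
     "Regulatory Accountability and Reform"),
]

if __name__ == "__main__":
    for name, value in hierarchy_fields(PATH).items():
        print(name, "=", value)
```

Storing the breadcrumb pre-encoded means a result page can render the full path without any extra lookups against the index.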

Hierarchies: Tree Sort
§  Sorting In Print Order
   •  Front Matter → Titles → Tables → etc.
   •  Everything padded to fixed-length

                    01/011/1/02032
   01 = USC Title
   011 = Title 11
   1 = An Appendix
   02032 = Sequence # in file

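
A minimal Python sketch of building such a key. The widths (2/3/1/5 digits) are read off the example; the parameter names `group`, `title`, `is_appendix`, and `sequence` are an interpretation of the labels, not names from the source.

```python
def treesort_key(group, title, is_appendix, sequence):
    """Format a fixed-width sort key such as 01/011/1/02032."""
    # Zero padding makes plain string order agree with numeric order.
    return f"{group:02d}/{title:03d}/{int(is_appendix)}/{sequence:05d}"


if __name__ == "__main__":
    keys = [
        treesort_key(1, 11, True, 2032),   # the example key
        treesort_key(1, 2, False, 17),     # hypothetical node in Title 2
    ]
    # Sorting as strings puts Title 2 before Title 11; without padding,
    # "11" would sort before "2".
    print(sorted(keys))
```

With keys like this, print order comes from an ordinary string sort on the treesort field, with no custom comparator.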
Hierarchies: Sample Searches
§  Assuming Node = USC-title2-chapter25
§  Search Children
   •  parentId:USC-title2-chapter25
§  Search All Descendants
   •  ancestors:USC-title2-chapter25
§  Ancestor Siblings
   •  (parentId:USC OR parentId:USC-title2 OR
      parentId:USC-title2-chapter25)
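
The three searches above are simple enough to generate from a node id. A minimal Python sketch that only builds the query strings; the function names are hypothetical, and sending the string to Solr is left out.

```python
def children_query(node_id):
    """Direct children: nodes whose parent is this node."""
    return f"parentId:{node_id}"


def descendants_query(node_id):
    """All descendants: nodes that list this node among their ancestors."""
    return f"ancestors:{node_id}"


def ancestor_siblings_query(ancestor_ids, node_id):
    """Children of every ancestor plus children of the node itself."""
    ids = [*ancestor_ids, node_id]
    return "(" + " OR ".join(f"parentId:{i}" for i in ids) + ")"


if __name__ == "__main__":
    node = "USC-title2-chapter25"
    print(children_query(node))
    print(descendants_query(node))
    print(ancestor_siblings_query(["USC", "USC-title2"], node))
```

The ancestor ids passed to `ancestor_siblings_query` could be taken from the node's own `ancestors` field by splitting it on whitespace.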




Contact
§  Paul Nelson
   •  pnelson@searchtechnologies.com
§  Ronald Matamoros
   •  rmatamoros@searchtechnologies.com
§  Search Technologies
   •  http://searchtechnologies.com




Test Driven Relevancy, Presented by Doug Turnbull at SolrExchage DCTest Driven Relevancy, Presented by Doug Turnbull at SolrExchage DC
Test Driven Relevancy, Presented by Doug Turnbull at SolrExchage DCLucidworks (Archived)
 
Building a data driven search application with LucidWorks SiLK
Building a data driven search application with LucidWorks SiLKBuilding a data driven search application with LucidWorks SiLK
Building a data driven search application with LucidWorks SiLKLucidworks (Archived)
 
Introducing LucidWorks App for Splunk Enterprise webinar
Introducing LucidWorks App for Splunk Enterprise webinarIntroducing LucidWorks App for Splunk Enterprise webinar
Introducing LucidWorks App for Splunk Enterprise webinarLucidworks (Archived)
 

More from Lucidworks (Archived) (20)

Integrating Hadoop & Solr
Integrating Hadoop & SolrIntegrating Hadoop & Solr
Integrating Hadoop & Solr
 
The Data-Driven Paradigm
The Data-Driven ParadigmThe Data-Driven Paradigm
The Data-Driven Paradigm
 
Downtown SF Lucene/Solr Meetup - September 17: Thoth: Real-time Solr Monitori...
Downtown SF Lucene/Solr Meetup - September 17: Thoth: Real-time Solr Monitori...Downtown SF Lucene/Solr Meetup - September 17: Thoth: Real-time Solr Monitori...
Downtown SF Lucene/Solr Meetup - September 17: Thoth: Real-time Solr Monitori...
 
SFBay Area Solr Meetup - July 15th: Integrating Hadoop and Solr
 SFBay Area Solr Meetup - July 15th: Integrating Hadoop and Solr SFBay Area Solr Meetup - July 15th: Integrating Hadoop and Solr
SFBay Area Solr Meetup - July 15th: Integrating Hadoop and Solr
 
SFBay Area Solr Meetup - June 18th: Box + Solr = Content Search for Business
SFBay Area Solr Meetup - June 18th: Box + Solr = Content Search for BusinessSFBay Area Solr Meetup - June 18th: Box + Solr = Content Search for Business
SFBay Area Solr Meetup - June 18th: Box + Solr = Content Search for Business
 
SFBay Area Solr Meetup - June 18th: Benchmarking Solr Performance
SFBay Area Solr Meetup - June 18th: Benchmarking Solr PerformanceSFBay Area Solr Meetup - June 18th: Benchmarking Solr Performance
SFBay Area Solr Meetup - June 18th: Benchmarking Solr Performance
 
Chicago Solr Meetup - June 10th: This Ain't Your Parents' Search Engine
Chicago Solr Meetup - June 10th: This Ain't Your Parents' Search EngineChicago Solr Meetup - June 10th: This Ain't Your Parents' Search Engine
Chicago Solr Meetup - June 10th: This Ain't Your Parents' Search Engine
 
Chicago Solr Meetup - June 10th: Exploring Hadoop with Search
Chicago Solr Meetup - June 10th: Exploring Hadoop with SearchChicago Solr Meetup - June 10th: Exploring Hadoop with Search
Chicago Solr Meetup - June 10th: Exploring Hadoop with Search
 
What's new in solr june 2014
What's new in solr june 2014What's new in solr june 2014
What's new in solr june 2014
 
Minneapolis Solr Meetup - May 28, 2014: Target.com Search
Minneapolis Solr Meetup - May 28, 2014: Target.com SearchMinneapolis Solr Meetup - May 28, 2014: Target.com Search
Minneapolis Solr Meetup - May 28, 2014: Target.com Search
 
Unstructured Or: How I Learned to Stop Worrying and Love the xml, Presented...
Unstructured   Or: How I Learned to Stop Worrying and Love the xml, Presented...Unstructured   Or: How I Learned to Stop Worrying and Love the xml, Presented...
Unstructured Or: How I Learned to Stop Worrying and Love the xml, Presented...
 
Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...
Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...
Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...
 
Big Data Challenges, Presented by Wes Caldwell at SolrExchage DC
Big Data Challenges, Presented by Wes Caldwell at SolrExchage DCBig Data Challenges, Presented by Wes Caldwell at SolrExchage DC
Big Data Challenges, Presented by Wes Caldwell at SolrExchage DC
 
What's New in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DC
What's New  in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DCWhat's New  in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DC
What's New in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DC
 
Solr At AOL, Presented by Sean Timm at SolrExchage DC
Solr At AOL, Presented by Sean Timm at SolrExchage DCSolr At AOL, Presented by Sean Timm at SolrExchage DC
Solr At AOL, Presented by Sean Timm at SolrExchage DC
 
Intro to Solr Cloud, Presented by Tim Potter at SolrExchage DC
Intro to Solr Cloud, Presented by Tim Potter at SolrExchage DCIntro to Solr Cloud, Presented by Tim Potter at SolrExchage DC
Intro to Solr Cloud, Presented by Tim Potter at SolrExchage DC
 
Test Driven Relevancy, Presented by Doug Turnbull at SolrExchage DC
Test Driven Relevancy, Presented by Doug Turnbull at SolrExchage DCTest Driven Relevancy, Presented by Doug Turnbull at SolrExchage DC
Test Driven Relevancy, Presented by Doug Turnbull at SolrExchage DC
 
Building a data driven search application with LucidWorks SiLK
Building a data driven search application with LucidWorks SiLKBuilding a data driven search application with LucidWorks SiLK
Building a data driven search application with LucidWorks SiLK
 
Introducing LucidWorks App for Splunk Enterprise webinar
Introducing LucidWorks App for Splunk Enterprise webinarIntroducing LucidWorks App for Splunk Enterprise webinar
Introducing LucidWorks App for Splunk Enterprise webinar
 
Solr4 nosql search_server_2013
Solr4 nosql search_server_2013Solr4 nosql search_server_2013
Solr4 nosql search_server_2013
 

Recently uploaded

Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DaySri Ambati
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 

Recently uploaded (20)

Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 

Searching the United States Code with Solr/Lucene

  • 1. Searching The United States Code with Solr/Lucene. Paul Nelson (pnelson@searchtechnologies.com) / Ronald Matamoros (rmatamoros@searchtechnologies.com), Search Technologies, 5/25/2011
  • 2. Searching the United States Code
       §  Who are we:
          •  Paul Nelson, Chief Architect
          •  Ronald Matamoros, Lead Engineer
       §  Our Mission: Replace Personal Librarian Search
          •  A 20-Year-Old Search Engine!
       §  Key Challenges
          •  How to index this massive, complex, 85-year-old document?
          •  How to replicate 20-Year-Old search features?
       §  Government Documents are Fun!
  • 3. Search Technologies
       §  The largest independent provider of enterprise search expertise and services
       §  80 full-time dedicated search engine experts
       §  200+ customers
       §  Technology Neutral
          •  (yeah, we know Sphinx too)
       §  Offices All Over
          •  DC, NY, CA, MD, OH, UK, CR…
  • 4. A Quick Civics Lesson…
       §  The United States Code
          •  The general & permanent laws of the U.S. Government – All in one place
          •  51 titles
             §  Agriculture, Armed Forces, Conservation, The President, Food and Drugs, Postal Service, Public Health…
          •  First Version: 1926
       §  The Office of the Law Revision Counsel (OLRC)
          •  20 lawyers who author the U.S. Code
          •  They report to the Speaker of the House of Representatives
       §  Bonus Question: Which Title is the largest?
  • 5. Major Challenges
       1.  Document Parsing
          •  A 50 Volume Table Of Contents!
       2.  Query Parsing
          •  Custom Features (exact case, exact suffix, proximity, query templates, lemmatization, lots of fields…)
       3.  Searching & Highlighting Fields
          •  Some fields are embedded in the document
          •  These fields must be highlighted in context
  • 10. Part The First: Document Processing
  • 11. Document Processing / Indexing
       [Pipeline diagram, stages in slide order: USC Title; Parse & Granularize; Embed Refs; Construct XHTML; Store (Repository); Xform & Index; Solr]
  • 12. Field Type 1: Extracted to Index
       [Callouts on the slide: Page Numbers, Heading, Title, Source Credit]
       <!-- documentid:14_1 usckey:140000000000100000000000000000000 currentthrough:20080108 documentPDFPage:3 -->
       <!-- itempath:/140/PART I/CHAPTER 1/Sec. 1 -->
       <!-- itemsortkey:140AAAD -->
       <!-- expcite:TITLE 14-COAST GUARD!@!PART I-REGULAR COAST GUARD!@!CHAPTER 1-ESTABLISHMENT AND DUTIES!@!Sec. 1 -->
       <!-- field-start:head --><h3 class="section-head">&sect;1. Establishment of Coast Guard</h3>
       <!-- field-end:head -->
       <!-- field-start:statute -->
       <p class="statutory-body">The Coast Guard as established January 28, 1915, shall be a military …
       <!-- field-end:statute -->
       <!-- field-start:sourcecredit -->
       <p class="source-credit">(Aug. 4, 1949, ch. 393, 63 Stat. 496; Pub. L. 94&ndash;546, &sect;1(1),…
       <!-- field-end:sourcecredit -->
       <!-- field-start:notes -->
       <!-- field-start:historicalandrevision-note -->
       <h4 class="note-head">Historical and Revision Notes</h4>
       <p class="note-body">Based on title 14, U.S.C., 1946 ed., &sect;1 (Jan. 28, 1915, ch. 20, &sect;1…
       <!-- field-end:historicalandrevision-note -->
       <!-- field-start:amendment-note -->
       <h4 class="note-head">Amendments</h4>
       <p class="note-body">2002&mdash;Pub. L. 107&ndash;296 substituted &ldquo;Department of …
       <!-- field-end:amendment-note -->
       <!-- field-start:effectivedate-amendment-note -->
       <h4 class="note-head">Effective Date of 2002 Amendment</h4>
       <p class="note-body">Amendment by Pub. L. 107&ndash;296 effective on the date of transfer of …
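  • 13. Document Processing / Indexing
       [Pipeline diagram, stages in slide order: USC Title; Parse & Granularize; Embed Refs; Construct XHTML; Store (Repository); Xform & Index; Solr]
       [Granularization illustration: Title 14 is split into ch. 1, ch. 2, ch. 3 …; pt. A, pt. B, pt. C …; sec. 1, sec. 2, sec. 3 …]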
  • 14. Field Type 2: Embedded Refs
       [Callouts on the slide: Statute at Large, Public Law, USC Refs, Other]
       <!-- documentid:14_1 usckey:140000000000100000000000000000000 currentthrough:20080108 documentPDFPage:3 -->
       <!-- itempath:/140/PART I/CHAPTER 1/Sec. 1 -->
       <!-- itemsortkey:140AAAD -->
       <!-- expcite:TITLE 14-COAST GUARD!@!PART I-REGULAR COAST GUARD!@!CHAPTER 1-ESTABLISHMENT AND DUTIES!@!Sec. 1 -->
       <!-- field-start:head --><h3 class="section-head">&sect;1. Establishment of Coast Guard</h3>
       <!-- field-end:head -->
       <!-- field-start:statute -->
       <p class="statutory-body">The Coast Guard as established January 28, 1915, shall be a military …
       <!-- field-end:statute -->
       <!-- field-start:sourcecredit -->
       <p class="source-credit">(Aug. 4, 1949, ch. 393, 63 Stat. 496; Pub. L. 94&ndash;546, &sect;1(1),…
       <!-- field-end:sourcecredit -->
       <!-- field-start:notes -->
       <!-- field-start:historicalandrevision-note -->
       <h4 class="note-head">Historical and Revision Notes</h4>
       <p class="note-body">Based on title 14, U.S.C., 1946 ed., &sect;1 (Jan. 28, 1915, ch. 20, &sect;1…
       <!-- field-end:historicalandrevision-note -->
       <!-- field-start:amendment-note -->
       <h4 class="note-head">Amendments</h4>
       <p class="note-body">2002&mdash;Pub. L. 107&ndash;296 substituted &ldquo;Department of …
       <!-- field-end:amendment-note -->
       <!-- field-start:effectivedate-amendment-note -->
       <h4 class="note-head">Effective Date of 2002 Amendment</h4>
       <p class="note-body">Amendment by Pub. L. 107&ndash;296 effective on the date of transfer of …
  • 15. Document Processing / Indexing
       [Pipeline diagram, stages in slide order: USC Title; Parse & Granularize; Embed Refs; Construct XHTML; Store (Repository); Xform & Index; Solr]
  • 16. Document Processing / Indexing
       [Pipeline diagram, stages in slide order: USC Title; Parse & Granularize; Embed Refs; Construct XHTML; Store (Repository); Xform & Index; Solr]
       Repository layout:
       §  /US-Code
          §  /2010
             §  /title2
                §  /USC-title2-section1532.htm
                §  /USC-title2-node3-rule5.htm
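The metadata comments shown on the Field Type 1 slide could be pulled into index fields with a small scanner. The following is a minimal Python sketch under stated assumptions, not the presenters' implementation: the key list, the `extract_metadata` name, and the rule that a value runs until the next known key are all hypothetical choices made for illustration.

```python
import re

# Sample granule header, copied from the slide (body truncated).
SAMPLE = """<!-- documentid:14_1 usckey:140000000000100000000000000000000 currentthrough:20080108 documentPDFPage:3 -->
<!-- itempath:/140/PART I/CHAPTER 1/Sec. 1 -->
<!-- itemsortkey:140AAAD -->
<!-- field-start:head --><h3 class="section-head">&sect;1. Establishment of Coast Guard</h3>
<!-- field-end:head -->
"""

# Assumption: these are the only metadata keys; they are the ones visible on the slide.
KEYS = ("documentid", "usckey", "currentthrough", "documentPDFPage",
        "itempath", "itemsortkey", "expcite")

COMMENT_RE = re.compile(r"<!--\s*(.*?)\s*-->", re.S)
KEY_RE = re.compile(r"(?:^|\s)(" + "|".join(KEYS) + r"):")


def extract_metadata(xhtml):
    """Return a dict of key -> value found in metadata comments.

    A value runs from its key up to the next known key (or the end of the
    comment), so values containing spaces, such as itempath, stay intact.
    field-start / field-end markers are ignored because they are not in KEYS.
    """
    meta = {}
    for body in COMMENT_RE.findall(xhtml):
        hits = list(KEY_RE.finditer(body))
        for i, hit in enumerate(hits):
            end = hits[i + 1].start() if i + 1 < len(hits) else len(body)
            meta[hit.group(1)] = body[hit.end():end].strip()
    return meta


if __name__ == "__main__":
    for key, value in extract_metadata(SAMPLE).items():
        print(key, "=", value)
```

Splitting on known keys rather than on whitespace is what lets a path such as `/140/PART I/CHAPTER 1/Sec. 1` survive as a single value.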
  • 17. Part The Second: Token Processing
  • 18. Token Processing 1: xhtml tag tokenizer
       Input:
       <!-- field-start:amendment-note -->
       <h4 class="note-head">Amendments</h4>
       <p class="note-body">2002&mdash;Pub. L. 107&ndash;296 substituted &ldquo;Department of …
       <!-- field-end:amendment-note -->
       Output tokens (one per line):
       <!-- field-start:amendment-note -->
       <h4 class="note-head">
       Amendments
       </h4>
       <p class="note-body">
       2002
       Pub
       L
       107
       296
       Substituted
       Department
       of
       <!-- field-end:amendment-note -->
  • 19. Field Type 3: Marked Within Doc
       <!-- documentid:14_1 usckey:140000000000100000000000000000000 currentthrough:20080108 documentPDFPage:3 -->
       <!-- itempath:/140/PART I/CHAPTER 1/Sec. 1 -->
       <!-- itemsortkey:140AAAD -->
       <!-- expcite:TITLE 14-COAST GUARD!@!PART I-REGULAR COAST GUARD!@!CHAPTER 1-ESTABLISHMENT AND DUTIES!@!Sec. 1 -->
       <!-- field-start:head --><h3 class="section-head">&sect;1. Establishment of Coast Guard</h3>
       <!-- field-end:head -->
       <!-- field-start:statute -->
       <p class="statutory-body">The Coast Guard as established January 28, 1915, shall be a military …
       <!-- field-end:statute -->
       <!-- field-start:sourcecredit -->
       <p class="source-credit">(Aug. 4, 1949, ch. 393, 63 Stat. 496; Pub. L. 94&ndash;546, &sect;1(1),…
       <!-- field-end:sourcecredit -->
       <!-- field-start:notes -->
       <!-- field-start:historicalandrevision-note -->
       <h4 class="note-head">Historical and Revision Notes</h4>
       <p class="note-body">Based on title 14, U.S.C., 1946 ed., &sect;1 (Jan. 28, 1915, ch. 20, &sect;1…
       <!-- field-end:historicalandrevision-note -->
       <!-- field-start:amendment-note -->
       <h4 class="note-head">Amendments</h4>
       <p class="note-body">2002&mdash;Pub. L. 107&ndash;296 substituted &ldquo;Department of …
       <!-- field-end:amendment-note -->
       <!-- field-start:effectivedate-amendment-note -->
       <h4 class="note-head">Effective Date of 2002 Amendment</h4>
       <p class="note-body">Amendment by Pub. L. 107&ndash;296 effective on the date of transfer of …
  • 20. Token Processing 2: Mark Start and End Tags
       Input token                            →  Output token
       <!-- field-start:amendment-note -->    →  S/amendment
       <h4 class="note-head">                 →  <h4 class="note-head">
       Amendments                             →  Amendments
       </h4>                                  →  </h4>
       <p class="note-body">                  →  <p class="note-body">
       2002                                   →  2002
       Pub                                    →  Pub
       L                                      →  L
       107                                    →  107
       296                                    →  296
       Substituted                            →  Substituted
       Department                             →  Department
       of                                     →  of
       <!-- field-end:amendment-note -->      →  E/amendment
  • 21. Token Processing 3: Remove XHTML Tags
       Input token              →  Output token
       S/amendment              →  S/amendment
       <h4 class="note-head">   →  (removed)
       Amendments               →  Amendments
       </h4>                    →  (removed)
       <p class="note-body">    →  (removed)
       2002                     →  2002
       Pub                      →  Pub
       L                        →  L
       107                      →  107
       296                      →  296
       Substituted              →  Substituted
       Department               →  Department
       of                       →  of
       E/amendment              →  E/amendment
  • 22. Token Processing 4: Tag Original Case & Lower Case
       Input token    →  Output tokens
       S/amendment    →  S/amendment
       Amendments     →  O/Amendments  L/amendments
       2002           →  O/2002  L/2002
       Pub            →  O/Pub  L/pub
       L              →  O/L  L/l
       107            →  O/107  L/107
       296            →  O/296  L/296
       Substituted    →  O/Substituted  L/substituted
       Department     →  O/Department  L/department
       of             →  O/of  L/of
       E/amendment    →  E/amendment
  • 23. Token Processing 5: Lemmatize
       Uses a dictionary-based lemmatizer built on GCIDE and WordNet
       Input tokens                     →  Output tokens
       S/amendment                      →  S/amendment
       O/Amendments  L/amendments       →  O/Amendments  L/amendments  amendment
       O/2002  L/2002                   →  O/2002  L/2002  2002
       O/Pub  L/pub                     →  O/Pub  L/pub  pub
       O/L  L/l                         →  O/L  L/l  l
       O/107  L/107                     →  O/107  L/107  107
       O/296  L/296                     →  O/296  L/296  296
       O/Substituted  L/substituted     →  O/Substituted  L/substituted  substitute
       O/Department  L/department       →  O/Department  L/department  department
       O/of  L/of                       →  O/of  L/of  of
       E/amendment                      →  E/amendment
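The five token-processing steps above can be sketched as a single pass in Python. This is an illustrative approximation, not the Lucene analyzer chain the presenters built: the regular expressions, the `analyze` function, the stripping of the `-note` suffix, and the two-entry `TOY_LEMMAS` dictionary (standing in for the GCIDE/WordNet lemmatizer) are all assumptions.

```python
import re

# Step 1: comments and tags become single tokens; entities are matched so they
# can be discarded; anything else alphanumeric is a word token.
TOKEN_RE = re.compile(r"<!--.*?-->|<[^>]+>|&\w+;|[A-Za-z0-9]+", re.S)

# Step 2: field markers. Assumption: a trailing "-note" is dropped, as on the
# slide where "field-start:amendment-note" becomes "S/amendment".
FIELD_RE = re.compile(r"<!--\s*field-(start|end):([\w-]+?)(?:-note)?\s*-->")

# Step 5: toy stand-in for a dictionary-based lemmatizer.
TOY_LEMMAS = {"amendments": "amendment", "substituted": "substitute"}

SAMPLE = (
    '<!-- field-start:amendment-note -->'
    '<h4 class="note-head">Amendments</h4>'
    '<p class="note-body">2002&mdash;Pub. L. 107&ndash;296 substituted '
    '&ldquo;Department of'
    '<!-- field-end:amendment-note -->'
)


def analyze(xhtml):
    """Return one list of terms per token position.

    Field markers yield a single S/ or E/ term; every word yields its
    original-case term (O/), lower-case term (L/), and an untagged lemma.
    """
    positions = []
    for tok in TOKEN_RE.findall(xhtml):
        field = FIELD_RE.fullmatch(tok)
        if field:
            prefix = "S/" if field.group(1) == "start" else "E/"
            positions.append([prefix + field.group(2)])
        elif tok.startswith(("<", "&")):
            continue  # Step 3: drop XHTML tags, other comments, and entities
        else:
            lower = tok.lower()  # Step 4: original-case and lower-case terms
            positions.append(["O/" + tok, "L/" + lower,
                              TOY_LEMMAS.get(lower, lower)])
    return positions


if __name__ == "__main__":
    for terms in analyze(SAMPLE):
        print(" ".join(terms))
```

Keeping all variants of a word at the same position is what allows exact-case, lower-case, and lemma queries to share one index and one set of positions for proximity and highlighting.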
  • 24. Part The Third: Query Processing
  • 25. Query Processing (not all stages shown)
       [Pipeline diagram, stages in slide order: Query String; parse; mark exact:; mark phrases; lemmatize; query template; build lucene query; search]
       §  Communicates via generic QNode Class
          •  Simpler to manipulate than Lucene operators
       §  Can produce FAST FQL as well
          •  (cue the derisive catcalls)
       §  But most importantly:
          •  It is a Query Processing Pipeline
       §  Mix and match query processing modules
  • 26. Query Processing
       Query: exact:FOIA top secret amendment:RECORDS
       [Pipeline diagram, stages in slide order: Query String; parse; mark original; mark lowercase; lemmatize; query template; build lucene query; search]
       Query tree: and( exact:( |FOIA| ), phrase( |top| |secret| ), amendment:( |RECORDS| ) )
  • 27. Query Processing
       Query: exact:FOIA top secret amendment:RECORDS
       [Same pipeline diagram as slide 26]
       Query tree: and( O/FOIA, phrase( |top| |secret| ), amendment:( |RECORDS| ) )
  • 28. Query Processing
       Query: exact:FOIA top secret amendment:RECORDS
       [Same pipeline diagram as slide 26]
       Query tree: and( O/FOIA, phrase( |L/top| |L/secret| ), amendment:( |records| ) )
  • 29. Query Processing
       Query: exact:FOIA top secret amendment:RECORDS
       [Same pipeline diagram as slide 26]
       Query tree: and( O/FOIA, phrase( |L/top| |L/secret| ), amendment:( |record| ) )
  • 30. Query Processing
       Query: exact:FOIA top secret amendment:RECORDS
       [Same pipeline diagram as slide 26]
       Query tree: and( O/FOIA, phrase( |L/top| |L/secret| ), between( S/amendment, |record|, E/amendment ) )
  • 31. The between() Operator
       §  between(start-tag, end-tag, pos-clause, neg-clause)
       §  start-tag → Starting tag, e.g. S/amendment
       §  end-tag → Ending tag, e.g. E/amendment
       §  pos-clause → words which must occur between start and end
          •  Note: Requires a nested ScanAnd() operator
       §  neg-clause → words which must not occur between start and end
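The semantics of between() can be illustrated with a linear scan over token positions. This Python sketch only models the matching rule described on the slide; it is not the custom Lucene operator (which, per the slide, relies on a nested ScanAnd() operator), and the `between` function signature and the sample document are assumptions.

```python
def between(positions, start_tag, end_tag, pos_clause, neg_clause=()):
    """Return True if some start_tag..end_tag span contains every term in
    pos_clause and no term in neg_clause.

    `positions` is a list of term lists, one per token position, in the
    shape produced by an analyzer that stacks S/, E/, O/, L/ and lemma terms.
    """
    required, forbidden = set(pos_clause), set(neg_clause)
    inside, seen = False, set()
    for terms in positions:
        if start_tag in terms:
            inside, seen = True, set()      # a new field span begins
        elif end_tag in terms and inside:
            if required <= seen and not (forbidden & seen):
                return True                 # this span satisfies both clauses
            inside = False
        elif inside:
            seen.update(terms)              # collect terms within the span
    return False


# Hypothetical two-field document: an amendment note and a statute body.
DOC = [
    ["S/amendment"],
    ["O/Records", "L/records", "record"],
    ["E/amendment"],
    ["S/statute"],
    ["O/Guard", "L/guard", "guard"],
    ["E/statute"],
]

if __name__ == "__main__":
    print(between(DOC, "S/amendment", "E/amendment", ["record"]))  # in span
    print(between(DOC, "S/amendment", "E/amendment", ["guard"]))   # wrong field
```

Because fields are delimited by marker tokens rather than stored as separate Lucene fields, a field-restricted search becomes a positional constraint, which is what keeps in-context highlighting possible.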
  • 34. Hierarchies: Requirements
       §  Any number of levels
          §  Title, Sub-Title, Chapter, Sub-Chapter, Part, Sub-Part, Section
       §  Levels vary across titles
          §  Title 1: 3 levels
          §  Title 26: 8 levels
       §  Multiple views:
          §  Children
          §  Ancestors
          §  Ancestor's Siblings
       §  Multiple search scopes:
          §  Only children, all descendants, everything
  • 35. Hierarchies: Ancestor-Siblings
       §  US-Code
          •  Title 1
          •  Title 2
             §  Chapter 1
             §  Chapter 2
                –  Part 1
                –  Part 2
                   •  Section 2.1
                   •  Section 2.2
                –  Part 3
                –  Part 4
             §  Chapter 3
             §  Chapter 4
          •  Title 3
  • 36. Hierarchies: Fields
       §  ancestors
          •  Searching
             §  USC USC-title2 USC-title2-chapter25 USC-title2-chapter25-subchapter2
       §  encodedAncestors – for display only
          •  Where the node exists within the hierarchy
             §  id;heading;subjectTitle//id;heading;subjectTitle//...
             §  USC-title2-chapter25;Chapter 25;Unfunded Mandates Reform//USC-title2-chapter25-subchapter2;Subchapter II;Regulatory Accountability and Reform
       §  parentId – ID of the parent node
          §  USC-title2-chapter25-subchapter2
       §  treesort – Hierarchical sort field, e.g. 13/000/0/00882
  • 37. Hierarchies: Tree Sort
       §  Sorting In Print Order
          •  Front Matter → Titles → Tables → etc.
          •  Everything padded to fixed-length
       Example: 01/011/1/02032
          01 = USC Title Sequence # in file
          011 = Title 11
          1 = An Appendix
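The fixed-length padding is what makes a plain string sort reproduce print order. A minimal Python sketch of building such a key follows; the `treesort_key` function and its parameter names are hypothetical, and the meaning of the last five-digit component is not explained on the slide, so it is treated here simply as an assumed position within the title.

```python
def treesort_key(file_seq, title_number, is_appendix, position):
    """Build a zero-padded key shaped like the slide's example 01/011/1/02032.

    file_seq     -> two digits: USC title sequence number in the file
    title_number -> three digits: title number (e.g. 11 -> 011)
    is_appendix  -> one digit: 1 for an appendix, 0 otherwise
    position     -> five digits: ASSUMED to be the node's order within the title
    """
    return f"{file_seq:02d}/{title_number:03d}/{int(is_appendix)}/{position:05d}"


if __name__ == "__main__":
    keys = [
        treesort_key(1, 11, True, 2032),
        treesort_key(1, 11, False, 15),
        treesort_key(1, 2, False, 900),
    ]
    # Lexicographic order matches numeric order because every part is padded.
    for key in sorted(keys):
        print(key)
```

Without padding, a string sort would place title 11 ahead of title 2; with it, no numeric sort field or custom comparator is needed.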
  • 38. Hierarchies: Sample Searches
       §  Assuming Node = USC-title2-chapter25
       §  Search Children
          •  parentId:USC-title2-chapter25
       §  Search All Descendants
          •  ancestors:USC-title2-chapter25
       §  Ancestor Siblings
          •  (parentId:USC OR parentId:USC-title2 OR parentId:USC-title2-chapter25)
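The hierarchy fields and the three sample searches can be derived from a node ID alone if IDs are hyphen-joined paths, as the examples on the slides suggest. The Python sketch below rests on that assumption; the helper names are hypothetical and the output is plain query strings, not calls to any Solr client API.

```python
def id_prefixes(node_id):
    """Split a path-style ID into itself and all of its ancestors.

    ASSUMPTION: each hierarchy level adds one hyphen-separated segment, e.g.
    USC-title2-chapter25 -> [USC, USC-title2, USC-title2-chapter25].
    """
    parts = node_id.split("-")
    return ["-".join(parts[:i]) for i in range(1, len(parts) + 1)]


def ancestors_field(parent_id):
    """Value of the space-separated `ancestors` field for a child of parent_id."""
    return " ".join(id_prefixes(parent_id))


def children_query(node_id):
    """Direct children only."""
    return f"parentId:{node_id}"


def descendants_query(node_id):
    """Everything below the node, at any depth."""
    return f"ancestors:{node_id}"


def ancestor_siblings_query(node_id):
    """Children of the node and of each of its ancestors, which is the set of
    nodes needed to draw the expanded tree view."""
    clauses = " OR ".join(f"parentId:{p}" for p in id_prefixes(node_id))
    return f"({clauses})"


if __name__ == "__main__":
    node = "USC-title2-chapter25"
    print(ancestors_field("USC-title2-chapter25-subchapter2"))
    print(children_query(node))
    print(descendants_query(node))
    print(ancestor_siblings_query(node))
```

Storing the full ancestor list on every node trades index size for query simplicity: each scope becomes a single term query or a short OR of term queries, with no join or recursive lookup.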
  • 39. Contact
       §  Paul Nelson
          •  pnelson@searchtechnologies.com
       §  Ronald Matamoros
          •  rmatamoros@searchtechnologies.com
       §  Search Technologies
          •  http://searchtechnologies.com