This presentation covers searching the United States Code using Solr/Lucene. It introduces the authors, Paul Nelson and Ronald Matamoros of Search Technologies, and their mission to replace the 20-year-old Personal Librarian search engine used for the US Code. It outlines the key challenges in indexing and searching this massive, complex document, then details the document processing, token processing, and query processing used to search the US Code with Solr/Lucene.
Searching the United States Code with Solr/Lucene
1. Searching The United States Code
with Solr/Lucene
Paul Nelson / Ronald Matamoros, Search Technologies
pnelson@searchtechnologies.com, rmatamoros@searchtechnologies.com
5/25/2011
2. Searching the
United States Code
§ Who are we:
• Paul Nelson, Chief Architect
• Ronald Matamoros, Lead Engineer
§ Our Mission: Replace Personal Librarian Search
• A 20-Year-Old Search Engine!
§ Key Challenges
• How to index this massive, complex, 85-year-old
document?
• How to replicate 20-Year-Old search features?
§ Government Documents are Fun!
3. Search Technologies
§ The largest independent provider of enterprise
search expertise and services
§ 80 full-time dedicated search engine experts
§ 200+ customers
§ Technology Neutral
• (yeah, we know
Sphinx too)
§ Offices All Over
• DC, NY, CA, MD,
OH, UK, CR…
4. A Quick Civics Lesson…
§ The United States Code
• The general & permanent laws of the U.S.
Government – All in one place
• 51 titles
§ Agriculture, Armed Forces, Conservation, The President,
Food and Drugs, Postal Service, Public Health…
• First Version: 1926
§ The Office of the Law Revision Counsel (OLRC)
• 20 lawyers who author the U.S. Code
• They report to the Speaker of the House of
Representatives
§ Bonus Question: Which Title is the largest?
5. Major Challenges
1. Document Parsing
• A 50 Volume Table Of Contents!
2. Query Parsing
• Custom Features (exact case, exact suffix,
proximity, query templates, lemmatization, lots
of fields…)
3. Searching & Highlighting Fields
• Some fields are embedded in the document
• These fields must be highlighted in context
11. Document Processing / Indexing
[Diagram: USC Title → Parse & Granularize → Embed Refs → Construct XHTML → Xform & Store / Index, with outputs to the Repository and Solr]
12. Field Type 1: Extracted to Index
Slide callouts: Page Numbers, Title, Heading, Source Credit
<!-- documentid:14_1 usckey:140000000000100000000000000000000 currentthrough:20080108 documentPDFPage:3 -->
<!-- itempath:/140/PART I/CHAPTER 1/Sec. 1 -->
<!-- itemsortkey:140AAAD -->
<!-- expcite:TITLE 14-COAST GUARD!@!PART I-REGULAR COAST GUARD!@!CHAPTER 1-ESTABLISHMENT AND DUTIES!@!Sec. 1 -->
<!-- field-start:head --><h3 class="section-head">§1. Establishment of Coast Guard</h3>
<!-- field-end:head -->
<!-- field-start:statute -->
<p class="statutory-body">The Coast Guard as established January 28, 1915, shall be a military …
<!-- field-end:statute -->
<!-- field-start:sourcecredit -->
<p class="source-credit">(Aug. 4, 1949, ch. 393, 63 Stat. 496; Pub. L. 94–546, §1(1),…
<!-- field-end:sourcecredit -->
<!-- field-start:notes -->
<!-- field-start:historicalandrevision-note -->
<h4 class="note-head">Historical and Revision Notes</h4>
<p class="note-body">Based on title 14, U.S.C., 1946 ed., §1 (Jan. 28, 1915, ch. 20, §1…
<!-- field-end:historicalandrevision-note -->
<!-- field-start:amendment-note -->
<h4 class="note-head">Amendments</h4>
<p class="note-body">2002—Pub. L. 107–296 substituted “Department of …
<!-- field-end:amendment-note -->
<!-- field-start:effectivedate-amendment-note -->
<h4 class="note-head">Effective Date of 2002 Amendment</h4>
<p class="note-body">Amendment by Pub. L. 107–296 effective on the date of transfer of …
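The field markers in the sample above lend themselves to a simple extraction pass. The following is a minimal sketch in Python, not the authors' implementation: it assumes fields are delimited by matching `field-start:NAME` / `field-end:NAME` comments and that metadata lives in `key:value` comments. The names `extract_fields`, `FIELD_RE`, and `META_RE` are hypothetical, and the sample input is an abbreviated copy of the slide's markup.

```python
import re

# Matches <!-- field-start:NAME --> ... <!-- field-end:NAME --> pairs.
# Nested fields (e.g. notes inside "notes") are not extracted separately here.
FIELD_RE = re.compile(
    r"<!--\s*field-start:(?P<name>[\w-]+)\s*-->(?P<body>.*?)"
    r"<!--\s*field-end:(?P=name)\s*-->",
    re.DOTALL,
)
# Matches single-value metadata comments such as <!-- itemsortkey:140AAAD -->.
META_RE = re.compile(
    r"<!--\s*(?P<key>itempath|itemsortkey|expcite):(?P<value>.*?)\s*-->",
    re.DOTALL,
)
TAG_RE = re.compile(r"<[^>]+>")  # strips markup (and nested comments) from a field body


def extract_fields(xhtml):
    """Return a flat dict of metadata and field text suitable for indexing."""
    doc = {}
    for m in META_RE.finditer(xhtml):
        doc[m.group("key")] = " ".join(m.group("value").split())
    for m in FIELD_RE.finditer(xhtml):
        text = TAG_RE.sub(" ", m.group("body"))
        doc[m.group("name")] = " ".join(text.split())
    return doc


# Abbreviated sample based on the slide's markup.
sample = """<!-- itempath:/140/PART I/CHAPTER 1/Sec. 1 -->
<!-- itemsortkey:140AAAD -->
<!-- field-start:head --><h3 class="section-head">§1. Establishment of Coast Guard</h3>
<!-- field-end:head -->
<!-- field-start:sourcecredit -->
<p class="source-credit">(Aug. 4, 1949, ch. 393, 63 Stat. 496)</p>
<!-- field-end:sourcecredit -->"""

fields = extract_fields(sample)
print(fields["head"])
```

Each key of the resulting dict could then be mapped to a Solr field; how the original system performed that mapping is not shown on the slides.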
13. Document Processing / Indexing
[Diagram: USC Title → Parse & Granularize → Embed Refs → Construct XHTML → Xform & Store / Index, with outputs to the Repository and Solr]
Title 14
ch. 1 ch. 2 ch. 3 …
pt. A pt. B pt. C …
sec. 1 sec. 2 sec. 3 …
14. Field Type 2: Embedded Refs
§ Same Title 14, §1 sample as on slide 12, here with the embedded references called out:
• Statute at Large, e.g. 63 Stat. 496
• Public Law, e.g. Pub. L. 94–546, Pub. L. 107–296
• USC Refs, e.g. title 14, U.S.C., 1946 ed., §1
• Other
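To illustrate how references of this kind might be located in running text, here is a minimal Python sketch. It is an assumption-laden illustration, not the method used in the project: the regular expressions cover only the two citation shapes visible in the sample ("Pub. L. 94–546" and "63 Stat. 496"), and the names `REF_PATTERNS` and `find_refs` are hypothetical.

```python
import re

# Hypothetical patterns for two of the reference types called out on the slide.
REF_PATTERNS = {
    "public_law": re.compile(r"Pub\. L\. (\d+)[–-](\d+)"),
    "statute_at_large": re.compile(r"(\d+) Stat\. (\d+)"),
}


def find_refs(text):
    """Return (kind, citation) pairs in the order they appear in text."""
    hits = []
    for kind, pattern in REF_PATTERNS.items():
        for m in pattern.finditer(text):
            hits.append((m.start(), kind, m.group(0)))
    return [(kind, cite) for _, kind, cite in sorted(hits)]


credit = "(Aug. 4, 1949, ch. 393, 63 Stat. 496; Pub. L. 94–546, §1(1),"
refs = find_refs(credit)
print(refs)
```

Once found, each match could be wrapped in a marker so the reference is both searchable as a field and highlightable in context; the exact marker format the authors used is not shown.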
15. Document Processing / Indexing
[Diagram: USC Title → Parse & Granularize → Embed Refs → Construct XHTML → Xform & Store / Index, with outputs to the Repository and Solr]
25. Query Processing
(not all stages shown)
Query String → parse → mark exact: → mark phrases → lemmatize → query template → build lucene query → search
§ Communicates via generic QNode Class
• Simpler to manipulate than Lucene operators
§ Can produce FAST FQL as well
• (cue the derisive catcalls)
§ But most importantly:
• It is a Query Processing Pipeline
§ Mix and match query processing modules
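The "mix and match" idea can be sketched as a list of functions that each take and return a query tree. The Python below is only an illustration under stated assumptions: the real QNode class is not shown on the slides, so the fields `op`, `value`, and `children` are invented here, and `strip_plural` is a toy stand-in for a real lemmatizer.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class QNode:
    """Hypothetical generic query node; the real class is not shown."""
    op: str                                   # e.g. "and", "phrase", "term"
    value: str = ""                           # token text for "term" nodes
    children: List["QNode"] = field(default_factory=list)


Stage = Callable[[QNode], QNode]


def lowercase_terms(node: QNode) -> QNode:
    """Stage: lowercase every term in the tree."""
    value = node.value.lower() if node.op == "term" else node.value
    return QNode(node.op, value, [lowercase_terms(c) for c in node.children])


def strip_plural(node: QNode) -> QNode:
    """Stage: toy stand-in for lemmatization (drops a trailing 's')."""
    value = node.value
    if node.op == "term" and value.endswith("s") and len(value) > 3:
        value = value[:-1]
    return QNode(node.op, value, [strip_plural(c) for c in node.children])


def run_pipeline(query: QNode, stages: List[Stage]) -> QNode:
    """Apply each stage in order; stages can be reordered or swapped freely."""
    for stage in stages:
        query = stage(query)
    return query


tree = QNode("and", children=[QNode("term", "Top"), QNode("term", "RECORDS")])
result = run_pipeline(tree, [lowercase_terms, strip_plural])
print([c.value for c in result.children])
```

Because every stage shares one signature, a final stage could emit Lucene queries or another engine's syntax without the earlier stages changing.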
26. Query Processing
Query: exact:FOIA top secret amendment:RECORDS
Query String → parse → mark original → mark lowercase → lemmatize → query template → build lucene query → search
Query tree after parse:
and
  exact: |FOIA|
  phrase |top| |secret|
  amendment: |RECORDS|
27. Query Processing
Query: exact:FOIA top secret amendment:RECORDS
Query tree after mark original:
and
  O/FOIA
  phrase |top| |secret|
  amendment: |RECORDS|
28. Query Processing
Query: exact:FOIA top secret amendment:RECORDS
Query tree after mark lowercase:
and
  O/FOIA
  phrase |L/top| |L/secret|
  amendment: |records|
29. Query Processing
Query: exact:FOIA top secret amendment:RECORDS
Query tree after lemmatize:
and
  O/FOIA
  phrase |L/top| |L/secret|
  amendment: |record|
30. Query Processing
Query: exact:FOIA top secret amendment:RECORDS
Query tree after query template:
and
  O/FOIA
  phrase |L/top| |L/secret|
  between
    S/amendment
    |record|
    E/amendment
31. The between() Operator
§ between(start-tag, end-tag, pos-clause, neg-clause)
§ start-tag → Starting tag, e.g. S/amendment
§ end-tag → Ending tag, e.g. E/amendment
§ pos-clause → words which must occur between start and end
• Note: Requires a nested ScanAnd() operator
§ neg-clause → words which must not occur between start and end
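The semantics of between() can be illustrated over a plain list of tokens. This Python sketch is not the Lucene operator from the talk: it assumes start and end tags are indexed as ordinary tokens (such as S/amendment and E/amendment) and simply scans the list. The function signature mirrors the slide, but the implementation is hypothetical.

```python
def between(tokens, start_tag, end_tag, pos_clause, neg_clause=()):
    """Return True if some start_tag..end_tag span contains every token in
    pos_clause and no token in neg_clause."""
    span = None  # tokens collected since the most recent start tag
    for tok in tokens:
        if tok == start_tag:
            span = []
        elif tok == end_tag and span is not None:
            found = set(span)
            has_all = all(t in found for t in pos_clause)
            has_none = not any(t in found for t in neg_clause)
            if has_all and has_none:
                return True
            span = None
        elif span is not None:
            span.append(tok)
    return False


# Token stream for one document: an amendment note wrapped in tag tokens.
tokens = ["guard", "S/amendment", "record", "secret", "E/amendment", "duty"]

print(between(tokens, "S/amendment", "E/amendment", ["record"]))
print(between(tokens, "S/amendment", "E/amendment", ["duty"]))
```

In this sketch the `all(...)` check stands in for the requirement that every positive word fall inside the same span; a real index would evaluate this with token positions rather than a linear scan.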
34. Hierarchies: Requirements
§ Any number of levels
§ Title, Sub-Title, Chapter, Sub-Chapter, Part, Sub-Part,
Section
§ Levels vary across titles
§ Title 1: 3 levels
§ Title 26: 8 levels
§ Multiple views:
§ Children
§ Ancestors
§ Ancestor's Siblings
§ Multiple search scopes:
§ Only children, all descendants, everything
35. Hierarchies: Ancestor-Siblings
§ US-Code
• Title 1
• Title 2
§ Chapter 1
§ Chapter 2
– Part 1
– Part 2
• Section 2.1
• Section 2.2
– Part 3
– Part 4
§ Chapter 3
§ Chapter 4
• Title 3
36. Hierarchies: Fields
§ ancestors
• Searching
§ USC USC-title2 USC-title2-chapter25 USC-title2-chapter25-subchapter2
§ encodedAncestors – for display only
• Where the node exists within the hierarchy
§ id;heading;subjectTitle//id;heading;subjectTitle//...
§ USC-title2-chapter25;Chapter 25;Unfunded Mandates Reform//USC-title2-chapter25-subchapter2;Subchapter II;Regulatory Accountability and Reform
§ parentId – ID of the parent node
§ USC-title2-chapter25-subchapter2
§ treesort – Hierarchical sort field, e.g. 13/000/0/00882
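The fields above can be derived from a node's list of ancestors. The following Python sketch assumes each ancestor is available as an (id, heading, subjectTitle) triple ordered from the top of the hierarchy down to the parent; the function name `hierarchy_fields` is hypothetical. The sample data includes only the two ancestors whose headings appear on the slide.

```python
def hierarchy_fields(ancestor_path):
    """Build the hierarchy fields for a node from its ancestors.

    ancestor_path: list of (id, heading, subjectTitle) triples, ordered from
    the highest ancestor down to the immediate parent.
    """
    ids = [node_id for node_id, _, _ in ancestor_path]
    return {
        # Space-separated IDs, searchable as individual tokens.
        "ancestors": " ".join(ids),
        # Display-only encoding: id;heading;subjectTitle joined by "//".
        "encodedAncestors": "//".join(";".join(a) for a in ancestor_path),
        # The immediate parent is the last ancestor in the path.
        "parentId": ids[-1] if ids else None,
    }


path = [
    ("USC-title2-chapter25", "Chapter 25", "Unfunded Mandates Reform"),
    ("USC-title2-chapter25-subchapter2", "Subchapter II",
     "Regulatory Accountability and Reform"),
]
fields = hierarchy_fields(path)
print(fields["parentId"])
```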
37. Hierarchies: Tree Sort
§ Sorting In Print Order
• Front Matter → Titles → Tables → etc.
• Everything padded to fixed-length
01/011/1/02032
01 = USC Title
011 = Title 11
1 = An Appendix
02032 = Sequence # in file
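Fixed-length padding matters because the key is compared as a string. The Python sketch below builds such a key under an assumed reading of the example: a two-digit group code, a three-digit title number, a one-digit appendix flag, and a five-digit sequence number. The function and parameter names are hypothetical.

```python
def treesort_key(group, title, appendix, sequence):
    """Build a fixed-width, slash-separated sort key.

    The meaning of each component is an assumption based on the slide's
    example (01/011/1/02032), not a documented format.
    """
    return f"{group:02d}/{title:03d}/{int(appendix)}/{sequence:05d}"


key = treesort_key(1, 11, True, 2032)
print(key)

# With padding, plain string sorting puts Title 2 before Title 11.
padded = sorted([treesort_key(1, 11, False, 1), treesort_key(1, 2, False, 1)])
# Without padding, string sorting would put "1/11/..." before "1/2/...".
unpadded = sorted(["1/11/0/1", "1/2/0/1"])
```

Sorting on the padded field in Solr would then return results in print order without any numeric parsing at query time.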
38. Hierarchies: Sample Searches
§ Assuming Node = USC-title2-chapter25
§ Search Children
• parentId:USC-title2-chapter25
§ Search All Descendants
• ancestors:USC-title2-chapter25
§ Ancestor Siblings
• (parentId:USC OR parentId:USC-title2 OR
parentId:USC-title2-chapter25)
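These three searches can be generated from a node ID alone. The Python sketch below assumes, as the examples suggest, that a node's ID is its ancestors' path segments joined by hyphens, so every prefix of the ID is itself a node ID. The function names are hypothetical; the output strings follow the query syntax shown on the slide.

```python
def children_query(node_id):
    """Direct children only: documents whose parent is this node."""
    return f"parentId:{node_id}"


def descendants_query(node_id):
    """All descendants: documents listing this node among their ancestors."""
    return f"ancestors:{node_id}"


def ancestor_siblings_query(node_id):
    """Children of the node and of each of its ancestors.

    Assumes IDs are hyphen-joined path segments (e.g. USC-title2-chapter25),
    so each prefix of the ID names an ancestor.
    """
    parts = node_id.split("-")
    ids = ["-".join(parts[:i]) for i in range(1, len(parts) + 1)]
    return "(" + " OR ".join(f"parentId:{i}" for i in ids) + ")"


node = "USC-title2-chapter25"
print(children_query(node))
print(descendants_query(node))
print(ancestor_siblings_query(node))
```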
39. Contact
§ Paul Nelson
• pnelson@searchtechnologies.com
§ Ronald Matamoros
• rmatamoros@searchtechnologies.com
§ Search Technologies
• http://searchtechnologies.com