SHRINKING THE HAYSTACK WITH SOLR AND NLP
Wes Caldwell
Chief Architect, Intelligent Software Solutions
wes.caldwell@issinc.com
@caldwellw
http://linkd.in/1cfOR79
Topics
•  Introduction to ISS, and our customer base
•  The data challenges our customers are facing
•  Our data processing pipeline (and how Solr and NLP fit in)
•  The document processing eco-system
•  Additional Solr features that we find useful
•  NLP techniques we use
•  Why we use multiple NLP techniques and how they complement each other
•  Quick demo
About ISS
•  Headquartered in Colorado Springs
•  Other offices located in Washington DC, Hampton VA, Tampa FL, and Rome NY
•  Innovative Solutions from “Space to Mud and Everything Between”
•  Sole prime on multiple Air Force Research Labs programs IDIQ
•  Currently Executing More Than 100 Software Development Projects
•  Over 800 employees
•  Strength in Solutions Development and Deployment
•  Consistently Recognized as a Leader
•  Recognized as a Deloitte Fast 50 Colorado company and a Deloitte Fast 500 company over eight consecutive years
•  Three-time Inc. Magazine 500 winner
•  2009 Defense Company of the Year
The data challenge
•  Most electronic information is not relational, but unstructured (textual, binary) or semi-structured (spreadsheet, RSS feed).
   –  In 2007, the estimated information content of all human knowledge was 295 exabytes (295 million terabytes)
   –  Data production will be 44 times greater in 2020 than in 2009
      •  Approximately 35 zettabytes total (35 billion terabytes)
   –  A majority of the data produced in the future will be unstructured
•  Unstructured data is easily processed by human beings, but is more difficult for machines.
•  A tremendous amount of information and knowledge is dormant within unstructured data.
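As a quick check on the units quoted above, here is a worked conversion. It assumes decimal SI prefixes (1 TB = 10^12 bytes). The 2009 figure is only what the 44x ratio implies, not a number stated on the slide.

```latex
\begin{align*}
295\ \text{EB} &= 295 \times 10^{18}\ \text{B} = 295 \times 10^{6}\ \text{TB} \\
35\ \text{ZB}  &= 35 \times 10^{21}\ \text{B} = 35 \times 10^{9}\ \text{TB} \\
\frac{35\ \text{ZB}}{44} &\approx 0.8\ \text{ZB} \quad \text{(implied 2009 volume)}
\end{align*}
```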
Our customers’ data environment
•  Literally thousands of data sources/feeds from a variety of strategic, national, and tactical sources
   –  Media (documents, images, etc.)
   –  Human Interactions
   –  Geospatial
   –  Open Source
   –  Imagery/Video
   –  Many more…
How our analysts feel
The need
•  Analysts are looking to extract knowledge from massive heterogeneous data sets, providing “actionable intelligence”
•  Search and NLP techniques are key enablers that allow an analyst to reliably search for the information they know about, and assist them in discovering the information they don’t know about
•  It is critical (especially in tactical environments) to provide tools to the analyst that allow them to “shrink the haystack” to a more digestible size, and seed that information into an analytics pipeline targeted at a particular problem domain (e.g. counter-terrorism, counter-narcotics, etc.)
   –  Time-to-live on the relevance of data collected can be very short
   –  It’s not about finding the needle in the haystack; it’s about giving a trained analyst the tools to present the most relevant information in a timely manner, allowing them to make an informed decision
Where our journey led us
Our approach
[Diagram: four stages labeled Content Acquisition, Search/Discovery, Semantic Enrichment, and Data Perspectives. Other labels: Data (Structured Content, Semi-Structured Content, Un-Structured Content), Content Cache (Haystacks), Content Index, NLP Pipeline, Gazetteers, Categorization, Named Entity Recognition, Clustering]

Tenets (Content Acquisition)
•  Connector architecture
•  Data normalization
•  Data staging
•  Data Compartmenting (Multiple Haystacks)

Tenets (Search/Discovery)
•  Optimized Index of Content for Search and Discovery of Big Data
•  Analyst Topics that “Shrink the Haystack”
•  Advanced Search Features (Facets, Auto-Complete, Tagging, Comments, etc.)
•  Semantic (Synonym) Search based on pluggable taxonomies

Tenets (Semantic Enrichment)
•  “Domain Spaces” that support pluggable entity recognition and categorization
•  Continuous feedback loop that improves the system over time with analyst input
•  Lexicon-based analytics that allow for targeted categorization across corpus of data

Tenets (Data Perspectives)
•  Data Reduction into focused “Data Perspectives”
•  Data perspectives stored in optimized formats (e.g. Graph, Time Series, Geo, etc.) for the questions being asked
•  Leveraging industry-standard parallel processing frameworks for scalable analytics
Document Processing Pipeline Eco-System
•  Content Management
•  Text Extraction
•  Named-Entity Recognition
•  Geospatial Tagging
•  Clustering/Classification
•  Indexing
Additional Solr features that we find useful
•  Synonym (aka “Semantic Search” to us)
   –  Allows us to load in pre-defined hierarchical synonym sets (driven by lexicons) to provide search that is tuned for a particular customer domain
      •  For example, a search for “weapon” finds various gun types (AK-47, M-16)
   –  Currently implemented at index time
   –  Simple feature to implement, but has proven very powerful as a “practical analytic”
•  Geospatial resolution (used in NLP pipeline)
   –  Loaded GeoNames dataset into a separate Solr core
   –  Allows for quick lookups in geospatial entity resolution
      •  e.g. resolving “Paris” to a latitude/longitude-based geo-coordinate
   –  Can boost based on general rules, or customer-specific ones
      •  For example, which “Paris” is it? The one in France or Texas?
         –  Population could be the boost parameter that returns Paris, France over Paris, Texas
      •  Allows us to easily override for local conditions
         –  For example, if a customer wants all geo resolution to be focused in a particular region of the world (i.e. their AOR)
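
The two features above can be sketched in a few lines. This is plain Python rather than Solr configuration, and it is only a minimal illustration of the ideas, not the ISS implementation. The lexicon, the `Place` records (approximate coordinates and populations), and the `north_america` bounding box are all hypothetical stand-ins; a real lookup would query the Solr core holding GeoNames.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# Hypothetical lexicon: each specific term maps to the broader terms above it.
BROADER_TERMS: Dict[str, List[str]] = {
    "ak-47": ["weapon"],
    "m-16": ["weapon"],
}


def expand_for_index(tokens: List[str]) -> List[str]:
    """Index-time expansion: store broader terms next to each specific term,
    so that a later search for "weapon" also matches a document saying "AK-47"."""
    expanded: List[str] = []
    for token in tokens:
        key = token.lower()
        expanded.append(key)
        expanded.extend(BROADER_TERMS.get(key, []))
    return expanded


@dataclass(frozen=True)
class Place:
    name: str
    region: str
    lat: float
    lon: float
    population: int


# Approximate, illustrative records (assumption: not taken from the slides).
PLACES = [
    Place("Paris", "France", 48.86, 2.35, 2_100_000),
    Place("Paris", "Texas, US", 33.66, -95.56, 25_000),
]


def resolve(
    name: str, in_aor: Optional[Callable[[Place], bool]] = None
) -> Optional[Place]:
    """Pick the best candidate: places inside the AOR win, then larger population."""
    candidates = [p for p in PLACES if p.name.lower() == name.lower()]
    if not candidates:
        return None
    return max(
        candidates,
        key=lambda p: (in_aor(p) if in_aor else False, p.population),
    )


def north_america(p: Place) -> bool:
    """Hypothetical AOR expressed as a rough latitude/longitude bounding box."""
    return 15.0 <= p.lat <= 72.0 and -170.0 <= p.lon <= -50.0
```

With no AOR, `resolve("Paris")` falls back to the population rule and picks the French city. Passing `north_america` as the override makes the Texas entry win despite its smaller population.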
NLP techniques we use
•  Leverage Unstructured Information Management Architecture (UIMA) for NLP pipeline
   –  Supports analysis engines for both GATE/Gazetteer and OpenNLP/supervised machine learning (SML) techniques
   –  Starting to use UIMA-AS (Asynchronous Scaleout) to help in scaling out various pipeline steps
   –  Abstracts vendor-specific NLP engine details, allowing you to plug in different implementations without much disruption
•  GATE/Gazetteer approach
   –  Essentially dictionaries containing key terms used for categorization (facets)
   –  Can have any number of “categories” that are generic, as well as customer-domain defined
•  OpenNLP/Supervised Machine Learning approach
   –  “Context aware” models that are trained by data scientists/SMEs
   –  Based on probabilistic theory (Maximum Entropy)
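
To make the "pluggable engine" and gazetteer ideas concrete, here is a minimal Python sketch. It does not use the UIMA or GATE APIs. The `Annotator` protocol, `GazetteerAnnotator`, and `run_pipeline` are hypothetical names chosen for illustration, and the dictionary contents are made up.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Protocol


@dataclass(frozen=True)
class Annotation:
    start: int
    end: int
    text: str
    category: str
    source: str


class Annotator(Protocol):
    """Anything with an annotate() method can be plugged into the pipeline."""

    def annotate(self, text: str) -> List[Annotation]: ...


class GazetteerAnnotator:
    """Dictionary lookup: tags every occurrence of a listed term with its category."""

    def __init__(self, gazetteer: Dict[str, List[str]]) -> None:
        # One case-insensitive pattern per category; longer terms are tried first.
        self._patterns = {
            category: re.compile(
                r"(?<!\w)(?:"
                + "|".join(
                    re.escape(t) for t in sorted(terms, key=len, reverse=True)
                )
                + r")(?!\w)",
                re.IGNORECASE,
            )
            for category, terms in gazetteer.items()
            if terms
        }

    def annotate(self, text: str) -> List[Annotation]:
        return [
            Annotation(m.start(), m.end(), m.group(0), category, "gazetteer")
            for category, pattern in self._patterns.items()
            for m in pattern.finditer(text)
        ]


def run_pipeline(text: str, annotators: List[Annotator]) -> List[Annotation]:
    """Run each annotator over the text and merge results in document order."""
    found: List[Annotation] = []
    for annotator in annotators:
        found.extend(annotator.annotate(text))
    return sorted(found, key=lambda a: (a.start, a.end))


if __name__ == "__main__":
    gazetteer = {"weapon": ["AK-47", "M-16"], "location": ["Paris"]}
    for a in run_pipeline("An AK-47 was seized near Paris.", [GazetteerAnnotator(gazetteer)]):
        print(a.category, a.text, a.start, a.end)
```

Because `run_pipeline` depends only on the `annotate()` method, a model-based annotator could be added to the list without touching the gazetteer code, which is the kind of decoupling the UIMA bullet describes.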
Why use both NLP approaches?
•  Both approaches have their pros and cons
•  Gazetteer approach
   –  Pros
      •  Good precision: you are going to find what is important to you
      •  Simple for an analyst to “tune”; does not require a data scientist
      •  Quick and easy to add new categories to a problem domain
   –  Cons
      •  Only as good as the gazetteer
      •  Not context aware
•  Supervised Machine Learning approach
   –  Pros
      •  Once properly trained, good at finding new concepts in context
   –  Cons
      •  Requires a data scientist/SME to produce quality models
      •  Can be tedious to train
•  Bottom line: a combined approach helps you find the things you know are relevant, and also helps you find things that are relevant that you may not know about
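
The complementary behaviour can be shown with a toy example. This is a sketch under stated assumptions, not OpenNLP's API: the gazetteer is a two-term set, and the feature weights are hand-set, hypothetical values standing in for a trained Maximum Entropy model. The model scores each label with p(label | features) proportional to exp(sum of the weights of the active features).

```python
import math
import re
from typing import Dict, List, Set, Tuple

# Hypothetical weapon lexicon: the "things you know are relevant".
GAZETTEER = {"ak-47", "m-16"}

# Hypothetical hand-set weights standing in for a trained MaxEnt model.
WEIGHTS: Dict[str, Dict[str, float]] = {
    "weapon": {"prev=with": 1.5, "prev2=armed": 2.0, "has_digit": 0.5},
    "other": {"bias": 1.0},
}


def features(tokens: List[str], i: int) -> Set[str]:
    """Context features for token i: the two preceding words and a digit flag."""
    feats = {"bias"}
    if i >= 1:
        feats.add("prev=" + tokens[i - 1].lower())
    if i >= 2:
        feats.add("prev2=" + tokens[i - 2].lower())
    if any(c.isdigit() for c in tokens[i]):
        feats.add("has_digit")
    return feats


def maxent_probs(feats: Set[str]) -> Dict[str, float]:
    """Softmax over per-label sums of the weights of the active features."""
    scores = {
        label: sum(w.get(f, 0.0) for f in feats) for label, w in WEIGHTS.items()
    }
    z = sum(math.exp(s) for s in scores.values())
    return {label: math.exp(s) / z for label, s in scores.items()}


def tag_weapons(text: str, threshold: float = 0.5) -> Tuple[List[str], List[str]]:
    """Return (known, discovered): gazetteer hits, then model-only hits."""
    tokens = re.findall(r"[\w-]+", text)
    known: List[str] = []
    discovered: List[str] = []
    for i, token in enumerate(tokens):
        if token.lower() in GAZETTEER:
            known.append(token)
        elif maxent_probs(features(tokens, i))["weapon"] > threshold:
            discovered.append(token)
    return known, discovered


if __name__ == "__main__":
    print(tag_weapons("Two men armed with RPG-7 launchers and one AK-47 were seen."))
```

In the sample sentence the gazetteer catches "AK-47" because it is listed, while the context features ("armed with" plus a digit) push "RPG-7" over the threshold even though no dictionary mentions it.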
Additional information
•  Apache Jackrabbit - http://jackrabbit.apache.org/
•  UIMA - http://uima.apache.org/
•  GATE - http://gate.ac.uk/
•  OpenNLP - http://opennlp.apache.org/
•  Boilerpipe - https://code.google.com/p/boilerpipe/
•  Apache Tika - http://tika.apache.org/
•  Geonames - http://www.geonames.org/
Demo
Questions?

How to Make a Field read-only in Odoo 17
 
3.21.24 The Origins of Black Power.pptx
3.21.24  The Origins of Black Power.pptx3.21.24  The Origins of Black Power.pptx
3.21.24 The Origins of Black Power.pptx
 

Shrinking the haystack wes caldwell - final

  • 2. SHRINKING THE HAYSTACK WITH SOLR AND NLP
      Wes Caldwell, Chief Architect, Intelligent Software Solutions
      wes.caldwell@issinc.com, @caldwellw, http://linkd.in/1cfOR79
  • 3. Topics
      • Introduction to ISS, and our customer base
      • The data challenges our customers are facing
      • Our data processing pipeline (and how Solr and NLP fit in)
      • The document processing eco-system
      • Additional Solr features that we find useful
      • NLP techniques we use
      • Why we use multiple NLP techniques and how they complement each other
      • Quick demo
  • 4. About ISS
      • Headquartered in Colorado Springs
          – Other offices located in Washington DC, Hampton VA, Tampa FL, and Rome NY
      • Innovative Solutions from “Space to Mud and Everything Between”
          – Sole prime on multiple Air Force Research Labs programs IDIQ
          – Currently Executing More Than 100 Software Development Projects
          – Over 800 employees
          – Strength in Solutions Development and Deployment
      • Consistently Recognized as a Leader
          – Recognized as a Deloitte Fast 50 Colorado company and a Deloitte Fast 500 company over eight consecutive years
          – Three-time Inc. Magazine 500 winner
          – 2009 Defense Company of the Year
  • 5. The data challenge
      • Most electronic information is not relational, but unstructured (textual, binary) or semi-structured (spreadsheet, RSS feed).
          – In 2007, the estimated information content of all human knowledge was 295 exabytes (295 million terabytes)
          – Data production will be 44 times greater in 2020 than in 2009
              • Approximately 35 zettabytes total (35 billion terabytes)
          – A majority of the data produced in the future will be unstructured
      • Unstructured data is easily processed by human beings, but is more difficult for machines.
      • A tremendous amount of information and knowledge is dormant within unstructured data.
  • 6. Our customer’s data environment
      • Literally thousands of data sources/feeds from a variety of strategic, national, and tactical sources
          – Media (documents, images, etc.)
          – Human Interactions
          – Geospatial
          – Open Source
          – Imagery/Video
          – Many more…
  • 8. The need
      • Analysts are looking to extract knowledge from massive heterogeneous data sets, providing “actionable intelligence”
      • Search and NLP techniques are key enablers that allow an analyst to reliably search for the information they know about, and assist them in discovering the information they don’t know about
      • It is critical (especially in tactical environments) to give analysts tools that allow them to “shrink the haystack” to a more digestible size, and seed that information into an analytics pipeline targeted at a particular problem domain (e.g. C-Terrorism, C-Narcotics, etc.)
          – Time-to-live on the relevance of data collected can be very short
          – It’s not about finding the needle in the haystack, it’s about giving a trained analyst the tools to present the most relevant information in a timely manner, allowing them to make an informed decision
  • 10. Our approach
      [Diagram: four stages, Content Acquisition, Search/Discovery, Semantic Enrichment, and Data Perspectives. Components shown: Data; Structured, Semi-Structured, and Un-Structured Content; Content Cache (Haystacks); Content Index; NLP Pipeline; Gazetteers; Categorization; Named Entity Recognition; Clustering.]
      • Content Acquisition tenets
          – Connector architecture
          – Data normalization
          – Data staging
          – Data Compartmenting (Multiple Haystacks)
      • Search/Discovery tenets
          – Optimized Index of Content for Search and Discovery of Big Data
          – Analyst Topics that “Shrink the Haystack”
          – Advanced Search Features (Facets, Auto-Complete, Tagging, Comments, etc.)
          – Semantic (Synonym) Search based on pluggable taxonomies
      • Semantic Enrichment tenets
          – “Domain Spaces” that support pluggable entity recognition and categorization
          – Continuous feedback loop that improves the system over time with analyst input
          – Lexicon-based analytics that allow for targeted categorization across a corpus of data
      • Data Perspectives tenets
          – Data Reduction into focused “Data Perspectives”
          – Data perspectives stored in optimized formats (e.g. Graph, Time Series, Geo, etc.) for the questions being asked
          – Leveraging industry-standard parallel processing frameworks for scalable analytics
  • 11. Document Processing Pipeline Eco-System
      • Content Management
      • Text Extraction
      • Named-Entity Recognition
      • Geospatial Tagging
      • Clustering/Classification
      • Indexing
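The stages named on this slide can be pictured as a chain of functions that each enrich a document record and pass it on. The following is a minimal sketch only, not the ISS implementation: the stage functions, the dictionary-based document, and the in-memory index are all hypothetical stand-ins (the talk's real stages are backed by tools such as Tika, UIMA, and Solr), and only three of the six stages are shown.

```python
from typing import Callable, Dict, List

Document = Dict[str, object]
Stage = Callable[[Document], Document]


def extract_text(doc: Document) -> Document:
    """Stand-in for text extraction: decode raw bytes into plain text."""
    doc["text"] = bytes(doc["raw"]).decode("utf-8", errors="replace").strip()
    return doc


def recognize_entities(doc: Document) -> Document:
    """Toy named-entity step: treat capitalized tokens as candidate entities."""
    tokens = [t.strip(".,;:") for t in str(doc["text"]).split()]
    doc["entities"] = [t for t in tokens if t[:1].isupper()]
    return doc


def build_index_stage(index: Dict[str, List[str]]) -> Stage:
    """Return an indexing stage that records which documents mention each entity."""
    def index_stage(doc: Document) -> Document:
        for entity in doc["entities"]:
            index.setdefault(entity.lower(), []).append(str(doc["id"]))
        return doc
    return index_stage


def run_pipeline(doc: Document, stages: List[Stage]) -> Document:
    """Apply each stage in order, feeding the output of one into the next."""
    for stage in stages:
        doc = stage(doc)
    return doc


index: Dict[str, List[str]] = {}
stages = [extract_text, recognize_entities, build_index_stage(index)]
result = run_pipeline({"id": "doc-1", "raw": b"Convoy left Paris at dawn."}, stages)
```

Keeping every stage behind the same signature is what makes it cheap to add, remove, or swap a step (for example, inserting geospatial tagging between entity recognition and indexing).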
  • 12. Additional Solr features that we find useful
      • Synonym (aka “Semantic Search” to us)
          – Allows us to load in pre-defined hierarchical synonym sets (driven by lexicons) to provide search that is tuned for a particular customer domain
              • For example, a search for “weapon” finds various gun types (AK-47, M-16)
          – Currently implemented at index time
          – Simple feature to implement, but has proven very powerful as a “practical analytic”
      • Geospatial resolution (used in NLP pipeline)
          – Loaded GeoNames dataset into a separate Solr core
          – Allows for quick lookups in geospatial entity resolution
              • e.g. resolving “Paris” to a latitude/longitude based geo-coordinate
          – Can boost based on general rules, or customer-specific ones
              • For example, which “Paris” is it? The one in France or Texas?
                  – Population could be the boost parameter that returns Paris, France over Paris, Texas
              • Allows us to easily override for local conditions
                  – For example, if a customer wants all geo resolution to be focused in a particular region of the world (i.e. their AOR)
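Both ideas on this slide can be illustrated outside Solr in a few lines of plain Python. This is a sketch under stated assumptions rather than the configuration used in the talk: the lexicon, the two-entry gazetteer, the approximate coordinates and populations, and the bounding box are all illustrative values, and in a real deployment the expansion would typically live in a synonym filter in the index-time analyzer while the ranking would be a boost on a population field in the GeoNames core.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set, Tuple

# Hypothetical hierarchical lexicon: broader term -> narrower terms.
LEXICON: Dict[str, List[str]] = {
    "weapon": ["rifle", "pistol"],
    "rifle": ["ak-47", "m-16"],
}


def index_terms(token: str) -> Set[str]:
    """Index-time expansion: also index a token under every broader term."""
    terms = {token}
    changed = True
    while changed:
        changed = False
        for broader, narrower in LEXICON.items():
            if broader not in terms and any(n in terms for n in narrower):
                terms.add(broader)
                changed = True
    return terms


@dataclass(frozen=True)
class Place:
    name: str
    country: str
    lat: float
    lon: float
    population: int


# Illustrative entries with approximate values, not real GeoNames records.
GAZETTEER = [
    Place("Paris", "FR", 48.8566, 2.3522, 2_100_000),
    Place("Paris", "US", 33.6609, -95.5555, 25_000),
]

BBox = Tuple[float, float, float, float]  # (min_lat, min_lon, max_lat, max_lon)


def resolve(name: str, region: Optional[BBox] = None) -> Optional[Place]:
    """Pick the best candidate: inside the customer's region first, then by population."""
    candidates = [p for p in GAZETTEER if p.name.lower() == name.lower()]
    if not candidates:
        return None

    def score(p: Place) -> Tuple[bool, int]:
        in_region = (
            region is not None
            and region[0] <= p.lat <= region[2]
            and region[1] <= p.lon <= region[3]
        )
        return (in_region, p.population)

    return max(candidates, key=score)


TEXAS_AOR: BBox = (25.0, -107.0, 37.0, -93.0)  # rough, hypothetical area of responsibility
default_paris = resolve("Paris")
texas_paris = resolve("Paris", region=TEXAS_AOR)
```

With no region, population decides and “Paris” resolves to the French entry; with the hypothetical Texas bounding box, the regional override wins. A document containing “ak-47” is indexed under “rifle” and “weapon” as well, which is why a query for “weapon” finds it without any query-time rewriting.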
  • 13. NLP techniques we use
      • Leverage Unstructured Information Management Architecture (UIMA) for NLP pipeline
          – Supports analysis engines for both GATE/Gazetteer and OpenNLP/SML techniques
          – Starting to use UIMA-AS (Asynch) to help in scaling out various pipeline steps
          – Abstracts vendor-specific NLP engine details, hence allowing you to plug in different implementations without much disruption
      • GATE/Gazetteer approach
          – Essentially dictionaries containing key terms used for categorization (facets)
          – Can have n number of “categories” that are generic, as well as customer domain defined
      • OpenNLP/Supervised Machine Learning approach
          – “Context aware” models that are trained by data scientists/SMEs
          – Based on probabilistic theory (Maximum Entropy)
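The gazetteer approach described above amounts to dictionary lookup: each category owns a set of key terms, and any term found in a document becomes a facet value for that category. A minimal sketch in plain Python follows; it does not use the GATE API, and the category names and term lists are hypothetical.

```python
import re
from typing import Dict, List, Set

# Hypothetical gazetteers: category -> key terms (lowercase).
GAZETTEERS: Dict[str, Set[str]] = {
    "weapon": {"ak-47", "m-16"},
    "location": {"paris", "colorado springs"},
}


def tag(text: str) -> Dict[str, List[str]]:
    """Return, per category, the gazetteer terms that occur in the text."""
    lowered = text.lower()
    facets: Dict[str, List[str]] = {}
    for category, terms in GAZETTEERS.items():
        hits = sorted(
            term
            for term in terms
            # Match whole terms only, so "paris" does not fire inside "comparison".
            if re.search(r"(?<!\w)" + re.escape(term) + r"(?!\w)", lowered)
        )
        if hits:
            facets[category] = hits
    return facets


facets = tag("Two AK-47 rifles were seized near Paris.")
```

Adding a category is a matter of adding one more dictionary entry, which is why an analyst can tune this approach without a data scientist. The lookup has no notion of context, though, so a person named Paris would be tagged as a location.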
  • 14. Why use both NLP approaches?
      • Both approaches have their pros and cons
      • Gazetteer approach
          – Pros
              • Good precision: you are going to find what is important to you
              • Simple for an analyst to “tune”; does not require a data scientist
              • Quick and easy to add new categories to a problem domain
          – Cons
              • Only as good as the gazetteer
              • Not context aware
      • Supervised Machine Learning approach
          – Pros
              • Once properly trained, good at finding new concepts in context
          – Cons
              • Requires a data scientist/SME to produce quality models
              • Can be tedious to train
      • Bottom line: a combined approach helps you find the things you know are relevant, and also helps you find things that are relevant that you may not know about
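One way to picture the combined approach is as a merge of two annotation sources: gazetteer hits are accepted outright, while model predictions are accepted only above a confidence threshold and are labeled as model-derived. The sketch below is an assumption about how such a merge could look, not the method used in the talk; the function name, the 0.8 threshold, and the sample probabilities are all hypothetical.

```python
from typing import Dict, Iterable, Tuple


def combine(
    gazetteer_hits: Iterable[str],
    model_hits: Iterable[Tuple[str, float]],
    threshold: float = 0.8,
) -> Dict[str, str]:
    """Merge annotations, recording which source contributed each term."""
    # Gazetteer terms are the "things you know are relevant".
    merged = {term.lower(): "gazetteer" for term in gazetteer_hits}
    # Model terms above the threshold are candidates you may not know about.
    for term, probability in model_hits:
        key = term.lower()
        if probability >= threshold and key not in merged:
            merged[key] = "model"
    return merged


annotations = combine(
    gazetteer_hits=["AK-47"],
    model_hits=[("AK-47", 0.95), ("Dragunov", 0.86), ("convoy", 0.40)],
)
```

Here “Dragunov” is a term the gazetteer does not contain but the (hypothetical) model found in context. Terms labeled `model` are natural candidates for an analyst to review and promote into the gazetteer.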
  • 15. Additional information
      • Apache Jackrabbit - http://jackrabbit.apache.org/
      • UIMA - http://uima.apache.org/
      • GATE - http://gate.ac.uk/
      • OpenNLP - http://opennlp.apache.org/
      • Boilerpipe - https://code.google.com/p/boilerpipe/
      • Apache Tika - http://tika.apache.org/
      • Geonames - http://www.geonames.org/
  • 16. Demo