"Shrinking the Haystack" using Solr and OpenNLP
SHRINKING THE HAYSTACK WITH SOLR AND NLP
Wes Caldwell
Chief Architect, Intelligent Software Solutions
wes.caldwell@issinc.com
@caldwellw
http://linkd.in/1cfOR79
Topics
•  Introduction to ISS, and our customer base
•  The data challenges our customers are facing
•  Our data processing pipeline (and how Solr and NLP fit in)
•  The document processing eco-system
•  Additional Solr features that we find useful
•  NLP techniques we use
•  Why we use multiple NLP techniques and how they complement each other
•  Quick demo
About ISS
•  Headquartered in Colorado Springs
•  Other offices located in Washington DC, Hampton VA, Tampa FL, and Rome NY
•  Innovative Solutions from “Space to Mud and Everything Between”
•  Sole prime on multiple Air Force Research Labs programs IDIQ
•  Currently Executing More Than 100 Software Development Projects
•  Over 800 employees
•  Strength in Solutions Development and Deployment
•  Consistently Recognized as a Leader
•  Recognized as a Deloitte Fast 50 Colorado company and a Deloitte Fast 500 company over eight consecutive years
•  Three-time Inc. Magazine 500 winner
•  2009 Defense Company of the Year
The data challenge
•  Most electronic information is not relational, but unstructured (textual, binary) or semi-structured (spreadsheet, RSS feed).
    –  In 2007, the estimated information content of all human knowledge was 295 exabytes (295 million terabytes)
    –  Data production will be 44 times greater in 2020 than in 2009
        •  Approximately 35 zettabytes total (35 billion terabytes)
    –  A majority of the data produced in the future will be unstructured
•  Unstructured data is easily processed by human beings, but is more difficult for machines.
•  A tremendous amount of information and knowledge is dormant within unstructured data.
Our customers’ data environment
•  Literally thousands of data sources/feeds from a variety of strategic, national, and tactical sources
    –  Media (documents, images, etc.)
    –  Human Interactions
    –  Geospatial
    –  Open Source
    –  Imagery/Video
    –  Many more…
How our analysts feel
The need
•  Analysts are looking to extract knowledge from the massive heterogeneous data sets, providing “actionable intelligence”
•  Search and NLP techniques are key enablers that allow an analyst to reliably search for the information they know about, and assist them in discovering the information they don’t know about
•  It is critical (especially in tactical environments) to provide tools to the analyst that allow them to “shrink the haystack” to a more digestible size, and seed that information into an analytics pipeline targeted at a particular problem domain (e.g. C-Terrorism, C-Narcotics, etc.)
    –  Time-to-live on the relevance of data collected can be very short
    –  It’s not about finding the needle in the haystack; it’s about giving a trained analyst the tools to present the most relevant information in a timely manner, allowing them to make an informed decision
Where our journey led us
Our approach
Stages: Content Acquisition, Search/Discovery, Semantic Enrichment, Data Perspectives
Diagram elements: Data, Gazetteers, Structured Content, Semi-Structured Content, Un-Structured Content, NLP Pipeline, Content Index, Content Cache (Haystacks)

Tenets: Content Acquisition
•  Connector architecture
•  Data normalization
•  Data staging
•  Data Compartmenting (Multiple Haystacks)

Tenets: Search/Discovery
•  Optimized Index of Content for Search and Discovery of Big Data
•  Analyst Topics that “Shrink the Haystack”
•  Advanced Search Features (Facets, Auto-Complete, Tagging, Comments, etc.)
•  Semantic (Synonym) Search based on pluggable taxonomies

Tenets: Semantic Enrichment
(Diagram: Categorization, Named Entity Recognition, Clustering)
•  “Domain Spaces” that support pluggable entity recognition and categorization
•  Continuous feedback loop that improves the system over time with analyst input
•  Lexicon-based analytics that allows for targeted categorization across corpus of data

Tenets: Data Perspectives
•  Data Reduction into focused “Data Perspectives”
•  Data perspectives stored in optimized formats (e.g. Graph, Time Series, Geo, etc.) for the questions being asked
•  Leveraging industry-standard parallel processing frameworks for scalable analytics
Document Processing Pipeline Eco-System

•  Content Management
•  Text Extraction
•  Named-Entity Recognition
•  Geospatial Tagging
•  Clustering/Classification
•  Indexing
Additional Solr features that we find useful
•  Synonym (aka “Semantic Search” to us)
    –  Allows us to load in pre-defined hierarchical synonym sets (driven by lexicons) to provide search that is tuned for a particular customer domain
        •  For example, a search for “weapon” finds various gun types (AK-47, M-16)
    –  Currently implemented at index time
    –  Simple feature to implement, but has proven very powerful as a “practical analytic”
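
To make the index-time behaviour concrete, here is a small Python model of hierarchical synonym expansion. It is a sketch of the idea only: in Solr this is normally configured through a synonym filter in the index-time analyzer chain rather than written by hand, and the lexicon entries below (beyond the slide's "weapon", AK-47 and M-16 example) are assumptions made up for illustration.

```python
from collections import defaultdict

# Hypothetical hierarchical lexicon: narrower term -> broader term.
LEXICON_PARENTS = {
    "ak-47": "rifle",
    "m-16": "rifle",
    "rifle": "gun",
    "gun": "weapon",
}


def expand(term):
    """Return the term followed by all of its broader terms."""
    terms = [term]
    while terms[-1] in LEXICON_PARENTS:
        terms.append(LEXICON_PARENTS[terms[-1]])
    return terms


def build_index(docs):
    """Build an inverted index, adding broader terms at index time."""
    inverted = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            for term in expand(token):
                inverted[term].add(doc_id)
    return inverted


def search(inverted, query):
    """Plain term lookup; no query-time expansion is needed."""
    return sorted(inverted.get(query.lower(), set()))


docs = {
    1: "Cache contained an AK-47",
    2: "Patrol recovered one M-16",
    3: "Weather report for the region",
}
inverted = build_index(docs)
print(search(inverted, "weapon"))  # [1, 2]
```

Expanding at index time keeps queries cheap, at the cost of having to re-index when the lexicon changes.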
•  Geospatial resolution (used in NLP pipeline)
    –  Loaded GeoNames dataset into a separate Solr core
    –  Allows for quick lookups in geospatial entity resolution
        •  e.g. resolving “Paris” to a latitude/longitude-based geo-coordinate
    –  Can boost based on general rules, or customer-specific ones
        •  For example, which “Paris” is it? The one in France or Texas?
            –  Population could be the boost parameter that returns Paris, France over Paris, Texas
        •  Allows us to easily override for local conditions
            –  For example, if a customer wants all geo resolution to be focused in a particular region of the world (i.e. their area of responsibility, or AOR)
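
The population boost and the AOR override described above can be sketched as a simple scoring rule. This is a hedged illustration in plain Python, not the Solr query the deck refers to: the two gazetteer rows use approximate coordinates and populations, and the `resolve` function, the bounding-box format, and the `aor_boost` factor are assumptions introduced for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Place:
    name: str
    country: str
    lat: float
    lon: float
    population: int


# Hypothetical gazetteer rows; values are approximate and for illustration only.
GAZETTEER = [
    Place("Paris", "FR", 48.85, 2.35, 2_100_000),
    Place("Paris", "US", 33.66, -95.56, 25_000),
]


def in_bbox(place, bbox):
    """bbox is (min_lat, min_lon, max_lat, max_lon)."""
    min_lat, min_lon, max_lat, max_lon = bbox
    return min_lat <= place.lat <= max_lat and min_lon <= place.lon <= max_lon


def resolve(name, aor=None, aor_boost=1000.0):
    """Pick the best candidate for a place name.

    General rule: larger population wins.
    Customer rule: candidates inside the AOR bounding box get a large boost.
    """
    candidates = [p for p in GAZETTEER if p.name.lower() == name.lower()]
    if not candidates:
        return None

    def score(place):
        s = float(place.population)
        if aor is not None and in_bbox(place, aor):
            s *= aor_boost
        return s

    return max(candidates, key=score)


# Rough bounding box around Texas, used as an example AOR.
TEXAS = (25.8, -106.7, 36.5, -93.5)
print(resolve("Paris").country)             # FR
print(resolve("Paris", aor=TEXAS).country)  # US
```

A multiplicative boost keeps population as the tie-breaker inside the AOR while still letting the regional rule dominate the general one.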
NLP techniques we use
•  Leverage Unstructured Information Management Architecture (UIMA) for NLP pipeline
    –  Supports analysis engines for both GATE/Gazetteer and OpenNLP/SML techniques
    –  Starting to use UIMA-AS (Asynchronous Scaleout) to help in scaling out various pipeline steps
    –  Abstracts vendor-specific NLP engine details, hence allowing you to plug in different implementations without much disruption
•  GATE/Gazetteer approach
    –  Essentially dictionaries containing key terms used for categorization (facets)
    –  Can have any number of “categories” that are generic, as well as customer-domain defined
•  OpenNLP/Supervised Machine Learning approach
    –  “Context aware” models that are trained by data scientists/SMEs
    –  Based on probabilistic theory (Maximum Entropy)
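
As a rough illustration of the two ideas on this slide, the sketch below shows a common annotator interface with a gazetteer implementation plugged into it. It is written in plain Python and does not use the UIMA or GATE APIs: the `Annotator` protocol, `GazetteerAnnotator`, `run_pipeline`, and `facets` are hypothetical names chosen for the example.

```python
import re
from typing import Dict, List, Protocol, Set, Tuple

Annotation = Tuple[str, str]  # (matched text, category)


class Annotator(Protocol):
    """Anything with an annotate() method can be plugged into the pipeline."""

    def annotate(self, text: str) -> List[Annotation]: ...


class GazetteerAnnotator:
    """Dictionary lookup: each category is a set of key terms."""

    def __init__(self, categories: Dict[str, Set[str]]):
        self.categories = {
            name: {term.lower() for term in terms}
            for name, terms in categories.items()
        }

    def annotate(self, text: str) -> List[Annotation]:
        found = []
        for token in re.findall(r"[\w-]+", text):
            for category, terms in self.categories.items():
                if token.lower() in terms:
                    found.append((token, category))
        return found


def run_pipeline(text: str, annotators: List[Annotator]) -> List[Annotation]:
    """Run every annotator over the text and collect the results."""
    annotations: List[Annotation] = []
    for annotator in annotators:
        annotations.extend(annotator.annotate(text))
    return annotations


def facets(annotations: List[Annotation]) -> Dict[str, int]:
    """Count annotations per category, as a search facet would."""
    counts: Dict[str, int] = {}
    for _, category in annotations:
        counts[category] = counts.get(category, 0) + 1
    return counts


gazetteer = GazetteerAnnotator({
    "weapon": {"AK-47", "M-16"},   # customer-domain category
    "vehicle": {"truck"},          # generic category
})
annotations = run_pipeline("Two AK-47 rifles and an M-16 in a truck", [gazetteer])
print(facets(annotations))  # {'weapon': 2, 'vehicle': 1}
```

A model-based annotator would implement the same `annotate` method, so it could be added to the list passed to `run_pipeline` without changing the calling code.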
Why use both NLP approaches?
•  Both approaches have their pros and cons
•  Gazetteer approach
    –  Pros
        •  Good precision: you are going to find what is important to you
        •  Simple for an analyst to “tune”; does not require a data scientist
        •  Quick and easy to add new categories to a problem domain
    –  Cons
        •  Only as good as the gazetteer
        •  Not context aware
•  Supervised Machine Learning approach
    –  Pros
        •  Once properly trained, good at finding new concepts in context
    –  Cons
        •  Requires a data scientist/SME to produce quality models
        •  Can be tedious to train
•  Bottom line: a combined approach helps you find the things you know are relevant, and also helps you find things that are relevant that you may not know about
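
The trade-off can be shown with a toy example that combines a dictionary lookup with a Maximum Entropy style scorer. This is a sketch under stated assumptions, not OpenNLP: the weights are hand-set rather than trained, and the feature names, the `threshold`, and the example sentence are all made up for illustration. The gazetteer finds the term it already knows, while the context features flag an unknown term that appears where a weapon name would be expected.

```python
import math
import re

# Known terms (the "things you know are relevant").
GAZETTEER = {"ak-47": "weapon", "m-16": "weapon"}

# Hand-set weights standing in for a trained Maximum Entropy model
# (hypothetical values). Key: (feature, label) -> weight.
WEIGHTS = {
    ("prev=with", "weapon"): 1.5,
    ("prev2=armed", "weapon"): 2.5,
    ("bias", "other"): 1.0,
}
LABELS = ("weapon", "other")


def features(tokens, i):
    """Context features for the token at position i."""
    feats = ["bias"]
    if i >= 1:
        feats.append("prev=" + tokens[i - 1].lower())
    if i >= 2:
        feats.append("prev2=" + tokens[i - 2].lower())
    return feats


def maxent_probs(feats):
    """p(label | feats) = exp(sum of active weights) / normalizer."""
    scores = {
        label: math.exp(sum(WEIGHTS.get((f, label), 0.0) for f in feats))
        for label in LABELS
    }
    z = sum(scores.values())
    return {label: s / z for label, s in scores.items()}


def tag(text, threshold=0.8):
    """Return {token: source} for tokens tagged as weapons."""
    tokens = re.findall(r"[\w-]+", text)
    hits = {}
    for i, token in enumerate(tokens):
        if token.lower() in GAZETTEER:
            hits[token] = "gazetteer"
        elif maxent_probs(features(tokens, i))["weapon"] >= threshold:
            hits[token] = "model"
    return hits


print(tag("Guard carried an AK-47 and was armed with Dragunov"))
```

Here "AK-47" is tagged by the gazetteer and "Dragunov", which is not in the dictionary, is tagged by the context scorer because it follows "armed with".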
Additional information
•  Apache Jackrabbit - http://jackrabbit.apache.org/
•  UIMA - http://uima.apache.org/
•  GATE - http://gate.ac.uk/
•  OpenNLP - http://opennlp.apache.org/
•  Boilerpipe - https://code.google.com/p/boilerpipe/
•  Apache Tika - http://tika.apache.org/
•  GeoNames - http://www.geonames.org/
Demo
Questions?

KAI, the Information SpecialistKAI, the Information Specialist
KAI, the Information Specialist
 

"Shrinking the Haystack" using Solr and OpenNLP

  • 2. SHRINKING THE HAYSTACK WITH SOLR AND NLP
      Wes Caldwell, Chief Architect, Intelligent Software Solutions
      wes.caldwell@issinc.com | @caldwellw | http://linkd.in/1cfOR79
  • 3. Topics
      • Introduction to ISS, and our customer base
      • The data challenges our customers are facing
      • Our data processing pipeline (and how Solr and NLP fit in)
      • The document processing eco-system
      • Additional Solr features that we find useful
      • NLP techniques we use
      • Why we use multiple NLP techniques and how they complement each other
      • Quick demo
  • 4. About ISS
      • Headquartered in Colorado Springs
        – Other offices located in Washington DC, Hampton VA, Tampa FL, and Rome NY
      • Innovative Solutions from “Space to Mud and Everything Between”
        – Sole prime on multiple Air Force Research Labs programs IDIQ
        – Currently executing more than 100 software development projects
        – Over 800 employees
        – Strength in solutions development and deployment
      • Consistently Recognized as a Leader
        – Recognized as a Deloitte Fast 50 Colorado company and a Deloitte Fast 500 company over eight consecutive years
        – Three-time Inc. Magazine 500 winner
        – 2009 Defense Company of the Year
  • 5. The data challenge
      • Most electronic information is not relational, but unstructured (textual, binary) or semi-structured (spreadsheet, RSS feed).
        – In 2007, the estimated information content of all human knowledge was 295 exabytes (295 million terabytes)
        – Data production will be 44 times greater in 2020 than in 2009
          • Approximately 35 zettabytes total (35 billion terabytes)
        – A majority of the data produced in the future will be unstructured
      • Unstructured data is easily processed by human beings, but is more difficult for machines.
      • A tremendous amount of information and knowledge is dormant within unstructured data.
  • 6. Our customer’s data environment
      • Literally thousands of data sources/feeds from a variety of strategic, national, and tactical sources
        – Media (documents, images, etc.)
        – Human Interactions
        – Geospatial
        – Open Source
        – Imagery/Video
        – Many more…
  • 8. The need
      • Analysts are looking to extract knowledge from massive heterogeneous data sets, providing “actionable intelligence”
      • Search and NLP techniques are key enablers: they allow an analyst to reliably search for the information they know about, and assist them in discovering the information they don’t know about
      • It is critical (especially in tactical environments) to provide tools that allow the analyst to “shrink the haystack” to a more digestible size, and seed that information into an analytics pipeline targeted at a particular problem domain (e.g. C-Terrorism, C-Narcotics, etc.)
        – Time-to-live on the relevance of data collected can be very short
        – It’s not about finding the needle in the haystack, it’s about giving a trained analyst the tools to present the most relevant information in a timely manner, allowing them to make an informed decision
  • 10. Our approach
      [Diagram labels: Data, Gazetteers, Structured Content, Semi-Structured Content, Un-Structured Content, Content Cache (Haystacks), Content Index, NLP Pipeline, Categorization, Named Entity Recognition, Clustering]
      • Content Acquisition (tenets)
        – Connector architecture
        – Data normalization
        – Data staging
        – Data compartmenting (multiple haystacks)
      • Search/Discovery (tenets)
        – Optimized index of content for search and discovery of Big Data
        – Analyst topics that “shrink the haystack”
        – Advanced search features (facets, auto-complete, tagging, comments, etc.)
        – Semantic (synonym) search based on pluggable taxonomies
      • Semantic Enrichment (tenets)
        – “Domain spaces” that support pluggable entity recognition and categorization
        – Continuous feedback loop that improves the system over time with analyst input
        – Lexicon-based analytics that allows for targeted categorization across a corpus of data
      • Data Perspectives (tenets)
        – Data reduction into focused “Data Perspectives”
        – Data perspectives stored in optimized formats (e.g. Graph, Time Series, Geo, etc.) for the questions being asked
        – Leveraging industry-standard parallel processing frameworks for scalable analytics
  • 11. Document Processing Pipeline Eco-System
      Stages, in the order listed on the slide:
      • Content Management
      • Text Extraction
      • Named-Entity Recognition
      • Geospatial Tagging
      • Clustering/Classification
      • Indexing
  • 12. Additional Solr features that we find useful
      • Synonym (aka “Semantic Search” to us)
        – Allows us to load in pre-defined hierarchical synonym sets (driven by lexicons) to provide search that is tuned for a particular customer domain
          • For example, a search for “weapon” finds various gun types (AK-47, M-16)
        – Currently implemented at index time
        – Simple feature to implement, but has proven very powerful as a “practical analytic”
      • Geospatial resolution (used in NLP pipeline)
        – Loaded GeoNames dataset into a separate Solr core
        – Allows for quick lookups in geospatial entity resolution
          • e.g. resolving “Paris” to a latitude/longitude-based geo-coordinate
        – Can boost based on general rules, or customer-specific ones
          • For example, which “Paris” is it? The one in France or Texas?
            – Population could be the boost parameter that returns Paris, France over Paris, Texas
          • Allows us to easily override for local conditions
            – For example, if a customer wants all geo resolution to be focused in a particular region of the world (i.e. their AOR)
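The two ideas on this slide can be sketched in plain Python. This is only an illustration of the logic, not the speaker's Solr configuration: the lexicon, the `Place` records (with rough, illustrative coordinates and populations), the bounding-box form of the AOR, and all function names are assumptions made for the example. It shows index-time expansion of a token with its broader terms, and picking among same-named places by population unless an AOR override applies.

```python
from dataclasses import dataclass

# Hypothetical lexicon (assumed acyclic): each broad term maps to narrower terms.
LEXICON = {
    "weapon": ["rifle", "pistol"],
    "rifle": ["ak-47", "m-16"],
}

def broader_terms(term, lexicon=LEXICON):
    """Return every ancestor of `term` in the hierarchy, nearest first."""
    parents = {child: parent
               for parent, children in lexicon.items()
               for child in children}
    found = []
    current = term.lower()
    while current in parents:
        current = parents[current]
        found.append(current)
    return found

def expand_for_index(tokens, lexicon=LEXICON):
    """Index-time expansion: emit each token followed by its broader terms,
    so that a later search for "weapon" also matches a document saying "AK-47"."""
    expanded = []
    for token in tokens:
        expanded.append(token)
        expanded.extend(broader_terms(token, lexicon))
    return expanded

@dataclass(frozen=True)
class Place:
    name: str
    country: str
    lat: float
    lon: float
    population: int

# Illustrative records only; a real system would query a gazetteer such as GeoNames.
PLACES = [
    Place("Paris", "FR", 48.86, 2.35, 2_100_000),
    Place("Paris", "US", 33.66, -95.56, 25_000),
]

def in_box(place, box):
    """True if the place lies inside (min_lat, min_lon, max_lat, max_lon)."""
    min_lat, min_lon, max_lat, max_lon = box
    return min_lat <= place.lat <= max_lat and min_lon <= place.lon <= max_lon

def resolve_place(name, candidates=PLACES, aor=None):
    """Prefer candidates inside the AOR box (if given), then higher population."""
    matches = [p for p in candidates if p.name.lower() == name.lower()]
    if not matches:
        return None
    return max(matches,
               key=lambda p: (aor is not None and in_box(p, aor), p.population))
```

With no AOR, `resolve_place("Paris")` returns the French record because of its larger population; passing a box that covers Texas, such as `(25.0, -107.0, 37.0, -93.0)`, flips the result to the Texas record.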
  • 13. NLP techniques we use
      • Leverage Unstructured Information Management Architecture (UIMA) for NLP pipeline
        – Supports analysis engines for both GATE/Gazetteer and OpenNLP/SML techniques
        – Starting to use UIMA-AS (Asynch) to help in scaling out various pipeline steps
        – Abstracts vendor-specific NLP engine details, hence allowing you to plug in different implementations without much disruption
      • GATE/Gazetteer approach
        – Essentially dictionaries containing key terms used for categorization (facets)
        – Can have n number of “categories” that are generic, as well as customer domain defined
      • OpenNLP/Supervised Machine Learning approach
        – “Context aware” models that are trained by data scientists/SMEs
        – Based on probabilistic theory (Maximum Entropy)
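The gazetteer idea, dictionaries of key terms grouped into categories, can be illustrated with a small dictionary tagger. This is a standalone Python sketch and does not use GATE or UIMA; the `GAZETTEER` contents and the function names are hypothetical.

```python
import re

# Hypothetical gazetteer: category name -> key terms.
GAZETTEER = {
    "weapons": ["AK-47", "M-16"],
    "locations": ["Paris", "Colorado Springs"],
}

def build_pattern(terms):
    """Compile one case-insensitive pattern matching any term as a whole word."""
    # Longest terms first, so multi-word entries win over their prefixes.
    ordered = sorted(terms, key=len, reverse=True)
    alternation = "|".join(re.escape(term) for term in ordered)
    return re.compile(r"(?<!\w)(?:" + alternation + r")(?!\w)", re.IGNORECASE)

def tag(text, gazetteer=GAZETTEER):
    """Return (start, end, category, matched_text) tuples sorted by position."""
    hits = []
    for category, terms in gazetteer.items():
        for match in build_pattern(terms).finditer(text):
            hits.append((match.start(), match.end(), category, match.group(0)))
    return sorted(hits)
```

Adding a category is just another dictionary entry, which is why an analyst can tune it without a data scientist. The lookup is not context aware, though: `tag("Paris Hilton arrived")` still labels "Paris" as a location.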
  • 14. Why use both NLP approaches?
      • Both approaches have their pros and cons
      • Gazetteer approach
        – Pros
          • Good precision: you are going to find what is important to you
          • Simple for an analyst to “tune”; does not require a data scientist
          • Quick and easy to add new categories to a problem domain
        – Cons
          • Only as good as the gazetteer
          • Not context aware
      • Supervised Machine Learning approach
        – Pros
          • Once properly trained, good at finding new concepts in context
        – Cons
          • Requires a data scientist/SME to produce quality models
          • Can be tedious to train
      • Bottom line: a combined approach helps you find the things you know are relevant, and also helps you find things that are relevant that you may not know about
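One way to picture the combined approach is to merge the annotations from both techniques and record which technique produced each one. The Python sketch below assumes a hypothetical span format `(start, end, label)`, stubbed model output with confidence scores, and an arbitrary confidence threshold; it does not call OpenNLP or GATE and is not the speaker's implementation.

```python
from collections import defaultdict

def merge_annotations(gazetteer_hits, model_hits, min_confidence=0.5):
    """Union the two annotation sets, keyed by (start, end, label).

    gazetteer_hits: iterable of (start, end, label) spans from dictionary lookup.
    model_hits: iterable of ((start, end, label), confidence) pairs from a
                trained model; predictions below min_confidence are dropped.
    Returns a dict mapping each span to the set of sources that found it.
    """
    merged = defaultdict(set)
    for span in gazetteer_hits:
        merged[span].add("gazetteer")
    for span, confidence in model_hits:
        if confidence >= min_confidence:
            merged[span].add("model")
    return dict(sorted(merged.items()))

# Stubbed inputs for illustration.
gazetteer_hits = [(3, 8, "weapon")]
model_hits = [
    ((3, 8, "weapon"), 0.9),      # both techniques agree
    ((24, 32, "person"), 0.7),    # found only by the model, in context
    ((40, 45, "location"), 0.2),  # low confidence, dropped
]
combined = merge_annotations(gazetteer_hits, model_hits)
```

Spans found by both sources are the "things you know are relevant"; spans found only by the model are candidates for things the gazetteer does not yet cover, which an analyst could review and feed back into the dictionaries.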
  • 15. Additional information
      • Apache Jackrabbit - http://jackrabbit.apache.org/
      • UIMA - http://uima.apache.org/
      • GATE - http://gate.ac.uk/
      • OpenNLP - http://opennlp.apache.org/
      • Boilerpipe - https://code.google.com/p/boilerpipe/
      • Apache Tika - http://tika.apache.org/
      • Geonames - http://www.geonames.org/
  • 16. Demo