Large Scale Search, Discovery and Analytics in Action

Large Scale Search, Discovery
and Analysis in Action

Grant Ingersoll
Chief Scientist

September 18, 2012

Confidential © Copyright 2012

Search is Dead, Long Live Search

• Good keyword search is a
commodity and easy to get
up and running Documents

• The Bar is Raised
• Relevance is (always will
be?) hard
Content User
• Holistic view of the data Relationships Interaction

AND the users is critical

• Search, Discovery and
Analytics are the key to
unlocking this view of users Access
and data

Confidential and Proprietary
© 2012 LucidWorks

Topics

• Background and needs

• Architecture
• Road Ahead

• SDA In Action
• Components
• Challenges and Lessons Learned

• Wrap Up
© 2012 LucidWorks

Sample Use Cases

• Claims processing and analysis, including fraud analysis

• Large scale content acquisition and access for:
• Defense, intelligence and pharmaceutical applications
• Views of data surrounding natural disasters and other tragedies
for research, archiving and therapeutic purposes

• Analysis of Website and social media interactions

• Access and processing of genetic information for improved
medical treatments

• Log processing and fraud detection in telecommunications
5 © 2012 LucidWorks

In Focus: Personalized Medicine

Alignment
and other Genetic
analysis Variations

Patient DNA

Standard Therapies

Alternative Therapies

Search and Faceting


In Focus: Log Processing in Telecommunications

• Each year, large sums of money are lost due to
fraudulent calls and poor service

• Logs are usually semi-structured and contain vital
information about errors and fraud

• Deeper batch analytics can provide insight into
patterns across vast amounts of data

• Search of call and network information (via logs) is
critical to providing deeper analysis and understanding
of these errors and fraudulent activities


Computation and Storage

LucidWorks
Hadoop HBase
Search/Solr
• Document Index • Stores Logs, • Metric Storage
Raw files,
• Faceting intermediate • User
files, etc. Histories/Profile
• WebHDFS
• SolrCloud
makes sharding • Document
easy • Small files are Storage
an unnatural act

Challenges
• Who is the authoritative store?
• Real time vs. Batch
• Where should analysis be done?
© 2012 LucidWorks

Search In Practice

• Three primary concerns
• Performance/Scaling

• Relevance

• Operations: monitoring, failover, etc.

• Business typically cares more about relevance

• Devs care more about performance at first…

© 2012 LucidWorks

Search: Relevance

• Always Be Testing
• Experiment management is critical
• Top X + sampling
• Click Logs
• Track Everything!
• Queries
• Clicks
• Displayed Documents
• Mouse/Scroll tracking?
• Phrases are your friends
© 2012 LucidWorks

Discovery Components

Serendipity Organization Data Quality

• Trends • Importance • Document factor
• Topics • Clustering Distributions
• Recommendations • Classification • Length
• Related Items • Named Entities • Boosts
• More Like This • Time Factors • Duplicates
• Did you mean? • Faceting
• Stat. Interesting
Phrases

Challenges
• Many of these are intense calculations or iterative
• Many are subjective and require a lot of experimentation

© 2012 LucidWorks

Discovery with Mahout

• Mahout’s 3 “C”s provide tools for helping across many
aspects of discovery
• Collaborative Filtering
• Classification
• Clustering
• Also:
• Collocations (Statistically Interesting Phrases)
• Singular Value Decomposition (SVD)
• Others
• Challenges:
• High cost to iterative machine learning algorithms
• Mahout is very command line oriented
• Some areas less mature
© 2012 LucidWorks

Aside: Experiment Management

• Plan for running experiments from the beginning
across Search and Discovery components
• Your engine should help!
• Types of Experiments to consider
• Indexing/Analysis
• Query parsing
• Scoring formulas
• Machine Learning Models
• Recommendations, many more
• Make it easy to do A/B testing across all experiments
and compare and contrast the results

© 2012 LucidWorks

Analytics in Practice

• Many of the components discussed provide analytical
features
• Leverage existing tools: R, etc.
• Simple Counts:
• Facets
• Term and Document frequencies
• Clicks
• Search and Discovery example metrics
• Relevance measures like Mean Reciprocal Rank
• Histograms/Drilldowns around Number of Results
• Log and navigation analysis

• Data cleanliness analysis is helpful for finding potential
issues in content

© 2012 LucidWorks

Wrap

• Search, Discovery and Analytics, when combined into
a single, coherent system provides powerful insight
into both your content and your users

• LucidWorks has combined many of these things into
LucidWorks Big Data
• http://www.lucidworks.com/products/lucidworks-big-data

• Design for the big picture when building search-based
applications

© 2012 LucidWorks

Discussion and Resources

• Questions?

• http://www.lucidworks.com

• grant@lucidworks.com
• @gsingers


Large Scale Search, Discovery and Analytics in Action

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Large Scale Search, Discovery and Analytics in Action

Similar to Large Scale Search, Discovery and Analytics in Action (20)

More from Grant Ingersoll

More from Grant Ingersoll (20)

Recently uploaded

Recently uploaded (20)

Large Scale Search, Discovery and Analytics in Action

Editor's Notes