More Related Content Similar to Large Scale Search, Discovery and Analytics in Action (20) More from Grant Ingersoll (20) Large Scale Search, Discovery and Analytics in Action1. Large Scale Search, Discovery
and Analysis in Action
Grant Ingersoll
Chief Scientist
September 18, 2012
Confidential © Copyright 2012
2. Search is Dead, Long Live Search
• Good keyword search is a
commodity and easy to get
up and running Documents
• The Bar is Raised
• Relevance is (always will
be?) hard
Content User
• Holistic view of the data Relationships Interaction
AND the users is critical
• Search, Discovery and
Analytics are the key to
unlocking this view of users Access
and data
Confidential and Proprietary
© 2012 LucidWorks
3. Topics
• Background and needs
• Architecture
• Road Ahead
• SDA In Action
• Components
• Challenges and Lessons Learned
• Wrap Up
Confidential and Proprietary
© 2012 LucidWorks
5. Sample Use Cases
• Claims processing and analysis, including fraud analysis
• Large scale content acquisition and access for:
• Defense, intelligence and pharmaceutical applications
• Views of data surrounding natural disasters and other tragedies
for research, archiving and therapeutic purposes
• Analysis of Website and social media interactions
• Access and processing of genetic information for improved
medical treatments
• Log processing and fraud detection in telecommunications
Confidential and Proprietary
5 © 2012 LucidWorks
6. In Focus: Personalized Medicine
Alignment
and other Genetic
analysis Variations
Patient DNA
Standard Therapies
Alternative Therapies
Search and Faceting
Confidential and Proprietary
6 © 2012 LucidWorks
7. In Focus: Log Processing in Telecommunications
• Each year, large sums of money are lost due to
fraudulent calls and poor service
• Logs are usually semi-structured and contain vital
information about errors and fraud
• Deeper batch analytics can provide insight into
patterns across vast amounts of data
• Search of call and network information (via logs) is
critical to providing deeper analysis and understanding
of these errors and fraudulent activities
Confidential and Proprietary
7 © 2012 LucidWorks
13. Computation and Storage
LucidWorks
Hadoop HBase
Search/Solr
• Document Index • Stores Logs, • Metric Storage
Raw files,
• Faceting intermediate • User
files, etc. Histories/Profile
• WebHDFS
• SolrCloud
makes sharding • Document
easy • Small files are Storage
an unnatural act
Challenges
• Who is the authoritative store?
• Real time vs. Batch
• Where should analysis be done?
Confidential and Proprietary
© 2012 LucidWorks
14. Search In Practice
• Three primary concerns
• Performance/Scaling
• Relevance
• Operations: monitoring, failover, etc.
• Business typically cares more about relevance
• Devs care more about performance at first…
Confidential and Proprietary
© 2012 LucidWorks
15. Search: Relevance
• Always Be Testing
• Experiment management is critical
• Top X + sampling
• Click Logs
• Track Everything!
• Queries
• Clicks
• Displayed Documents
• Mouse/Scroll tracking?
• Phrases are your friends
Confidential and Proprietary
© 2012 LucidWorks
16. Discovery Components
Serendipity Organization Data Quality
• Trends • Importance • Document factor
• Topics • Clustering Distributions
• Recommendations • Classification • Length
• Related Items • Named Entities • Boosts
• More Like This • Time Factors • Duplicates
• Did you mean? • Faceting
• Stat. Interesting
Phrases
Challenges
• Many of these are intense calculations or iterative
• Many are subjective and require a lot of experimentation
Confidential and Proprietary
© 2012 LucidWorks
17. Discovery with Mahout
• Mahout’s 3 “C”s provide tools for helping across many
aspects of discovery
• Collaborative Filtering
• Classification
• Clustering
• Also:
• Collocations (Statistically Interesting Phrases)
• Singular Value Decomposition (SVD)
• Others
• Challenges:
• High cost to iterative machine learning algorithms
• Mahout is very command line oriented
• Some areas less mature
Confidential and Proprietary
© 2012 LucidWorks
18. Aside: Experiment Management
• Plan for running experiments from the beginning
across Search and Discovery components
• Your engine should help!
• Types of Experiments to consider
• Indexing/Analysis
• Query parsing
• Scoring formulas
• Machine Learning Models
• Recommendations, many more
• Make it easy to do A/B testing across all experiments
and compare and contrast the results
Confidential and Proprietary
© 2012 LucidWorks
19. Analytics in Practice
• Many of the components discussed provide analytical
features
• Leverage existing tools: R, etc.
• Simple Counts:
• Facets
• Term and Document frequencies
• Clicks
• Search and Discovery example metrics
• Relevance measures like Mean Reciprocal Rank
• Histograms/Drilldowns around Number of Results
• Log and navigation analysis
• Data cleanliness analysis is helpful for finding potential
issues in content
Confidential and Proprietary
© 2012 LucidWorks
20. Wrap
• Search, Discovery and Analytics, when combined into
a single, coherent system provides powerful insight
into both your content and your users
• LucidWorks has combined many of these things into
LucidWorks Big Data
• http://www.lucidworks.com/products/lucidworks-big-data
• Design for the big picture when building search-based
applications
Confidential and Proprietary
© 2012 LucidWorks
21. Discussion and Resources
• Questions?
• http://www.lucidworks.com
• grant@lucidworks.com
• @gsingers
Confidential and Proprietary
21 © 2012 LucidWorks
Editor's Notes The bar is raised: when we first started Lucid, the problems were all around standing up Lucene or Solr or dealing with performance issues, now the large majority of them are around taking search to the next level: better relevance, personalization, recommendations, etc., i.e. how to have better relevance How do you gain insight?The Search boxis the UI for data these daysFeedback improvements into system for usersExtract key metrics for business understanding Make into images? All about ad hoc and bulk storage and computationAll about the analytics that drive your computationGlue to make it all work together – data where it needs to be when it needs to be thereAll are examples of ways to do this. There are actually a fair number of viable alternatives for all of these pieces, all in open sourceI tend to stick to Apache and “commercial” friendly licenses, where possible Analytics:Discovery:– Recommendations, trends, related searches Authoritative store: managing across, consistency, etc.Analysis should be done where it most makes sense given the location of the data and the type of analysis being doneHadoop and HBase stuff are all pretty straightforward Log and navigation: clicks, search trails, etc.Data cleanliness: Never viewed docs that are related to other documents Big Picture: too often devs are stuck in the weeds