SlideShare a Scribd company logo
1 of 21
Large Scale Search, Discovery
                                and Analysis in Action

                                Grant Ingersoll
                                Chief Scientist

                                September 18, 2012




Confidential © Copyright 2012
Search is Dead, Long Live Search

• Good keyword search is a
 commodity and easy to get
 up and running                                 Documents



• The Bar is Raised
 • Relevance is (always will
    be?) hard
                                  Content                      User
• Holistic view of the data     Relationships               Interaction

 AND the users is critical

• Search, Discovery and
 Analytics are the key to
 unlocking this view of users                    Access
 and data


 Confidential and Proprietary
 © 2012 LucidWorks
Topics

• Background and needs

• Architecture
• Road Ahead

• SDA In Action
   • Components
   • Challenges and Lessons Learned

• Wrap Up
Confidential and Proprietary
© 2012 LucidWorks
Confidential © Copyright 2012
Sample Use Cases

    • Claims processing and analysis, including fraud analysis

    • Large scale content acquisition and access for:
       • Defense, intelligence and pharmaceutical applications
       • Views of data surrounding natural disasters and other tragedies
          for research, archiving and therapeutic purposes

    • Analysis of Website and social media interactions

    • Access and processing of genetic information for improved
       medical treatments

    • Log processing and fraud detection in telecommunications
    Confidential and Proprietary
5   © 2012 LucidWorks
In Focus: Personalized Medicine


                                              Alignment
                                              and other     Genetic
                                               analysis    Variations


    Patient DNA


                                                          Standard Therapies




                                                          Alternative Therapies

                                   Search and Faceting

    Confidential and Proprietary
6   © 2012 LucidWorks
In Focus: Log Processing in Telecommunications


    • Each year, large sums of money are lost due to
       fraudulent calls and poor service

    • Logs are usually semi-structured and contain vital
       information about errors and fraud

    • Deeper batch analytics can provide insight into
       patterns across vast amounts of data

    • Search of call and network information (via logs) is
       critical to providing deeper analysis and understanding
       of these errors and fraudulent activities

    Confidential and Proprietary
7   © 2012 LucidWorks
Confidential © Copyright 2012
Confidential © Copyright 2012
Confidential © Copyright 2012
Confidential © Copyright 2012
Confidential © Copyright 2012
Computation and Storage

       LucidWorks
                                     Hadoop                HBase
       Search/Solr
• Document Index                • Stores Logs,       • Metric Storage
                                  Raw files,
• Faceting                        intermediate       • User
                                  files, etc.          Histories/Profile
                                • WebHDFS
• SolrCloud
  makes sharding                                     • Document
  easy                          • Small files are      Storage
                                  an unnatural act

Challenges
    • Who is the authoritative store?
    • Real time vs. Batch
    • Where should analysis be done?
 Confidential and Proprietary
 © 2012 LucidWorks
Search In Practice

• Three primary concerns
   • Performance/Scaling

   • Relevance

   • Operations: monitoring, failover, etc.

• Business typically cares more about relevance

• Devs care more about performance at first…

Confidential and Proprietary
© 2012 LucidWorks
Search: Relevance

• Always Be Testing
  • Experiment management is critical
  • Top X + sampling
  • Click Logs
• Track Everything!
   • Queries
   • Clicks
   • Displayed Documents
   • Mouse/Scroll tracking?
• Phrases are your friends
Confidential and Proprietary
© 2012 LucidWorks
Discovery Components


          Serendipity              Organization         Data Quality

•   Trends                     • Importance         • Document factor
•   Topics                     • Clustering           Distributions
•   Recommendations            • Classification       • Length
•   Related Items                • Named Entities     • Boosts
•   More Like This             • Time Factors       • Duplicates
•   Did you mean?              • Faceting
•   Stat. Interesting
    Phrases

Challenges
• Many of these are intense calculations or iterative
• Many are subjective and require a lot of experimentation

Confidential and Proprietary
© 2012 LucidWorks
Discovery with Mahout

• Mahout’s 3 “C”s provide tools for helping across many
   aspects of discovery
   • Collaborative Filtering
   • Classification
   • Clustering
• Also:
   • Collocations (Statistically Interesting Phrases)
   • Singular Value Decomposition (SVD)
   • Others
• Challenges:
   • High cost to iterative machine learning algorithms
   • Mahout is very command line oriented
   • Some areas less mature
Confidential and Proprietary
© 2012 LucidWorks
Aside: Experiment Management

• Plan for running experiments from the beginning
   across Search and Discovery components
   • Your engine should help!
• Types of Experiments to consider
   • Indexing/Analysis
   • Query parsing
   • Scoring formulas
   • Machine Learning Models
   • Recommendations, many more
• Make it easy to do A/B testing across all experiments
   and compare and contrast the results



Confidential and Proprietary
© 2012 LucidWorks
Analytics in Practice

• Many of the components discussed provide analytical
  features
  • Leverage existing tools: R, etc.
• Simple Counts:
  • Facets
  • Term and Document frequencies
  • Clicks
• Search and Discovery example metrics
  • Relevance measures like Mean Reciprocal Rank
  • Histograms/Drilldowns around Number of Results
  • Log and navigation analysis

• Data cleanliness analysis is helpful for finding potential
   issues in content

Confidential and Proprietary
© 2012 LucidWorks
Wrap

• Search, Discovery and Analytics, when combined into
   a single, coherent system provides powerful insight
   into both your content and your users

• LucidWorks has combined many of these things into
   LucidWorks Big Data
   • http://www.lucidworks.com/products/lucidworks-big-data

• Design for the big picture when building search-based
   applications



Confidential and Proprietary
© 2012 LucidWorks
Discussion and Resources

     • Questions?


     • http://www.lucidworks.com

     • grant@lucidworks.com
     • @gsingers




     Confidential and Proprietary
21   © 2012 LucidWorks

More Related Content

What's hot

Managing Your Research Data for Maximum Impact -Rob Daley 300616_Shared
Managing Your Research Data for Maximum Impact -Rob Daley 300616_SharedManaging Your Research Data for Maximum Impact -Rob Daley 300616_Shared
Managing Your Research Data for Maximum Impact -Rob Daley 300616_Shared
Rob Daley
 
Changing the Curation Equation: A Data Lifecycle Approach to Lowering Costs a...
Changing the Curation Equation: A Data Lifecycle Approach to Lowering Costs a...Changing the Curation Equation: A Data Lifecycle Approach to Lowering Costs a...
Changing the Curation Equation: A Data Lifecycle Approach to Lowering Costs a...
SEAD
 

What's hot (20)

AI-SDV 2020: Biomax
AI-SDV 2020: BiomaxAI-SDV 2020: Biomax
AI-SDV 2020: Biomax
 
Writing successful data management plans
Writing successful data management plansWriting successful data management plans
Writing successful data management plans
 
Southampton Web Science DTC - Innovations in web publishing and services for ...
Southampton Web Science DTC - Innovations in web publishing and services for ...Southampton Web Science DTC - Innovations in web publishing and services for ...
Southampton Web Science DTC - Innovations in web publishing and services for ...
 
Planning for Research Data Managment
Planning for Research Data ManagmentPlanning for Research Data Managment
Planning for Research Data Managment
 
Practical and Conceptual Considerations of Research Object Preservation
Practical and Conceptual Considerations of Research Object PreservationPractical and Conceptual Considerations of Research Object Preservation
Practical and Conceptual Considerations of Research Object Preservation
 
OU Library Research Support webinar: Working with research data
OU Library Research Support webinar: Working with research dataOU Library Research Support webinar: Working with research data
OU Library Research Support webinar: Working with research data
 
Managing Your Research Data for Maximum Impact -Rob Daley 300616_Shared
Managing Your Research Data for Maximum Impact -Rob Daley 300616_SharedManaging Your Research Data for Maximum Impact -Rob Daley 300616_Shared
Managing Your Research Data for Maximum Impact -Rob Daley 300616_Shared
 
10 commandments in rdm funder compliancy
10 commandments in rdm funder compliancy10 commandments in rdm funder compliancy
10 commandments in rdm funder compliancy
 
Research Data Management Plan: How to Write One - 2017-02-01 - University of ...
Research Data Management Plan: How to Write One - 2017-02-01 - University of ...Research Data Management Plan: How to Write One - 2017-02-01 - University of ...
Research Data Management Plan: How to Write One - 2017-02-01 - University of ...
 
Introduction to Research Data Management - 2017-02-15 - MPLS Division, Univer...
Introduction to Research Data Management - 2017-02-15 - MPLS Division, Univer...Introduction to Research Data Management - 2017-02-15 - MPLS Division, Univer...
Introduction to Research Data Management - 2017-02-15 - MPLS Division, Univer...
 
From Ashley madison to_panama_papers - Office 365 bringing sexy back to eDisc...
From Ashley madison to_panama_papers - Office 365 bringing sexy back to eDisc...From Ashley madison to_panama_papers - Office 365 bringing sexy back to eDisc...
From Ashley madison to_panama_papers - Office 365 bringing sexy back to eDisc...
 
Joy davidson-rdm-support-ual
Joy davidson-rdm-support-ualJoy davidson-rdm-support-ual
Joy davidson-rdm-support-ual
 
Open Access & sharing research data: a Dutch workshop for phd in economics
Open Access & sharing research data: a Dutch workshop for phd in economicsOpen Access & sharing research data: a Dutch workshop for phd in economics
Open Access & sharing research data: a Dutch workshop for phd in economics
 
HL7 FHIR FoundationTopics for Non-Developers
HL7 FHIR FoundationTopics for Non-DevelopersHL7 FHIR FoundationTopics for Non-Developers
HL7 FHIR FoundationTopics for Non-Developers
 
Changing the Curation Equation: A Data Lifecycle Approach to Lowering Costs a...
Changing the Curation Equation: A Data Lifecycle Approach to Lowering Costs a...Changing the Curation Equation: A Data Lifecycle Approach to Lowering Costs a...
Changing the Curation Equation: A Data Lifecycle Approach to Lowering Costs a...
 
Preparing Your Research Material for the Future - 2017-02-22 - Humanities Div...
Preparing Your Research Material for the Future - 2017-02-22 - Humanities Div...Preparing Your Research Material for the Future - 2017-02-22 - Humanities Div...
Preparing Your Research Material for the Future - 2017-02-22 - Humanities Div...
 
Introduction to Data Management Planning
Introduction to Data Management PlanningIntroduction to Data Management Planning
Introduction to Data Management Planning
 
Hadoop in Public Sector
Hadoop in Public SectorHadoop in Public Sector
Hadoop in Public Sector
 
RDAP 16 Poster: Expanding Research Data Services with Deep Blue Data
RDAP 16 Poster: Expanding Research Data Services with Deep Blue DataRDAP 16 Poster: Expanding Research Data Services with Deep Blue Data
RDAP 16 Poster: Expanding Research Data Services with Deep Blue Data
 
Information Models & FHIR --- It’s all about content!
Information Models & FHIR --- It’s all about content!Information Models & FHIR --- It’s all about content!
Information Models & FHIR --- It’s all about content!
 

Similar to Large Scale Search, Discovery and Analytics in Action

Secondary data analysis with digital trace data
Secondary data analysis with digital trace dataSecondary data analysis with digital trace data
Secondary data analysis with digital trace data
Andrea Wiggins
 

Similar to Large Scale Search, Discovery and Analytics in Action (20)

Crowd Sourced Reflected Intelligence for Solr and Hadoop
Crowd Sourced Reflected Intelligence for Solr and HadoopCrowd Sourced Reflected Intelligence for Solr and Hadoop
Crowd Sourced Reflected Intelligence for Solr and Hadoop
 
DATAWEEK KEYNOTE: LARGE SCALE SEARCH, DISCOVERY AND ANALYSIS IN ACTION
DATAWEEK KEYNOTE: LARGE SCALE SEARCH, DISCOVERY AND ANALYSIS IN ACTIONDATAWEEK KEYNOTE: LARGE SCALE SEARCH, DISCOVERY AND ANALYSIS IN ACTION
DATAWEEK KEYNOTE: LARGE SCALE SEARCH, DISCOVERY AND ANALYSIS IN ACTION
 
Intro to Search
Intro to SearchIntro to Search
Intro to Search
 
Managing Your Research Data
Managing Your Research DataManaging Your Research Data
Managing Your Research Data
 
Research Data Mangagement Essentials, 5th July 2017
Research Data Mangagement Essentials, 5th July 2017Research Data Mangagement Essentials, 5th July 2017
Research Data Mangagement Essentials, 5th July 2017
 
Building a data driven search application with LucidWorks SiLK
Building a data driven search application with LucidWorks SiLKBuilding a data driven search application with LucidWorks SiLK
Building a data driven search application with LucidWorks SiLK
 
Curlew Research Brussels 2014 Electronic Data & Knowledge Management
Curlew Research Brussels 2014 Electronic Data & Knowledge ManagementCurlew Research Brussels 2014 Electronic Data & Knowledge Management
Curlew Research Brussels 2014 Electronic Data & Knowledge Management
 
Comprehensive Security for the Enterprise IV: Visibility Through a Single End...
Comprehensive Security for the Enterprise IV: Visibility Through a Single End...Comprehensive Security for the Enterprise IV: Visibility Through a Single End...
Comprehensive Security for the Enterprise IV: Visibility Through a Single End...
 
When to Consider Semantic Technology for Your Enterprise
When to Consider Semantic Technology for Your EnterpriseWhen to Consider Semantic Technology for Your Enterprise
When to Consider Semantic Technology for Your Enterprise
 
When to Consider Semantic Technology for Your Enterprise
When to Consider Semantic Technology for Your Enterprise When to Consider Semantic Technology for Your Enterprise
When to Consider Semantic Technology for Your Enterprise
 
Digital Preservation - Manage and Provide Access
Digital Preservation - Manage and Provide AccessDigital Preservation - Manage and Provide Access
Digital Preservation - Manage and Provide Access
 
Optimized Data Management with Cloudera 5.7: Understanding data value with Cl...
Optimized Data Management with Cloudera 5.7: Understanding data value with Cl...Optimized Data Management with Cloudera 5.7: Understanding data value with Cl...
Optimized Data Management with Cloudera 5.7: Understanding data value with Cl...
 
Data anywhere anytime
Data anywhere anytimeData anywhere anytime
Data anywhere anytime
 
Chapter 6
Chapter 6Chapter 6
Chapter 6
 
Enterprise ready: a look at Neo4j in production
Enterprise ready: a look at Neo4j in productionEnterprise ready: a look at Neo4j in production
Enterprise ready: a look at Neo4j in production
 
Incentivising the uptake of reusable metadata in the survey production process
Incentivising the uptake of reusable metadata in the survey production processIncentivising the uptake of reusable metadata in the survey production process
Incentivising the uptake of reusable metadata in the survey production process
 
Knowledge Management Best Practices within Service Management
Knowledge Management Best Practices within Service ManagementKnowledge Management Best Practices within Service Management
Knowledge Management Best Practices within Service Management
 
Secondary data analysis with digital trace data
Secondary data analysis with digital trace dataSecondary data analysis with digital trace data
Secondary data analysis with digital trace data
 
10 Differences Between eDiscovery & Information Governance
10 Differences Between eDiscovery & Information Governance10 Differences Between eDiscovery & Information Governance
10 Differences Between eDiscovery & Information Governance
 
Big Data Use Cases
Big Data Use CasesBig Data Use Cases
Big Data Use Cases
 

More from Grant Ingersoll

Large Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and SolrLarge Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
Grant Ingersoll
 
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and SolrLarge Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
Grant Ingersoll
 
Bet you didn't know Lucene can...
Bet you didn't know Lucene can...Bet you didn't know Lucene can...
Bet you didn't know Lucene can...
Grant Ingersoll
 
Apache Mahout: Driving the Yellow Elephant
Apache Mahout: Driving the Yellow ElephantApache Mahout: Driving the Yellow Elephant
Apache Mahout: Driving the Yellow Elephant
Grant Ingersoll
 

More from Grant Ingersoll (20)

Solr for Data Science
Solr for Data ScienceSolr for Data Science
Solr for Data Science
 
This Ain't Your Parent's Search Engine
This Ain't Your Parent's Search EngineThis Ain't Your Parent's Search Engine
This Ain't Your Parent's Search Engine
 
Data IO: Next Generation Search with Lucene and Solr 4
Data IO: Next Generation Search with Lucene and Solr 4Data IO: Next Generation Search with Lucene and Solr 4
Data IO: Next Generation Search with Lucene and Solr 4
 
Open Source Search FTW
Open Source Search FTWOpen Source Search FTW
Open Source Search FTW
 
What's new in Lucene and Solr 4.x
What's new in Lucene and Solr 4.xWhat's new in Lucene and Solr 4.x
What's new in Lucene and Solr 4.x
 
Taming Text
Taming TextTaming Text
Taming Text
 
Leveraging Solr and Mahout
Leveraging Solr and MahoutLeveraging Solr and Mahout
Leveraging Solr and Mahout
 
Scalable Machine Learning with Hadoop
Scalable Machine Learning with HadoopScalable Machine Learning with Hadoop
Scalable Machine Learning with Hadoop
 
Apache Lucene 4
Apache Lucene 4Apache Lucene 4
Apache Lucene 4
 
OpenSearchLab and the Lucene Ecosystem
OpenSearchLab and the Lucene EcosystemOpenSearchLab and the Lucene Ecosystem
OpenSearchLab and the Lucene Ecosystem
 
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and SolrLarge Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
 
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and SolrLarge Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
 
Bet you didn't know Lucene can...
Bet you didn't know Lucene can...Bet you didn't know Lucene can...
Bet you didn't know Lucene can...
 
Starfish: A Self-tuning System for Big Data Analytics
Starfish: A Self-tuning System for Big Data AnalyticsStarfish: A Self-tuning System for Big Data Analytics
Starfish: A Self-tuning System for Big Data Analytics
 
Intro to Mahout -- DC Hadoop
Intro to Mahout -- DC HadoopIntro to Mahout -- DC Hadoop
Intro to Mahout -- DC Hadoop
 
Intro to Apache Lucene and Solr
Intro to Apache Lucene and SolrIntro to Apache Lucene and Solr
Intro to Apache Lucene and Solr
 
Apache Mahout: Driving the Yellow Elephant
Apache Mahout: Driving the Yellow ElephantApache Mahout: Driving the Yellow Elephant
Apache Mahout: Driving the Yellow Elephant
 
Intelligent Apps with Apache Lucene, Mahout and Friends
Intelligent Apps with Apache Lucene, Mahout and FriendsIntelligent Apps with Apache Lucene, Mahout and Friends
Intelligent Apps with Apache Lucene, Mahout and Friends
 
TriHUG: Lucene Solr Hadoop
TriHUG: Lucene Solr HadoopTriHUG: Lucene Solr Hadoop
TriHUG: Lucene Solr Hadoop
 
Intro to Apache Mahout
Intro to Apache MahoutIntro to Apache Mahout
Intro to Apache Mahout
 

Recently uploaded

Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 

Recently uploaded (20)

The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 

Large Scale Search, Discovery and Analytics in Action

  • 1. Large Scale Search, Discovery and Analysis in Action Grant Ingersoll Chief Scientist September 18, 2012 Confidential © Copyright 2012
  • 2. Search is Dead, Long Live Search • Good keyword search is a commodity and easy to get up and running Documents • The Bar is Raised • Relevance is (always will be?) hard Content User • Holistic view of the data Relationships Interaction AND the users is critical • Search, Discovery and Analytics are the key to unlocking this view of users Access and data Confidential and Proprietary © 2012 LucidWorks
  • 3. Topics • Background and needs • Architecture • Road Ahead • SDA In Action • Components • Challenges and Lessons Learned • Wrap Up Confidential and Proprietary © 2012 LucidWorks
  • 5. Sample Use Cases • Claims processing and analysis, including fraud analysis • Large scale content acquisition and access for: • Defense, intelligence and pharmaceutical applications • Views of data surrounding natural disasters and other tragedies for research, archiving and therapeutic purposes • Analysis of Website and social media interactions • Access and processing of genetic information for improved medical treatments • Log processing and fraud detection in telecommunications Confidential and Proprietary 5 © 2012 LucidWorks
  • 6. In Focus: Personalized Medicine Alignment and other Genetic analysis Variations Patient DNA Standard Therapies Alternative Therapies Search and Faceting Confidential and Proprietary 6 © 2012 LucidWorks
  • 7. In Focus: Log Processing in Telecommunications • Each year, large sums of money are lost due to fraudulent calls and poor service • Logs are usually semi-structured and contain vital information about errors and fraud • Deeper batch analytics can provide insight into patterns across vast amounts of data • Search of call and network information (via logs) is critical to providing deeper analysis and understanding of these errors and fraudulent activities Confidential and Proprietary 7 © 2012 LucidWorks
  • 13. Computation and Storage LucidWorks Hadoop HBase Search/Solr • Document Index • Stores Logs, • Metric Storage Raw files, • Faceting intermediate • User files, etc. Histories/Profile • WebHDFS • SolrCloud makes sharding • Document easy • Small files are Storage an unnatural act Challenges • Who is the authoritative store? • Real time vs. Batch • Where should analysis be done? Confidential and Proprietary © 2012 LucidWorks
  • 14. Search In Practice • Three primary concerns • Performance/Scaling • Relevance • Operations: monitoring, failover, etc. • Business typically cares more about relevance • Devs care more about performance at first… Confidential and Proprietary © 2012 LucidWorks
  • 15. Search: Relevance • Always Be Testing • Experiment management is critical • Top X + sampling • Click Logs • Track Everything! • Queries • Clicks • Displayed Documents • Mouse/Scroll tracking? • Phrases are your friends Confidential and Proprietary © 2012 LucidWorks
  • 16. Discovery Components Serendipity Organization Data Quality • Trends • Importance • Document factor • Topics • Clustering Distributions • Recommendations • Classification • Length • Related Items • Named Entities • Boosts • More Like This • Time Factors • Duplicates • Did you mean? • Faceting • Stat. Interesting Phrases Challenges • Many of these are intense calculations or iterative • Many are subjective and require a lot of experimentation Confidential and Proprietary © 2012 LucidWorks
  • 17. Discovery with Mahout • Mahout’s 3 “C”s provide tools for helping across many aspects of discovery • Collaborative Filtering • Classification • Clustering • Also: • Collocations (Statistically Interesting Phrases) • Singular Value Decomposition (SVD) • Others • Challenges: • High cost to iterative machine learning algorithms • Mahout is very command line oriented • Some areas less mature Confidential and Proprietary © 2012 LucidWorks
  • 18. Aside: Experiment Management • Plan for running experiments from the beginning across Search and Discovery components • Your engine should help! • Types of Experiments to consider • Indexing/Analysis • Query parsing • Scoring formulas • Machine Learning Models • Recommendations, many more • Make it easy to do A/B testing across all experiments and compare and contrast the results Confidential and Proprietary © 2012 LucidWorks
  • 19. Analytics in Practice • Many of the components discussed provide analytical features • Leverage existing tools: R, etc. • Simple Counts: • Facets • Term and Document frequencies • Clicks • Search and Discovery example metrics • Relevance measures like Mean Reciprocal Rank • Histograms/Drilldowns around Number of Results • Log and navigation analysis • Data cleanliness analysis is helpful for finding potential issues in content Confidential and Proprietary © 2012 LucidWorks
  • 20. Wrap • Search, Discovery and Analytics, when combined into a single, coherent system provides powerful insight into both your content and your users • LucidWorks has combined many of these things into LucidWorks Big Data • http://www.lucidworks.com/products/lucidworks-big-data • Design for the big picture when building search-based applications Confidential and Proprietary © 2012 LucidWorks
  • 21. Discussion and Resources • Questions? • http://www.lucidworks.com • grant@lucidworks.com • @gsingers Confidential and Proprietary 21 © 2012 LucidWorks

Editor's Notes

  1. The bar is raised: when we first started Lucid, the problems were all around standing up Lucene or Solr or dealing with performance issues, now the large majority of them are around taking search to the next level: better relevance, personalization, recommendations, etc., i.e. how to have better relevance
  2. How do you gain insight?The Search boxis the UI for data these daysFeedback improvements into system for usersExtract key metrics for business understanding
  3. Make into images?
  4. All about ad hoc and bulk storage and computationAll about the analytics that drive your computationGlue to make it all work together – data where it needs to be when it needs to be thereAll are examples of ways to do this. There are actually a fair number of viable alternatives for all of these pieces, all in open sourceI tend to stick to Apache and “commercial” friendly licenses, where possible
  5. Analytics:Discovery:– Recommendations, trends, related searches
  6. Authoritative store: managing across, consistency, etc.Analysis should be done where it most makes sense given the location of the data and the type of analysis being doneHadoop and HBase stuff are all pretty straightforward
  7. Log and navigation: clicks, search trails, etc.Data cleanliness: Never viewed docs that are related to other documents
  8. Big Picture: too often devs are stuck in the weeds