SlideShare a Scribd company logo
1 of 12
Leveraging Solr and Mahout for Next
                                Gen Data Access and Insight

                                Grant Ingersoll
                                Chief Scientist




Confidential © Copyright 2012
Search is Dead, Long Live Search

• Modern Data Challenges are multi-structured

• Search is a system building block                      Content

    - Text is only a part of the story


• If the algorithms fit,
                                           Content
             use them!                   Relationships
                                                                   Users




• Embrace fuzziness!
                                                         Access

• Scoring features are everywhere

Confidential and Proprietary
© 2012 LucidWorks
Topics

    • Intros

    • Search (R)Evolution

    • Apache Solr
    • Apache Mahout

    • Search and Machine Learning

    • Scaling


    Confidential and Proprietary
3   © 2012 LucidWorks
Grant’s Background

• Co-founder:
    - LucidWorks – Chief Scientist
    - Apache Mahout
• Long time Lucene/Solr committer
• Author: Taming Text
    - www.manning.com/ingersoll
• Background in IR and NLP
    - Built CLIR, QA and a variety of other search-based apps




Confidential and Proprietary
© 2012 LucidWorks
Search (R)evolution

• Search use leads to search abuse
    - Denormalization frees your mind
    - Scoring is just a sparse matrix multiply

• Lucene/Solr evolution
    -   Non-free text usages abound
    -   Many DB-like features
    -   NoSQL before NoSQL was cool
    -   Flexible indexing
    -   Finite State Transducers FTW!

• Scale

• “This ain’t your father’s relevance anymore”

Confidential and Proprietary
© 2012 LucidWorks
Apache Solr?

• “Solr is an open source enterprise search server based
  on the Lucene Java search library, with XML/HTTP and
  JSON APIs, hit highlighting, faceted
  search, caching, replication, a web administration
  interface and many more features. It runs in a Java
  servlet container such as Tomcat. “
    - http://lucene.apache.org/solr


• Did I mention free?




Confidential and Proprietary
© 2012 LucidWorks
Apache Mahout

• Goal: create library of scalable machine learning
  algorithms

• Mahout’s 3 “C”s provide tools for helping across many
  aspects of discovery
    - Collaborative Filtering
    - Classification
    - Clustering
• Also:
    - Collocations (Statistically Interesting Phrases)
    - SVD
    - Java math, primitives libraries and more

Confidential and Proprietary
© 2012 LucidWorks
Search + Machine Learning

• Search-driven applications present multiple
  opportunities for leveraging machine learning
    - Clustering – Enhance Discovery, outlier detection
    - Classification – Queries, Documents, Users
    - Content Recommendation – Collab. Filtering and
      personalization
    - NLP – phrases, named entities, co-reference, much more


• Many of these can also power faceted navigation

• Aside: Search can also often be used effectively to
  implement many machine learning algorithms

Confidential and Proprietary
© 2012 LucidWorks
How and When
                                                    Access APIs
                                                                    •View into
                                Search View             Analytic     numeric/hist     Personalization &
                                                                     oric data
                 1                                      Services                      Machine Learning
                      2                                                                   Services
              Shards       3                 N
                                                                                             •Classification
                                                                                             •Recommendation

                                                                         •Documents      Classification
                  Discovery &                            Document
                                                           Store         •Users             Models
                  Enrichment                                             •Logs
                     Clustering, classific                                               In memory
                     ation, NLP, topic                                                   Replicated
                     identification, searc                                               Multi-tenant
                     h log analysis, user
                     behavior
                                                 Content Acquisition
                                                    ETL, batch or near
                                                    real-time



                   Data
         • LucidWorks Search
           connectors
         • Push


Confidential and Proprietary
© 2012 LucidWorks
Scaling

• Search
    - Solr Cloud = Large scale, distributed search and faceting
          » http://wiki.apache.org/solr/SolrCloud


• Machine Learning
    - Mahout is built on Hadoop for most things
    - SGD is sequential and really fast


• Sometimes all you can do is make an educated guess
    - Storm, Kafka, etc. can help by allowing you to make estimates in
      near real time



Confidential and Proprietary
© 2012 LucidWorks
Wrap

• Search, Discovery and Analytics, when combined into
  a single, coherent system provides powerful insight into
  both your content and your users

• LucidWorks has combined many of these things into
  LucidWorks Big Data
    - http://www.lucidworks.com/products/lucidworks-big-data

• Design for the big picture when building search-based
  applications



Confidential and Proprietary
© 2012 LucidWorks
Resources

• LucidWorks
    - http://www.lucidworks.com
    - http://www.lucidworks.com/products/lucidworks-big-data
    - @LucidImagineer

• Me
    - grant@lucidworks.com
    - @gsingers


• Taming Text
    - http://www.manning.com/ingersoll
    - http://www.tamingtext.com
    - @tamingtext

Confidential and Proprietary
© 2012 LucidWorks

More Related Content

More from Caserta

Looker Data Modeling in the Age of Cloud - BDW Meetup May 2, 2017
Looker Data Modeling in the Age of Cloud - BDW Meetup May 2, 2017Looker Data Modeling in the Age of Cloud - BDW Meetup May 2, 2017
Looker Data Modeling in the Age of Cloud - BDW Meetup May 2, 2017Caserta
 
The Rise of the CDO in Today's Enterprise
The Rise of the CDO in Today's EnterpriseThe Rise of the CDO in Today's Enterprise
The Rise of the CDO in Today's EnterpriseCaserta
 
Building a New Platform for Customer Analytics
Building a New Platform for Customer Analytics Building a New Platform for Customer Analytics
Building a New Platform for Customer Analytics Caserta
 
Building New Data Ecosystem for Customer Analytics, Strata + Hadoop World, 2016
Building New Data Ecosystem for Customer Analytics, Strata + Hadoop World, 2016Building New Data Ecosystem for Customer Analytics, Strata + Hadoop World, 2016
Building New Data Ecosystem for Customer Analytics, Strata + Hadoop World, 2016Caserta
 
You're the New CDO, Now What?
You're the New CDO, Now What?You're the New CDO, Now What?
You're the New CDO, Now What?Caserta
 
The Data Lake - Balancing Data Governance and Innovation
The Data Lake - Balancing Data Governance and Innovation The Data Lake - Balancing Data Governance and Innovation
The Data Lake - Balancing Data Governance and Innovation Caserta
 
Making Big Data Easy for Everyone
Making Big Data Easy for EveryoneMaking Big Data Easy for Everyone
Making Big Data Easy for EveryoneCaserta
 
Benefits of the Azure Cloud
Benefits of the Azure CloudBenefits of the Azure Cloud
Benefits of the Azure CloudCaserta
 
Big Data Analytics on the Cloud
Big Data Analytics on the CloudBig Data Analytics on the Cloud
Big Data Analytics on the CloudCaserta
 
Intro to Data Science on Hadoop
Intro to Data Science on HadoopIntro to Data Science on Hadoop
Intro to Data Science on HadoopCaserta
 
The Emerging Role of the Data Lake
The Emerging Role of the Data LakeThe Emerging Role of the Data Lake
The Emerging Role of the Data LakeCaserta
 
Not Your Father's Database by Databricks
Not Your Father's Database by DatabricksNot Your Father's Database by Databricks
Not Your Father's Database by DatabricksCaserta
 
Mastering Customer Data on Apache Spark
Mastering Customer Data on Apache SparkMastering Customer Data on Apache Spark
Mastering Customer Data on Apache SparkCaserta
 
Moving Past Infrastructure Limitations
Moving Past Infrastructure LimitationsMoving Past Infrastructure Limitations
Moving Past Infrastructure LimitationsCaserta
 
Balancing Data Governance and Innovation
Balancing Data Governance and InnovationBalancing Data Governance and Innovation
Balancing Data Governance and InnovationCaserta
 
Introducing Kudu, Big Data Warehousing Meetup
Introducing Kudu, Big Data Warehousing MeetupIntroducing Kudu, Big Data Warehousing Meetup
Introducing Kudu, Big Data Warehousing MeetupCaserta
 
Balancing Data Governance and Innovation
Balancing Data Governance and InnovationBalancing Data Governance and Innovation
Balancing Data Governance and InnovationCaserta
 
What Data Do You Have and Where is It?
What Data Do You Have and Where is It? What Data Do You Have and Where is It?
What Data Do You Have and Where is It? Caserta
 
Setting Up the Data Lake
Setting Up the Data LakeSetting Up the Data Lake
Setting Up the Data LakeCaserta
 
Incorporating the Data Lake into Your Analytic Architecture
Incorporating the Data Lake into Your Analytic ArchitectureIncorporating the Data Lake into Your Analytic Architecture
Incorporating the Data Lake into Your Analytic ArchitectureCaserta
 

More from Caserta (20)

Looker Data Modeling in the Age of Cloud - BDW Meetup May 2, 2017
Looker Data Modeling in the Age of Cloud - BDW Meetup May 2, 2017Looker Data Modeling in the Age of Cloud - BDW Meetup May 2, 2017
Looker Data Modeling in the Age of Cloud - BDW Meetup May 2, 2017
 
The Rise of the CDO in Today's Enterprise
The Rise of the CDO in Today's EnterpriseThe Rise of the CDO in Today's Enterprise
The Rise of the CDO in Today's Enterprise
 
Building a New Platform for Customer Analytics
Building a New Platform for Customer Analytics Building a New Platform for Customer Analytics
Building a New Platform for Customer Analytics
 
Building New Data Ecosystem for Customer Analytics, Strata + Hadoop World, 2016
Building New Data Ecosystem for Customer Analytics, Strata + Hadoop World, 2016Building New Data Ecosystem for Customer Analytics, Strata + Hadoop World, 2016
Building New Data Ecosystem for Customer Analytics, Strata + Hadoop World, 2016
 
You're the New CDO, Now What?
You're the New CDO, Now What?You're the New CDO, Now What?
You're the New CDO, Now What?
 
The Data Lake - Balancing Data Governance and Innovation
The Data Lake - Balancing Data Governance and Innovation The Data Lake - Balancing Data Governance and Innovation
The Data Lake - Balancing Data Governance and Innovation
 
Making Big Data Easy for Everyone
Making Big Data Easy for EveryoneMaking Big Data Easy for Everyone
Making Big Data Easy for Everyone
 
Benefits of the Azure Cloud
Benefits of the Azure CloudBenefits of the Azure Cloud
Benefits of the Azure Cloud
 
Big Data Analytics on the Cloud
Big Data Analytics on the CloudBig Data Analytics on the Cloud
Big Data Analytics on the Cloud
 
Intro to Data Science on Hadoop
Intro to Data Science on HadoopIntro to Data Science on Hadoop
Intro to Data Science on Hadoop
 
The Emerging Role of the Data Lake
The Emerging Role of the Data LakeThe Emerging Role of the Data Lake
The Emerging Role of the Data Lake
 
Not Your Father's Database by Databricks
Not Your Father's Database by DatabricksNot Your Father's Database by Databricks
Not Your Father's Database by Databricks
 
Mastering Customer Data on Apache Spark
Mastering Customer Data on Apache SparkMastering Customer Data on Apache Spark
Mastering Customer Data on Apache Spark
 
Moving Past Infrastructure Limitations
Moving Past Infrastructure LimitationsMoving Past Infrastructure Limitations
Moving Past Infrastructure Limitations
 
Balancing Data Governance and Innovation
Balancing Data Governance and InnovationBalancing Data Governance and Innovation
Balancing Data Governance and Innovation
 
Introducing Kudu, Big Data Warehousing Meetup
Introducing Kudu, Big Data Warehousing MeetupIntroducing Kudu, Big Data Warehousing Meetup
Introducing Kudu, Big Data Warehousing Meetup
 
Balancing Data Governance and Innovation
Balancing Data Governance and InnovationBalancing Data Governance and Innovation
Balancing Data Governance and Innovation
 
What Data Do You Have and Where is It?
What Data Do You Have and Where is It? What Data Do You Have and Where is It?
What Data Do You Have and Where is It?
 
Setting Up the Data Lake
Setting Up the Data LakeSetting Up the Data Lake
Setting Up the Data Lake
 
Incorporating the Data Lake into Your Analytic Architecture
Incorporating the Data Lake into Your Analytic ArchitectureIncorporating the Data Lake into Your Analytic Architecture
Incorporating the Data Lake into Your Analytic Architecture
 

Recently uploaded

CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 

Recently uploaded (20)

CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 

Big Data Warehousing: Leveraging Solr & Mahout

  • 1. Leveraging Solr and Mahout for Next Gen Data Access and Insight Grant Ingersoll Chief Scientist Confidential © Copyright 2012
  • 2. Search is Dead, Long Live Search • Modern Data Challenges are multi-structured • Search is a system building block Content - Text is only a part of the story • If the algorithms fit, Content use them! Relationships Users • Embrace fuzziness! Access • Scoring features are everywhere Confidential and Proprietary © 2012 LucidWorks
  • 3. Topics • Intros • Search (R)Evolution • Apache Solr • Apache Mahout • Search and Machine Learning • Scaling Confidential and Proprietary 3 © 2012 LucidWorks
  • 4. Grant’s Background • Co-founder: - LucidWorks – Chief Scientist - Apache Mahout • Long time Lucene/Solr committer • Author: Taming Text - www.manning.com/ingersoll • Background in IR and NLP - Built CLIR, QA and a variety of other search-based apps Confidential and Proprietary © 2012 LucidWorks
  • 5. Search (R)evolution • Search use leads to search abuse - Denormalization frees your mind - Scoring is just a sparse matrix multiply • Lucene/Solr evolution - Non-free text usages abound - Many DB-like features - NoSQL before NoSQL was cool - Flexible indexing - Finite State Transducers FTW! • Scale • “This ain’t your father’s relevance anymore” Confidential and Proprietary © 2012 LucidWorks
  • 6. Apache Solr? • “Solr is an open source enterprise search server based on the Lucene Java search library, with XML/HTTP and JSON APIs, hit highlighting, faceted search, caching, replication, a web administration interface and many more features. It runs in a Java servlet container such as Tomcat. “ - http://lucene.apache.org/solr • Did I mention free? Confidential and Proprietary © 2012 LucidWorks
  • 7. Apache Mahout • Goal: create library of scalable machine learning algorithms • Mahout’s 3 “C”s provide tools for helping across many aspects of discovery - Collaborative Filtering - Classification - Clustering • Also: - Collocations (Statistically Interesting Phrases) - SVD - Java math, primitives libraries and more Confidential and Proprietary © 2012 LucidWorks
  • 8. Search + Machine Learning • Search-driven applications present multiple opportunities for leveraging machine learning - Clustering – Enhance Discovery, outlier detection - Classification – Queries, Documents, Users - Content Recommendation – Collab. Filtering and personalization - NLP – phrases, named entities, co-reference, much more • Many of these can also power faceted navigation • Aside: Search can also often be used effectively to implement many machine learning algorithms Confidential and Proprietary © 2012 LucidWorks
  • 9. How and When Access APIs •View into Search View Analytic numeric/hist Personalization & oric data 1 Services Machine Learning 2 Services Shards 3 N •Classification •Recommendation •Documents Classification Discovery & Document Store •Users Models Enrichment •Logs Clustering, classific In memory ation, NLP, topic Replicated identification, searc Multi-tenant h log analysis, user behavior Content Acquisition ETL, batch or near real-time Data • LucidWorks Search connectors • Push Confidential and Proprietary © 2012 LucidWorks
  • 10. Scaling • Search - Solr Cloud = Large scale, distributed search and faceting » http://wiki.apache.org/solr/SolrCloud • Machine Learning - Mahout is built on Hadoop for most things - SGD is sequential and really fast • Sometimes all you can do is make an educated guess - Storm, Kafka, etc. can help by allowing you to make estimates in near real time Confidential and Proprietary © 2012 LucidWorks
  • 11. Wrap • Search, Discovery and Analytics, when combined into a single, coherent system provides powerful insight into both your content and your users • LucidWorks has combined many of these things into LucidWorks Big Data - http://www.lucidworks.com/products/lucidworks-big-data • Design for the big picture when building search-based applications Confidential and Proprietary © 2012 LucidWorks
  • 12. Resources • LucidWorks - http://www.lucidworks.com - http://www.lucidworks.com/products/lucidworks-big-data - @LucidImagineer • Me - grant@lucidworks.com - @gsingers • Taming Text - http://www.manning.com/ingersoll - http://www.tamingtext.com - @tamingtext Confidential and Proprietary © 2012 LucidWorks

Editor's Notes

  1. This is a money slide where people should say “Wow man”. They shouldn’t understand the implications of this, but they should be very, very aware that something big just slide into the room.Tech Building Block: Not just textNot just users + queriesEmbrace Fuzziness: Esp. in Big Data, it is the only way you are going to survive.TED: I think that this should make the case for advanced that is still search at its heart. The idea that search can be radically changed should be on the next slide.
  2. Search Abuse Can discuss how I started just doing free text, but then a curious thing happened, started to see people using the engine for things like: key/value, denormalized DBs, browsing engines, plagiarism detection, teaching languages, record linkage and much, much moreSearch has added more DB features over the yearsTED: We need to introduce the idea of *REVOLUTION* somewhere in here.
  3. Big Picture: too often devs are stuck in the weeds