SlideShare a Scribd company logo
1 of 28
Infinit.e: An Open Analytic Platform
  Driven by MongoDB & Hadoop
Agenda
• Who we are
• What Infinit.e is
• Architecture
  –   Use of Open Source
  –   Elasticsearch / MongoDB / Hadoop combo
  –   Focus on MongoDB
  –   Focus on Hadoop
• Demo
• Questions
Who we are
IKANOW (ikanow.com)
• Our vision is to enable agile intelligence
  through open analytics
• Our engineering vision is to use the best OSS
  technologies to build a document analysis
  platform that will enable this and then Open
  Source it back to the community
  – https://github.com/IKANOW/Infinit.e
  – http://bit.ly/ikanow-oss
What Infinit.e is
Infinit.e is a scalable framework for:
• Collecting,
• Storing,
• Enriching,
• Retrieving,
• Analyzing, and
• Visualizing
Unstructured documents and structured
records
What Infinit.e is - Overview
What Infinit.e is - Documents

20% Structured
     • Log files
     • Databases
     • Apps        •   Documents
                   •   Presentations
                   •   Spreadsheets
                   •   Meeting notes
                   •   Email
                   •   IM chats
                   •   Reports
                   •   Social

         80% Unstructured              •   Entities
                                       •   Events
                                       •   Facts
                                       •   Sentiment
Unstructured and Structured Data       •   Geospatial
                                       •   Temporal
                                       •   Themes
What Infinit.e is - Documents
        Duke and Progress announced merger plans in
        January 2012
            Bernanke, 57 said in his testimony price increases
            “have begun to moderate” after a jump in oil costs   Who
                                                                 people, organizations,
            earlier this year
                                                                 facilities, company

                                                                 What
               Tablet ownership levels hit 18% in China, the     events, summaries,
                UK and US versus 3% in November 2010             facts, themes

                                                                 When
       <Incident>                                                past, present, future
         <uid>20101043423</uid>                                  dates
         <subject>1 person killed in armed attack by
        suspected Boko Haram in
                                                                 Where
        Maiduguri, Borno, Nigeria</subject>                      city, state, country,
         <multipleDays>No</multipleDays>                         coordinate
         <eventDate>06/04/2011</eventDate>
       </Incident>
What Infinit.e is - Framework
What Infinit.e is - Visualization
Architecture
Use of Open Source
Architecture
Use of Open Source
Architecture
The 3 Key Elements
Architecture
Focus on MongoDB

3 key areas of benefit:
• Development
• Integration
• Deployment
MongoDB Development
Document analysis – lots of complex generic logic written in
Java
    The “records” are all complex objects
         BSON/JSON is a perfect representation
     Usually code maintainability is most important
          BSON → “Plain Old Java Object”
              (we use GSON, probably JACKSON is better;
               though GSON extensions for MongoDB types
               like dates and ObjectIds worked nicely)
      Sometimes performance is most important
           Option to stay in BSON
MongoDB – Dev Examples
   Converting to “POJO”
DocumentPojo docIn = new DocumentPojo();
docIn.setId(new ObjectId(idStr));
DocumentPojo docOut = DocumentPojo.fromDb(
    DbManager.getDocument().getMetadata().findOne(docIn.toDb()));

   Hybrid
BasicDBObject query = new BasicDBObject(DocumentPojo.communityId_,
    new BasicDBObject(MongoDbManager.in_, communityIdList));
// (then as above)

   Working in BSON only
BasicDBList l = (BasicDBList)(f.get(DocumentPojo.entities_));
for(Iterator<?> e0 = l.iterator(); e0.hasNext();){
    BasicDBObject e = (BasicDBObject)e0.next();
MongoDB
Changing Data Model
Standard requirement, particularly for an evolving project based on
whatever functionality can be derived from the latest technologies…
•  Example
             We have sentiment as a property of entity
              (person/place/organization)
             association links 2 entity objects via a verb
             New capability: NLP engine can now provide directed sentiment
              from one entity to another!
• Often requires no extra dev effort at all...
             Adding fields, eg just add sentiment to association above
•  Otherwise, built in JSON format makes data model migrations easy
             Have performed 2 major data model changes in 18 months, both
              via simple map/reduce scripts, with backwards compatibility
MongoDB Integration
Infinit.e is based on NoSQL and web 2.0 technologies
     ElasticSearch – JSON engine
      Javascript/Actionscript – JSON a key component
       NLP SaaS engines – JSON-based
A key component of the custom ingest/enrichment is the
ability to tag arbitrary source-specific metadata onto
documents
        Allows custom search / analytics / visualization
         “Best of both worlds” in conjunction with generic data
           model
          Schema-less storage is essential
MongoDB Deployment
Need to scale in many directions:
     Writes due to new documents
     Reads for queries
     The ability to scale execution of domain
      specific logic
          On ingest
          Batch analytics
Infinit.e is designed to use platforms like
EC2 to scale
MongoDB Deployment
MongoDB Deployment
MongoDB scalability
   Works!
           Scales to arbitrary sizes in both read/write dimensions
      Sophisticated sharding keys provide powerful/flexible
       balancing
      Downsides:
           Building an initial cluster is quite complex
           Managing cluster changes is quite fiddly
      For Infinit.e we used CloudFormation templates and
       (RPM-based) install scripts to manage the cluster
           Works OK, a graphical tool and some more robustness would be
            nice
               (on our roadmap, but not very close!)
MongoDB Deployment
MongoDB/EC2 integration
   m1.xlarge works best for our needs (m1.large is fine for
    ~0.5M docs)
          4 cores, 15GB
          4 500GB ephemeral disks that we RAID-0 together
              (without that performance dropped off a cliff at >1M docs)
Architecture
Focus on Hadoop
Why Hadoop?
   Queries/aggregation/visualization is an excellent first step
    for document analysis, and is often all that's required
           More complex analytics requires
                 Access to all of the data, not pre-aggregated or selected
                 A high level programming language, mature libraries etc
      Hadoop is becoming the de-facto standard for data
       analytics
           Open Source, very customizable
           Proven scalability
           Java libraries
           Mahout project (machine learning libraries for Hadoop)
           Amazon elastic cloud
Architecture
MongoDB / Hadoop
Infinit.e Demonstration
Infinit.e Demonstration
Infinit.e Demonstration
Infinit.e Demonstration
Thank You!!!


              Alex Piggott
    Director of Product Engineering
        apiggott@ikanow.com

More Related Content

Similar to MongoDC - Ikanow April 2012 Meetup

Big Data made easy in the era of the Cloud - Demi Ben-Ari
Big Data made easy in the era of the Cloud - Demi Ben-AriBig Data made easy in the era of the Cloud - Demi Ben-Ari
Big Data made easy in the era of the Cloud - Demi Ben-AriDemi Ben-Ari
 
Rails with MongoDB
Rails with MongoDBRails with MongoDB
Rails with MongoDBEugene Park
 
Bangalore Executive Seminar 2015: Case Study - Text Analysis on MongoDB for a...
Bangalore Executive Seminar 2015: Case Study - Text Analysis on MongoDB for a...Bangalore Executive Seminar 2015: Case Study - Text Analysis on MongoDB for a...
Bangalore Executive Seminar 2015: Case Study - Text Analysis on MongoDB for a...MongoDB
 
Getting Started with MongoDB at Oracle Open World 2012
Getting Started with MongoDB at Oracle Open World 2012Getting Started with MongoDB at Oracle Open World 2012
Getting Started with MongoDB at Oracle Open World 2012MongoDB
 
Big Data with IOT approach and trends with case study
Big Data with IOT approach and trends with case studyBig Data with IOT approach and trends with case study
Big Data with IOT approach and trends with case studySharjeel Imtiaz
 
MongoDB meetup at Hike
MongoDB meetup at HikeMongoDB meetup at Hike
MongoDB meetup at HikeBharvi Dixit
 
How to Get Started with Your MongoDB Pilot Project
How to Get Started with Your MongoDB Pilot ProjectHow to Get Started with Your MongoDB Pilot Project
How to Get Started with Your MongoDB Pilot ProjectDATAVERSITY
 
Nuxeo World Session: Gagnavarslan and Nuxeo - Building software services on N...
Nuxeo World Session: Gagnavarslan and Nuxeo - Building software services on N...Nuxeo World Session: Gagnavarslan and Nuxeo - Building software services on N...
Nuxeo World Session: Gagnavarslan and Nuxeo - Building software services on N...Nuxeo
 
MongoDB presentation
MongoDB presentationMongoDB presentation
MongoDB presentationHyphen Call
 
How IKANOW uses MongoDB to help organizations solve really big problems
How IKANOW uses MongoDB to help organizations solve really big problemsHow IKANOW uses MongoDB to help organizations solve really big problems
How IKANOW uses MongoDB to help organizations solve really big problemsikanow
 
Partner Webinar: Deliver Big Data Apps Faster With Informatica & MongoDB
Partner Webinar: Deliver Big Data Apps Faster With Informatica & MongoDBPartner Webinar: Deliver Big Data Apps Faster With Informatica & MongoDB
Partner Webinar: Deliver Big Data Apps Faster With Informatica & MongoDBMongoDB
 
Using MongoDB + Hadoop Together
Using MongoDB + Hadoop TogetherUsing MongoDB + Hadoop Together
Using MongoDB + Hadoop TogetherMongoDB
 
Big Data Technologies.pdf
Big Data Technologies.pdfBig Data Technologies.pdf
Big Data Technologies.pdfRAHULRAHU8
 
Introducing MongoDB into your Organization
Introducing MongoDB into your OrganizationIntroducing MongoDB into your Organization
Introducing MongoDB into your OrganizationMongoDB
 
Introduction to MongoDB Basics from SQL to NoSQL
Introduction to MongoDB Basics from SQL to NoSQLIntroduction to MongoDB Basics from SQL to NoSQL
Introduction to MongoDB Basics from SQL to NoSQLMayur Patil
 
Big data seminor
Big data seminorBig data seminor
Big data seminorberasrujana
 
From Traditional ECM to Content Services: Modernizing Content Management with...
From Traditional ECM to Content Services: Modernizing Content Management with...From Traditional ECM to Content Services: Modernizing Content Management with...
From Traditional ECM to Content Services: Modernizing Content Management with...Joel Oleson
 

Similar to MongoDC - Ikanow April 2012 Meetup (20)

Big Data made easy in the era of the Cloud - Demi Ben-Ari
Big Data made easy in the era of the Cloud - Demi Ben-AriBig Data made easy in the era of the Cloud - Demi Ben-Ari
Big Data made easy in the era of the Cloud - Demi Ben-Ari
 
Rails with MongoDB
Rails with MongoDBRails with MongoDB
Rails with MongoDB
 
Bangalore Executive Seminar 2015: Case Study - Text Analysis on MongoDB for a...
Bangalore Executive Seminar 2015: Case Study - Text Analysis on MongoDB for a...Bangalore Executive Seminar 2015: Case Study - Text Analysis on MongoDB for a...
Bangalore Executive Seminar 2015: Case Study - Text Analysis on MongoDB for a...
 
Getting Started with MongoDB at Oracle Open World 2012
Getting Started with MongoDB at Oracle Open World 2012Getting Started with MongoDB at Oracle Open World 2012
Getting Started with MongoDB at Oracle Open World 2012
 
Big Data with IOT approach and trends with case study
Big Data with IOT approach and trends with case studyBig Data with IOT approach and trends with case study
Big Data with IOT approach and trends with case study
 
MongoDB meetup at Hike
MongoDB meetup at HikeMongoDB meetup at Hike
MongoDB meetup at Hike
 
How to Get Started with Your MongoDB Pilot Project
How to Get Started with Your MongoDB Pilot ProjectHow to Get Started with Your MongoDB Pilot Project
How to Get Started with Your MongoDB Pilot Project
 
Big Data in Azure
Big Data in AzureBig Data in Azure
Big Data in Azure
 
Nuxeo World Session: Gagnavarslan and Nuxeo - Building software services on N...
Nuxeo World Session: Gagnavarslan and Nuxeo - Building software services on N...Nuxeo World Session: Gagnavarslan and Nuxeo - Building software services on N...
Nuxeo World Session: Gagnavarslan and Nuxeo - Building software services on N...
 
MongoDB presentation
MongoDB presentationMongoDB presentation
MongoDB presentation
 
How IKANOW uses MongoDB to help organizations solve really big problems
How IKANOW uses MongoDB to help organizations solve really big problemsHow IKANOW uses MongoDB to help organizations solve really big problems
How IKANOW uses MongoDB to help organizations solve really big problems
 
Partner Webinar: Deliver Big Data Apps Faster With Informatica & MongoDB
Partner Webinar: Deliver Big Data Apps Faster With Informatica & MongoDBPartner Webinar: Deliver Big Data Apps Faster With Informatica & MongoDB
Partner Webinar: Deliver Big Data Apps Faster With Informatica & MongoDB
 
Using MongoDB + Hadoop Together
Using MongoDB + Hadoop TogetherUsing MongoDB + Hadoop Together
Using MongoDB + Hadoop Together
 
Big Data Technologies.pdf
Big Data Technologies.pdfBig Data Technologies.pdf
Big Data Technologies.pdf
 
Introducing MongoDB into your Organization
Introducing MongoDB into your OrganizationIntroducing MongoDB into your Organization
Introducing MongoDB into your Organization
 
Introduction to MongoDB Basics from SQL to NoSQL
Introduction to MongoDB Basics from SQL to NoSQLIntroduction to MongoDB Basics from SQL to NoSQL
Introduction to MongoDB Basics from SQL to NoSQL
 
Big Data
Big DataBig Data
Big Data
 
Big data seminor
Big data seminorBig data seminor
Big data seminor
 
From Traditional ECM to Content Services: Modernizing Content Management with...
From Traditional ECM to Content Services: Modernizing Content Management with...From Traditional ECM to Content Services: Modernizing Content Management with...
From Traditional ECM to Content Services: Modernizing Content Management with...
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and Hadoop
 

More from ikanow

Aliasing Use Cases - How to Use IKANOW to Crunch Big Data
Aliasing Use Cases - How to Use IKANOW to Crunch Big DataAliasing Use Cases - How to Use IKANOW to Crunch Big Data
Aliasing Use Cases - How to Use IKANOW to Crunch Big Dataikanow
 
Mongo db washington dc 2014
Mongo db washington dc 2014Mongo db washington dc 2014
Mongo db washington dc 2014ikanow
 
Cloud computing with AWS
Cloud computing with AWS Cloud computing with AWS
Cloud computing with AWS ikanow
 
Building Effective Frameworks for Social Media Analysis
Building Effective Frameworks for Social Media AnalysisBuilding Effective Frameworks for Social Media Analysis
Building Effective Frameworks for Social Media Analysisikanow
 
Open Analytics DC June 2012 Presentation
Open Analytics DC June 2012 PresentationOpen Analytics DC June 2012 Presentation
Open Analytics DC June 2012 Presentationikanow
 
Open Analytics DC April 2012 Meetup
Open Analytics DC April 2012 MeetupOpen Analytics DC April 2012 Meetup
Open Analytics DC April 2012 Meetupikanow
 
Hadoop MapReduce - I'm Sold, Now What?
Hadoop MapReduce - I'm Sold, Now What?Hadoop MapReduce - I'm Sold, Now What?
Hadoop MapReduce - I'm Sold, Now What?ikanow
 
Agile intelligence through Open Analytics
Agile intelligence through Open AnalyticsAgile intelligence through Open Analytics
Agile intelligence through Open Analyticsikanow
 
Social Intelligence: Realizing Business Value in Big Data
Social Intelligence: Realizing Business Value in Big DataSocial Intelligence: Realizing Business Value in Big Data
Social Intelligence: Realizing Business Value in Big Dataikanow
 

More from ikanow (9)

Aliasing Use Cases - How to Use IKANOW to Crunch Big Data
Aliasing Use Cases - How to Use IKANOW to Crunch Big DataAliasing Use Cases - How to Use IKANOW to Crunch Big Data
Aliasing Use Cases - How to Use IKANOW to Crunch Big Data
 
Mongo db washington dc 2014
Mongo db washington dc 2014Mongo db washington dc 2014
Mongo db washington dc 2014
 
Cloud computing with AWS
Cloud computing with AWS Cloud computing with AWS
Cloud computing with AWS
 
Building Effective Frameworks for Social Media Analysis
Building Effective Frameworks for Social Media AnalysisBuilding Effective Frameworks for Social Media Analysis
Building Effective Frameworks for Social Media Analysis
 
Open Analytics DC June 2012 Presentation
Open Analytics DC June 2012 PresentationOpen Analytics DC June 2012 Presentation
Open Analytics DC June 2012 Presentation
 
Open Analytics DC April 2012 Meetup
Open Analytics DC April 2012 MeetupOpen Analytics DC April 2012 Meetup
Open Analytics DC April 2012 Meetup
 
Hadoop MapReduce - I'm Sold, Now What?
Hadoop MapReduce - I'm Sold, Now What?Hadoop MapReduce - I'm Sold, Now What?
Hadoop MapReduce - I'm Sold, Now What?
 
Agile intelligence through Open Analytics
Agile intelligence through Open AnalyticsAgile intelligence through Open Analytics
Agile intelligence through Open Analytics
 
Social Intelligence: Realizing Business Value in Big Data
Social Intelligence: Realizing Business Value in Big DataSocial Intelligence: Realizing Business Value in Big Data
Social Intelligence: Realizing Business Value in Big Data
 

Recently uploaded

Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfIngrid Airi González
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityIES VE
 
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS:  6 Ways to Automate Your Data IntegrationBridging Between CAD & GIS:  6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integrationmarketing932765
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsRavi Sanghani
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...Wes McKinney
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesKari Kakkonen
 
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...itnewsafrica
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPathCommunity
 
Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...itnewsafrica
 
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesMuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesManik S Magar
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch TuesdayIvanti
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfNeo4j
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Alkin Tezuysal
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Hiroshi SHIBATA
 
Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024TopCSSGallery
 

Recently uploaded (20)

Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdf
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS:  6 Ways to Automate Your Data IntegrationBridging Between CAD & GIS:  6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examples
 
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 
Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...
 
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesMuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 
Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024
 

MongoDC - Ikanow April 2012 Meetup

  • 1. Infinit.e: An Open Analytic Platform Driven by MongoDB & Hadoop
  • 2. Agenda • Who we are • What Infinit.e is • Architecture – Use of Open Source – Elasticsearch / MongoDB / Hadoop combo – Focus on MongoDB – Focus on Hadoop • Demo • Questions
  • 3. Who we are IKANOW (ikanow.com) • Our vision is to enable agile intelligence through open analytics • Our engineering vision is to use the best OSS technologies to build a document analysis platform that will enable this and then Open Source it back to the community – https://github.com/IKANOW/Infinit.e – http://bit.ly/ikanow-oss
  • 4. What Infinit.e is Infinit.e is a scalable framework for: • Collecting, • Storing, • Enriching, • Retrieving, • Analyzing, and • Visualizing Unstructured documents and structured records
  • 5. What Infinit.e is - Overview
  • 6. What Infinit.e is - Documents 20% Structured • Log files • Databases • Apps • Documents • Presentations • Spreadsheets • Meeting notes • Email • IM chats • Reports • Social 80% Unstructured • Entities • Events • Facts • Sentiment Unstructured and Structured Data • Geospatial • Temporal • Themes
  • 7. What Infinit.e is - Documents Duke and Progress announced merger plans in January 2012 Bernanke, 57 said in his testimony price increases “have begun to moderate” after a jump in oil costs Who people, organizations, earlier this year facilities, company What Tablet ownership levels hit 18% in China, the events, summaries, UK and US versus 3% in November 2010 facts, themes When <Incident> past, present, future <uid>20101043423</uid> dates <subject>1 person killed in armed attack by suspected Boko Haram in Where Maiduguri, Borno, Nigeria</subject> city, state, country, <multipleDays>No</multipleDays> coordinate <eventDate>06/04/2011</eventDate> </Incident>
  • 8. What Infinit.e is - Framework
  • 9. What Infinit.e is - Visualization
  • 13. Architecture Focus on MongoDB 3 key areas of benefit: • Development • Integration • Deployment
  • 14. MongoDB Development Document analysis – lots of complex generic logic written in Java  The “records” are all complex objects  BSON/JSON is a perfect representation  Usually code maintainability is most important  BSON → “Plain Old Java Object”  (we use GSON, probably JACKSON is better; though GSON extensions for MongoDB types like dates and ObjectIds worked nicely)  Sometimes performance is most important  Option to stay in BSON
  • 15. MongoDB – Dev Examples  Converting to “POJO” DocumentPojo docIn = new DocumentPojo(); docIn.setId(new ObjectId(idStr)); DocumentPojo docOut = DocumentPojo.fromDb( DbManager.getDocument().getMetadata().findOne(docIn.toDb()));  Hybrid BasicDBObject query = new BasicDBObject(DocumentPojo.communityId_, new BasicDBObject(MongoDbManager.in_, communityIdList)); // (then as above)  Working in BSON only BasicDBList l = (BasicDBList)(f.get(DocumentPojo.entities_)); for(Iterator<?> e0 = l.iterator(); e0.hasNext();){ BasicDBObject e = (BasicDBObject)e0.next();
  • 16. MongoDB Changing Data Model Standard requirement, particularly for an evolving project based on whatever functionality can be derived from the latest technologies… • Example  We have sentiment as a property of entity (person/place/organization)  association links 2 entity objects via a verb  New capability: NLP engine can now provide directed sentiment from one entity to another! • Often requires no extra dev effort at all...  Adding fields, eg just add sentiment to association above • Otherwise, built in JSON format makes data model migrations easy  Have performed 2 major data model changes in 18 months, both via simple map/reduce scripts, with backwards compatibility
  • 17. MongoDB Integration Infinit.e is based on NoSQL and web 2.0 technologies  ElasticSearch – JSON engine  Javascript/Actionscript – JSON a key component  NLP SaaS engines – JSON-based A key component of the custom ingest/enrichment is the ability to tag arbitrary source-specific metadata onto documents  Allows custom search / analytics / visualization  “Best of both worlds” in conjunction with generic data model  Schema-less storage is essential
  • 18. MongoDB Deployment Need to scale in many directions:  Writes due to new documents  Reads for queries  The ability to scale execution of domain specific logic  On ingest  Batch analytics Infinit.e is designed to use platforms like EC2 to scale
  • 20. MongoDB Deployment MongoDB scalability  Works!  Scales to arbitrary sizes in both read/write dimensions  Sophisticated sharding keys provide powerful/flexible balancing  Downsides:  Building an initial cluster is quite complex  Managing cluster changes is quite fiddly  For Infinit.e we used CloudFormation templates and (RPM-based) install scripts to manage the cluster  Works OK, a graphical tool and some more robustness would be nice  (on our roadmap, but not very close!)
  • 21. MongoDB Deployment MongoDB/EC2 integration  m1.xlarge works best for our needs (m1.large is fine for ~0.5M docs)  4 cores, 15GB  4 500GB ephemeral disks that we RAID-0 together  (without that performance dropped off a cliff at >1M docs)
  • 22. Architecture Focus on Hadoop Why Hadoop?  Queries/aggregation/visualization is an excellent first step for document analysis, and is often all that's required  More complex analytics requires  Access to all of the data, not pre-aggregated or selected  A high level programming language, mature libraries etc  Hadoop is becoming the de-facto standard for data analytics  Open Source, very customizable  Proven scalability  Java libraries  Mahout project (machine learning libraries for Hadoop)  Amazon elastic cloud
  • 28. Thank You!!! Alex Piggott Director of Product Engineering apiggott@ikanow.com