The Open Source document analysis platform



  Or, how IKANOW uses MongoDB
to help organizations solve really big problems
Agenda
• What is Document Analysis?
• The Infinit.e Solution
  – Infinit.e’s Architecture
  – Why and How we use MongoDB
• Analyzing #MongoDC
• Questions
This is what Big Data Looks Like




Shamelessly stolen from:
http://techbuddha.wordpress.com/2011/09/04/big-data-are-you-creating-a-garbage-dump-or-mountains-of-gold/
What is Document Analysis?
 "Document Analysis refers to
 computer-assisted analysis of large numbers
 of documents in order to answer questions
 about the content of a document set."
 Source: http://www.text-tech.com/docanalysis/definition.html
Document Analysis
• Common document source formats:
RSS                JSON            XML

HTML               PDF             TXT

RTF                Word            PPT

Multimedia Files   RDBMS Records   ETC.
Document Analysis
• The goal is to:
  – Extract Entities (people, places, things)
  – Create Associations between entities (in the
    form of noun-verb-noun), e.g.:
     •   John Doe lives in Washington, D.C.
     •   John Doe is married to Jane Doe
     •   John Doe is a Virgo
     •   John Doe traveled to Mexico on July 6th, 2011
• And…
Document Analysis
• Turn Who, What, When and
  Where into a unified data structure that
  supports data analytics and visualization.
• Who: people, organizations, facilities, company
• What: events, summaries, facts, themes
• When: past, present, future dates
• Where: city, state, country, coordinate
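One way to picture such a unified structure is sketched below in Python. The class and field names (`Entity`, `Association`, `dimension`, `by_dimension`) are illustrative assumptions, not Infinit.e's actual schema; the sample data reuses the John Doe examples from the previous slide.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Entity:
    """A person, place, thing, or date extracted from a document."""
    name: str
    dimension: str  # one of "Who", "What", "When", "Where"


@dataclass(frozen=True)
class Association:
    """A noun-verb-noun link between two entities, optionally dated."""
    subject: Entity
    verb: str
    obj: Entity
    when: Optional[str] = None


def by_dimension(associations: List[Association], dimension: str) -> List[str]:
    """Collect the distinct entity names that fall under one dimension."""
    names: List[str] = []
    for assoc in associations:
        for entity in (assoc.subject, assoc.obj):
            if entity.dimension == dimension and entity.name not in names:
                names.append(entity.name)
    return names


john = Entity("John Doe", "Who")
associations = [
    Association(john, "lives in", Entity("Washington, D.C.", "Where")),
    Association(john, "is married to", Entity("Jane Doe", "Who")),
    Association(john, "traveled to", Entity("Mexico", "Where"), when="2011-07-06"),
]

print(by_dimension(associations, "Where"))  # ['Washington, D.C.', 'Mexico']
```

Keeping the dimension on the entity rather than on the association means the same records can be grouped by Who or by Where without reshaping the data.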
The Infinit.e Solution
• Infinit.e is an Open Source
  document discovery and
  analysis platform that has
  these very cool Open Source
  tools lurking under the hood.


      github.com/ikanow/Infinit.e
The Infinit.e Solution

Infinit.e is a scalable framework for collecting, storing, enriching,
retrieving, analyzing, and visualizing structured and unstructured documents.
IkanMeow
Document Collection
• Infinit.e harvests documents from:

  – URLs

  – File Shares

  – Databases
Sample RSS Document
<rss version="2.0">
<channel>
…
<item>
    <title>Mediterranean conference seeks to flourish tourism in Egypt, Tunisia… </title>
    <link>http://www.pressreleasebureau.com/mediterranean-conference-seeks-to-flourish-tourism-in-egypt-tunisia-report-by-egyptlastminute-com-13613.html</link>
    <description>Report by egyptlastminute.com CAIRO: On Monday, the countries of the
    Mediterranean opened a conference seeking to enhance the future of tourism in the region. The
    conference focuses on the countries of Egypt and Tunisia the most …
    </description>
    <dc:publisher>Latest Press Releases | Press Release Bureau</dc:publisher>
    <dc:creator>unknown</dc:creator>
    <dc:date>Sat, 21 Apr 2012 00:00:00 GMT</dc:date>
</item>
…
</channel>
</rss>
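As a rough illustration of the harvesting step, the sketch below parses a feed shaped like the sample above into flat Python dicts using only the standard library. It is not Infinit.e's harvester: the function name `harvest_items`, the output field names, and the placeholder URL are assumptions, and the sample feed declares the Dublin Core namespace explicitly so the `dc:` elements resolve.

```python
import xml.etree.ElementTree as ET

# Dublin Core namespace commonly bound to the "dc" prefix in RSS feeds.
DC = "{http://purl.org/dc/elements/1.1/}"

# Shortened stand-in for the sample feed; the link is a placeholder URL.
SAMPLE_FEED = """<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
<channel>
<item>
    <title>Mediterranean conference seeks to flourish tourism in Egypt, Tunisia</title>
    <link>http://www.example.com/article.html</link>
    <description>Report by egyptlastminute.com CAIRO: On Monday ...</description>
    <dc:publisher>Latest Press Releases | Press Release Bureau</dc:publisher>
    <dc:date>Sat, 21 Apr 2012 00:00:00 GMT</dc:date>
</item>
</channel>
</rss>"""


def harvest_items(feed_xml):
    """Turn each <item> in an RSS feed into a flat dict of its text fields."""
    root = ET.fromstring(feed_xml)
    docs = []
    for item in root.iter("item"):
        docs.append({
            "title": (item.findtext("title") or "").strip(),
            "url": (item.findtext("link") or "").strip(),
            "description": (item.findtext("description") or "").strip(),
            "publisher": (item.findtext(DC + "publisher") or "").strip(),
            "published": (item.findtext(DC + "date") or "").strip(),
        })
    return docs


docs = harvest_items(SAMPLE_FEED)
print(docs[0]["title"])
```

Missing elements come back as empty strings rather than raising, which suits feeds where optional fields such as `dc:creator` are often absent.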
Full Text Source
Source Ingestion Data Flow
Document DBs and Collections
Document Metadata
• doc_metadata.metadata
{
    "_id" : ObjectId("4f93638e0cf212156d0559d2"),
    "title" : "Mediterranean conference seeks to flourish tourism in Egypt, Tunisia ...",
    "url" : "http://www.pressreleasebureau.com/mediterranean-conference-seeks-to-flourish-tourism-in-egypt-tunisia-report-by-egyptlastminute-com-13613.html",
    "description" : "Report by egyptlastminute.com CAIRO: On Monday, the countries of the Mediterranean opened a conference seeking to enhance the future of tourism in the region. The conference focuses on the countries of Egypt and Tunisia; the most ...",
    "created" : ISODate("2012-04-22T01:49:02Z"),
    "metadata" : {…},
    "associations" : […],
    "entities" : […],
    ...
}
Harvested Document Metadata
• doc_metadata.metadata.metadata
"metadata" : {
     "location" : [
            {    // Note: It is okay to laugh at this
                   "region" : "South Asia",
                   "citystateprovince" : {
                          "stateprovince" : "Rolpa", "city" : "Newang"
                   },
                   "country" : "Nepal"
            }
     ],
     "icn" : [ "200573487" ],
     "incidentdate" : [ "07/25/2005" ],
     "organization" : [
            "Communist Party of Nepal (Maoist)/United People's Front"
     ],
     ...
},
Document Enrichment
• Infinit.e supports the extraction of entities
  and creation of associations using a
  combination of built-in enrichment libraries
  and third-party NLP APIs including:
Harvested Entities
• feature.entity
{
    "_id" : ObjectId("4f9189d48baf188282a1c9ef"),
    "alias" : [
           "Zine el Abidine Ben Ali",
           "Zine El Abidine Ben Ali",
           "Zine el Abidine ben Ali"
    ],
    "batch_resync" : true,
    "communityId" : ObjectId("4f8f138103644ee8003bf518"),
    "db_sync_doccount" : NumberLong(143),
    "db_sync_time" : "1338751174988",
    "dimension" : "Who",
    "disambiguated_name" : "Zine El Abidine Ben Ali",
    "doccount" : 152,
    "index" : "zine el abidine ben ali/person",
    "totalfreq" : 353,
    "type" : "Person"
}
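In the record above, the `index` value looks like the disambiguated name and the type, lowercased and joined with a slash. The Python sketch below reproduces that pattern and uses it to group alias spellings. This is an inference from a single sample, not documented Infinit.e behavior, and the function names are hypothetical; real disambiguation involves far more than case-folding.

```python
def entity_index(disambiguated_name, entity_type):
    """Build a lookup key of the form "<lowercased name>/<lowercased type>"."""
    return f"{disambiguated_name.strip().lower()}/{entity_type.strip().lower()}"


def group_aliases(mentions, entity_type):
    """Group surface forms that collapse to the same index key."""
    groups = {}
    for mention in mentions:
        groups.setdefault(entity_index(mention, entity_type), []).append(mention)
    return groups


mentions = [
    "Zine el Abidine Ben Ali",
    "Zine El Abidine Ben Ali",
    "Zine el Abidine ben Ali",
]

print(entity_index("Zine El Abidine Ben Ali", "Person"))
print(group_aliases(mentions, "Person"))
```

All three spellings from the `alias` array fall under one key, which is the kind of property that lets counts such as `doccount` and `totalfreq` be kept per entity rather than per spelling.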
Harvested Entities
Harvested Associations
• feature.association
{
    "_id" : ObjectId("4f9189d48baf188282a1ca24"),
    "assoc_type" : "Fact",
    "communityId" : ObjectId("4f8f138103644ee8003bf518"),
    "db_sync_doccount" : NumberLong(70),
    "db_sync_time" : "1338491609281",
    "doccount" : NumberLong(73),
    "entity1" : [
           "zine el abidine ben ali",
           "zine el abidine ben ali/person"
    ],
    "entity1_index" : "zine el abidine ben ali/person",
    "entity2" : ["president","president/position"],
    "entity2_index" : "president/position",
    "index" : "5e3fff27ddb78d6873ccfc77cf05c52f",
    "verb" : ["career","current","past"],
    "verb_category" : "career"
}
Harvested Associations
Geolocation of Entities/Events
• feature.geo
{
    "_id" : ObjectId("4d8bb5efbe07bb4f7036c82e"),
    "search_field" : "cairo",
    "country" : "Egypt",
    "country_code" : "EG",
    "city" : "cairo",
    "region" : "Al Qahirah",
    "region_code" : "EG11",
    "population" : 7734602,
    "latitude" : "30.05",
    "longitude" : "31.25",
    "geoindex" : {    // Note: MongoDB 2d Index
           "lat" : 30.05,
           "lon" : 31.25
    }
}
Geolocation of Entities/Events
Who, What, Where and When
Why MongoDB? – Reason #1
Document-Oriented Storage
• MongoDB’s document-oriented storage
  (i.e. schema-less) is perfectly suited to the
  data design requirements of a system that
  needs to ingest a wide variety of
  structured and unstructured document
  formats and normalize them into one
  unified, semi-structured format
Why MongoDB? – Reason #2
JSON
• The standard language of open document
  analysis
  – JSON is a common interchange format supported
    by tools like elasticsearch and SaaS NLP engines
  – BSON (Binary JSON) is MongoDB’s native data
    format
  – Infinit.e ingests and exports JSON
    natively via the REST-based API
    Note: Infinit.e uses Google’s GSON Java library to convert
    JSON to POJOs and back
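The JSON-to-object round trip mentioned in the note can be sketched in Python with the standard library alone. This is an analogue of the idea, not GSON and not Infinit.e code: the `GeoFeature` class and the helper names are hypothetical, and the fields are a small subset borrowed from the feature.geo sample.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class GeoFeature:
    """Typed view of a few fields from a geo record."""
    search_field: str
    country: str
    country_code: str


def from_json(text):
    """Parse a JSON object into a typed GeoFeature."""
    return GeoFeature(**json.loads(text))


def to_json(feature):
    """Serialize a GeoFeature back to a JSON string."""
    return json.dumps(asdict(feature), sort_keys=True)


raw = '{"search_field": "cairo", "country": "Egypt", "country_code": "EG"}'
feature = from_json(raw)
print(feature.country)
print(to_json(feature))
```

Note that `from_json` as written rejects unknown keys, so a schema-less store would need a looser mapping for fields that vary from document to document.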




                                               This is the JSON logo
Why MongoDB? – Reason #3
MongoDB Is Web Scale*




  *Shards are the secret ingredients in the web scale sauce. They just work.
Why MongoDB? – Reason #3
Scalability
• Seriously, MongoDB Scales
  – Harvesting and enriching documents requires
    a lot of disk space
  – MongoDB scales to arbitrary sizes in both
    read/write dimensions
  – Sophisticated sharding keys provide
    powerful/flexible balancing
   BUT building an initial cluster can be complex
    and managing cluster changes is “fiddly”
Why MongoDB? – Reason #4
Integration with Apache Hadoop
•   Hadoop is rapidly becoming the de-facto standard for
    data analytics
     – Open Source, very customizable
     – Proven scalability
     – Java libraries
•   The MongoDB Hadoop Adaptor allows Hadoop to read
    from and write to MongoDB instead of HDFS

Tweeting about MongoDC
• Source:
  http://search.twitter.com/search.rss?q=mongodc
   – Who’s Tweeting?
   – What are they Tweeting?
   – What does basic document analysis of these
     Tweets tell us?
Who’s Tweeting about MongoDC?
How are Tweeters Connected?
What are they Tweeting About?
Sentiment?
Twitter has its Limits…
Thank You!

             Craig Vitter



         www.ikanow.com
        cvitter@ikanow.com

How IKANOW uses MongoDB to help organizations solve really big problems

  • 1. The Open Source document analysis platform Or, how IKANOW uses to help organizations solve really big problems
  • 2. Agenda • What is Document Analysis? • The Infinit.e Solution – Infinit.e’s Architecture – Why and How we use MongoDB • Analyzing #MongoDC • Questions
  • 3. This is what Big Data Looks Like Shamelessly stolen from: http://techbuddha.wordpress.com/2011/09/04/big-data-are-you-creating-a-garbage-dump-or-mountains-of-gold/
  • 4. What is Document Analysis? "Document Analysis refers to computer-assisted analysis of large numbers of documents in order to answer questions about the content of a document set.” Source: http://www.text-tech.com/docanalysis/definition.html
  • 5. Document Analysis • Common document source formats: RSS JSON XML HTML PDF TXT RTF Word PPT Multimedia Files RDBMS Records ETC.
  • 6. Document Analysis • The goal is to: – Extract Entities (people, places, things) – Create Associations between entities (in the form of noun-verb-noun), e.g.: • John Doe lives in Washington, D.C • John Doe is married to Jane Doe • John Doe is a Virgo • John Doe traveled to Mexico on July 6th, 2011 • And…
  • 7. Document Analysis • Turn Who, What, When and Where into a unified data structure that supports data analytics and visualization. Who When people, organizations, past, present, future facilities, company dates What Where events, summaries, city, state, country, facts, themes coordinate
  • 8. The Infinit.e Solution • Infinit.e is an Open Source document discovery and analysis platform that has these very cool Open Source tools lurking under the hood. github.com/ikanow/Infinit.e
  • 9. The Infinit.e Solution Infinit.e is a scalable framework for Visualizing Analyzing Retrieving Enriching Storing Collecting Structured and Unstructured Documents
  • 11. Document Collection • Infinit.e harvests documents from: – URLs – File Shares – Databases
  • 12. Sample RSS Document <rss version="2.0"> <channel> … <item> <title>Mediterranean conference seeks to flourish tourism in Egypt, Tunisia… </title> <link>http://www.pressreleasebureau.com/mediterranean-conference-seeks-to-flourish- tourism-in- egypt-tunisia-report-by-egyptlastminute-com-13613.html</link> <description>Report by egyptlastminute.com CAIRO: On Monday, the countries of the Mediterranean opened a conference seeking to enhance the future of tourism in the region. The conference focuses on the countries of Egypt and Tunisia the most … </description> <dc:publisher>Latest Press Releases | Press Release Bureau</dc:publisher> <dc:creator>unknown</dc:creator> <dc:date>Sat, 21 Apr 2012 00:00:00 GMT</dc:date> </item> … </channel> </rss>
  • 15. Document DBs and Collections
  • 16. Document Metadata • doc_metadata.metadata { "_id" : ObjectId("4f93638e0cf212156d0559d2"), "title" : "Mediterranean conference seeks to flourish tourism in Egypt, Tunisia ...", "url" : "http://www.pressreleasebureau.com/mediterranean-conference-seeks-to-flourish-tourism-in-egypt-tunisia-report-by-egyptlastminute-com-13613.html", "description" : "Report by egyptlastminute.com CAIRO: On Monday, the countries of the Mediterranean opened a conference seeking to enhance the future of tourism in the region. The conference focuses on the countries of Egypt and Tunisia; the most ...", "created" : ISODate("2012-04-22T01:49:02Z"), "metadata" : {…}, "associations" : […], "entities" : […], ... }
  • 17. Harvested Document Metadata • doc_metadata.metadata.metadata (Note: It is okay to laugh at this) "metadata" : { "location" : [ { "region" : "South Asia", "citystateprovince" : { "stateprovince" : "Rolpa", "city" : "Newang" }, "country" : "Nepal" } ], "icn" : [ "200573487" ], "incidentdate" : [ "07/25/2005" ], "organization" : [ "Communist Party of Nepal (Maoist)/United People's Front" ], ... },
  • 18. Document Enrichment • Infinit.e supports the extraction of entities and creation of associations using a combination of built-in enrichment libraries and third-party NLP APIs including:
  • 19. Harvested Entities • feature.entity { "_id" : ObjectId("4f9189d48baf188282a1c9ef"), "alias" : [ "Zine el Abidine Ben Ali", "Zine El Abidine Ben Ali", "Zine el Abidine ben Ali" ], "batch_resync" : true, "communityId" : ObjectId("4f8f138103644ee8003bf518"), "db_sync_doccount" : NumberLong(143), "db_sync_time" : "1338751174988", "dimension" : "Who", "disambiguated_name" : "Zine El Abidine Ben Ali", "doccount" : 152, "index" : "zine el abidine ben ali/person", "totalfreq" : 353, "type" : "Person" }
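In the sample entity, the `index` value looks like the lower-cased `disambiguated_name` joined to the lower-cased `type` with a slash. The Python sketch below assumes that rule purely from this one record; it is a guess at the convention, not Infinit.e's documented behavior, and `entity_index` is a hypothetical helper.

```python
def entity_index(name, entity_type):
    """Build a lookup key of the form "name/type", lower-cased (assumed convention)."""
    return "{}/{}".format(name.strip().lower(), entity_type.strip().lower())


# The three aliases from the sample record differ only in capitalization.
aliases = [
    "Zine el Abidine Ben Ali",
    "Zine El Abidine Ben Ali",
    "Zine el Abidine ben Ali",
]

# Under the assumed rule, every alias collapses to a single key.
keys = {entity_index(alias, "Person") for alias in aliases}
```

If the rule holds, a key like this gives case variants of a name one shared identity, which would explain why the record carries a single `index` alongside several `alias` values.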
  • 21. Harvested Associations • feature.association { "_id" : ObjectId("4f9189d48baf188282a1ca24"), "assoc_type" : "Fact", "communityId" : ObjectId("4f8f138103644ee8003bf518"), "db_sync_doccount" : NumberLong(70), "db_sync_time" : "1338491609281", "doccount" : NumberLong(73), "entity1" : [ "zine el abidine ben ali", "zine el abidine ben ali/person" ], "entity1_index" : "zine el abidine ben ali/person", "entity2" : [ "president", "president/position" ], "entity2_index" : "president/position", "index" : "5e3fff27ddb78d6873ccfc77cf05c52f", "verb" : [ "career", "current", "past" ], "verb_category" : "career" }
  • 23. Geolocation of Entities/Events • feature.geo { "_id" : ObjectId("4d8bb5efbe07bb4f7036c82e"), "search_field" : "cairo", "country" : "Egypt", "country_code" : "EG", "city" : "cairo", "region" : "Al Qahirah", "region_code" : "EG11", "population" : 7734602, "latitude" : "30.05", "longitude" : "31.25", "geoindex" : { "lat" : 30.05, "lon" : 31.25 } } Note: MongoDB 2d Index (on "geoindex")
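The sample record stores coordinates twice: as strings in `latitude` and `longitude`, and as numbers in the `geoindex` sub-document that the 2d index covers. A minimal Python sketch of deriving one from the other follows; the `with_geoindex` helper is hypothetical, and the index specification assumes PyMongo would be the client, which the slides do not state.

```python
def with_geoindex(record):
    """Return a copy of a geo record with a numeric "geoindex" sub-document added."""
    out = dict(record)
    out["geoindex"] = {
        "lat": float(record["latitude"]),
        "lon": float(record["longitude"]),
    }
    return out


# Index specification in the (field, type) list form that PyMongo's
# create_index accepts; "2d" is MongoDB's legacy planar geospatial index type.
GEO_INDEX_SPEC = [("geoindex", "2d")]

cairo = with_geoindex({
    "search_field": "cairo",
    "country": "Egypt",
    "latitude": "30.05",
    "longitude": "31.25",
})
```

With a live collection, the specification would be passed as `collection.create_index(GEO_INDEX_SPEC)`; that call is left out here so the sketch runs without a database.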
  • 25. Who, What, Where and When
  • 26. Why MongoDB? – Reason #1 Document-Oriented Storage • MongoDB’s document-oriented storage (i.e. schema-less) is perfectly suited to the data design requirements of a system that needs to ingest a wide variety of structured and unstructured document formats and normalize them into one unified, semi-structured format
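One way to picture "one unified, semi-structured format" is a document with a few shared fields plus a free-form `metadata` bag for whatever each source provides. The Python sketch below assumes that split, loosely following the sample documents on earlier slides; `normalize`, `CORE_FIELDS`, and the example records are hypothetical.

```python
from datetime import datetime, timezone

# Fields assumed to be common to every harvested document.
CORE_FIELDS = ("title", "url", "description")


def normalize(source_record, created=None):
    """Split a harvested record into shared core fields plus a free-form metadata bag."""
    doc = {field: source_record.get(field) for field in CORE_FIELDS}
    doc["created"] = created or datetime.now(timezone.utc)
    # Source-specific fields are kept as lists, like the sample's metadata values.
    doc["metadata"] = {
        key: value if isinstance(value, list) else [value]
        for key, value in source_record.items()
        if key not in CORE_FIELDS
    }
    return doc


# Two differently shaped inputs (both made up) end up with the same top-level keys.
rss_doc = normalize({"title": "Example feed item", "url": "http://example.com/a",
                     "description": "Example summary"})
db_doc = normalize({"title": "Example database row", "icn": "200573487",
                    "incidentdate": "07/25/2005"})
```

Because no schema has to be declared up front, the database row's extra fields are stored without any change to how the feed item is stored.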
  • 27. Why MongoDB? – Reason #2 JSON • The standard language of open document analysis – JSON is a common interchange format supported by tools like Elasticsearch and SaaS NLP engines – BSON (Binary JSON) is MongoDB’s native data format – Infinit.e ingests and exports JSON natively via the REST-based API • Note: Infinit.e uses Google’s GSON Java library to convert JSON to POJOs and back • This is the JSON logo
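The JSON-to-object round trip mentioned in the note can be sketched in Python with a dataclass standing in for a POJO. This is an analogue only: Infinit.e does this in Java with GSON, and the `EntityPojo` class, its fields, and the two helper functions are hypothetical.

```python
import json
from dataclasses import asdict, dataclass, fields


@dataclass
class EntityPojo:
    """Hypothetical typed view of a few fields from an entity record."""
    disambiguated_name: str
    type: str
    dimension: str
    doccount: int = 0


def from_json(text):
    """Parse JSON text into an EntityPojo, ignoring keys the class does not declare."""
    raw = json.loads(text)
    known = {f.name for f in fields(EntityPojo)}
    return EntityPojo(**{k: v for k, v in raw.items() if k in known})


def to_json(entity):
    """Serialize an EntityPojo back to JSON text with stable key order."""
    return json.dumps(asdict(entity), sort_keys=True)


text = ('{"disambiguated_name": "Zine El Abidine Ben Ali", "type": "Person", '
        '"dimension": "Who", "doccount": 152, "totalfreq": 353}')
entity = from_json(text)
```

Converting `entity` with `to_json` and parsing the result again yields an equal object, which is the property a REST API that both ingests and exports JSON relies on.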
  • 28. Why MongoDB? – Reason #3 MongoDB Is Web Scale* *Shards are the secret ingredients in the web scale sauce. They just work.
  • 29. Why MongoDB? – Reason #3 Scalability • Seriously, MongoDB Scales – Harvesting and enriching documents requires a lot of disk space – MongoDB scales to arbitrary sizes in both read/write dimensions – Sophisticated sharding keys provide powerful/flexible balancing • BUT building an initial cluster can be complex and managing cluster changes is “fiddly”
  • 30. Why MongoDB? – Reason #4 Integration with Apache Hadoop • Hadoop is rapidly becoming the de facto standard for data analytics – Open Source, very customizable – Proven scalability – Java libraries • The MongoDB Hadoop Adaptor allows Hadoop to read from and write to MongoDB instead of HDFS
  • 31. Tweeting about MongoDC • Source: http://search.twitter.com/search.rss?q=mongodc – Who’s Tweeting? – What are they Tweeting? – What does basic document analysis of these Tweets tell us?
  • 33. How are Tweeters Connected?
  • 34. What are they Tweeting About?
  • 36. Twitter has its Limits…
  • 37. Thank You! Craig Vitter www.ikanow.com cvitter@ikanow.com