Why Real-Time Analytics?

Explore the technology that enables real-time analytics and streaming data processing, and how it differs from the world of Hadoop and batch analytics.

The Chimp Way: Using the right tool for each job
At Infochimps, we abide by the philosophy that you should use the right tool for each job. Why lock yourself into one set of technologies or techniques? Depending on what you are trying to accomplish - the questions you want to ask of your data, or the applications and visualizations you build on top of that data - different technologies are best suited to each task, and you should have the best tools at your fingertips for all of them. Infochimps excels at systems and technology integration -- we can take your existing tools, add powerful new ones from our kit, and glue them together into a unified whole.

We also strongly embrace open source technologies as part of a complete data solution. Not only do you benefit from the active participation of the open source community -- you aren't limited to a proprietary vendor's finite feature set and integration connectors. We use Hadoop, Elasticsearch, Flume, Ironfan, and Wukong, among other world-class open source tools that work flexibly with each other and the rest of the tools in your enterprise.




© 2012 Infochimps, Inc. All rights reserved.                                                         1
The Hadoop & NoSQL conundrum
Hadoop is a powerful framework for Big Data analytics. It simplifies the analysis of massive sets of
data by distributing the computation load across many processes and machines. Hadoop embraces
a map/reduce framework, which means analytics are performed as batch processes. Depending on
the quantity of data and the complexity of the computation, running a set of Hadoop jobs could take
anywhere from a few minutes to many days. Batch analytics tool sets like Hadoop are great for one-off reports, recurring scheduled runs, or dedicated data exploration environments. However, waiting hours for the analysis you need means you cannot get real-time answers from your data. Hadoop analysis ends up being a rear-view mirror instead of a pulse on the moment.


NoSQL databases are extremely powerful, but come with certain challenges of their own
At Infochimps we use Hadoop to run map/reduce jobs against scalable NoSQL data stores like HBase, Cassandra, or Elasticsearch. These databases are extremely good at enabling fast queries against many terabytes of data, but each makes certain tradeoffs to enable this ability. One major tradeoff, common to all three of these examples, is the inability to perform SQL-like joins -- combining data from one database table with data from another.


The usual way we work around this tradeoff is denormalization. Imagine we're asking a question such as "Find all posts that contain the phrase 'Coca-Cola' from all authors based in Spokane, Washington". In a traditional SQL relational database, a table of "posts" would join against a table of "authors" using a shared key such as an author's ID number. In NoSQL databases, denormalization consists of inserting a copy of the author's data into each row of their posts. Rather than joining the posts table with the authors table during the query, SQL-style, all the authors' data is already contained within the posts table before the query runs.
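The denormalized-write idea can be sketched in a few lines. This is a toy illustration; the table layout and field names are invented for this example, not taken from any of the databases above:

```python
# Toy illustration of denormalization: each post row carries a copy of its
# author's fields, so the query needs no join. The schema is invented for
# this example.

authors = {
    42: {"name": "Ada", "city": "Spokane", "state": "WA"},
}

posts_table = []  # stands in for a NoSQL "posts" table

def write_post_denormalized(author_id, text):
    """Write a post row that already embeds its author's data."""
    author = authors[author_id]
    posts_table.append({
        "author_id": author_id,
        "text": text,
        # Denormalized copies of the author's fields:
        "author_name": author["name"],
        "author_city": author["city"],
        "author_state": author["state"],
    })

write_post_denormalized(42, "Enjoying a Coca-Cola downtown")

# The query is now a single-table scan -- no join required:
matches = [
    p for p in posts_table
    if "Coca-Cola" in p["text"]
    and p["author_city"] == "Spokane"
    and p["author_state"] == "WA"
]
```

The cost of this design is duplicated author data in every post row; the benefit is that every query touches only one table.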


The question then becomes: when should the denormalization of our NoSQL database occur? One option is to use Hadoop to "backfill" denormalized data from normalized tables before running these kinds of queries. This approach is perfectly workable, but it suffers from the same "rear-view mirror" problem as Hadoop-based batch analytics -- we still cannot perform complex queries of real-time data. What if we could write denormalized data on the fly: write each incoming Twitter post into a row in the posts table, and augment that row with information on the author in real time? This would keep all data denormalized at all times, always ready for downstream applications to run complex queries and generate rich, real-time business insights. Real-time analytics and stream processing make this possible.

Real-time + Big Data = Stream Processing
In situations where you need to make well-informed, real-time decisions, good data isn’t enough. It
must be timely and actionable. As a mutual fund operator, you can’t wait hours to analyze whether or
not it’s the right moment to sell 200,000 stock shares. As CMO, you can’t wait days to see if there is a
PR crisis occurring around your brand. The time window for data analysis is shrinking, and you need
a different set of tools to get these on-the-fly answers.


Batch Versus Streaming
Consider two hypothetical sandwich makers. Each company makes great sandwiches, but chooses to
deliver them to their customers either in batches or in near real-time.




The Batch Sub Shop can provide large quantities of sandwiches by leveraging many people to accomplish the overall project. Similarly, batch analytics can leverage multiple machines to accomplish a set of analytics jobs. By adding more resources, we can increase the speed with which the tasks are accomplished, but at a higher cost.


Contrast that with the Streaming Sub Shop, which doesn't deliver a huge set of sandwiches all at once, but does quickly create sandwiches on the fly. The process aims to get a sandwich in the customer's hand as soon as possible. Real-time analytics works the same way, processing data the moment it is collected. If data is coming in too quickly, we can flexibly increase the resources that support our real-time workflow. Is the toasting process the bottleneck of our production line? We easily add a couple of additional toasters.


As you can imagine, the ideal sandwich company probably combines the ability to cater large orders ahead of time with a made-to-order, in-store business. Likewise, your organization can leverage both batch analytics and real-time analytics depending on your business needs. Batch analytics is the most efficient way to process a large quantity of data in a non-time-sensitive manner. Real-time analytics and stream processing are the answer when the timeliness of your insights is important, when you need to scalably process a very large influx of live data, or when NoSQL databases cannot answer the questions you are asking.




How Does Real-Time Analytics Work?




1. Collect real-time data. Real-time data is being generated all the time. If you are a mutual fund operator, it's real-time stock price data. If you are a CMO, it's real-time social media posts and Google search results. Typically this is live streaming data: the moment the stock price changes, we can grab that data point, like a faucet of running water. We collect live data by "hooking a hose up" to the faucet to capture that information in real time. Many different terms exist for these "hoses", including scrapers, collectors, agents, and listeners.
2. Process the data as it flows in. The key to real-time analytics is that we cannot wait until later to act on our data; we must analyze it instantly. Stream processing (also known as streaming data processing) is the term for operating on data the instant it is collected. Actions you can perform in real time include splitting data, merging it, doing calculations, connecting it with outside data sources, forking data to multiple destinations, and more.
3. Reports and dashboards access processed data. Once data has been processed, it is reliably delivered to the databases that power your reports, dashboards, and ad-hoc queries. Just seconds after the data was collected, it is visible in your charts and tables. Because real-time analytics and stream processing are flexible frameworks, you can use whatever tools you prefer, whether that's Tableau, Pentaho, GoodData, a custom application, or something else. Integration is Infochimps' forte.
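The three steps above can be sketched with plain Python generators. This is purely illustrative; a real deployment would use dedicated transport tools like Flume rather than in-process functions, and all names here are invented:

```python
# Minimal sketch of the collect -> process -> deliver pipeline described
# above, using plain Python generators.

def collect(source):
    """Step 1: 'hook a hose up' to a live source (here, just an iterable)."""
    for event in source:
        yield event

def process(events):
    """Step 2: transform each event the moment it arrives."""
    for event in events:
        event = dict(event)
        event["symbol"] = event["symbol"].upper()
        event["alert"] = event["price"] > 100.0  # toy threshold rule
        yield event

def deliver(events, sink):
    """Step 3: write processed events to the store backing your dashboards."""
    for event in events:
        sink.append(event)

stock_ticks = [
    {"symbol": "acme", "price": 101.5},
    {"symbol": "xyz", "price": 99.0},
]
dashboard_db = []  # stands in for the database behind your reports
deliver(process(collect(stock_ticks)), dashboard_db)
```

Because each stage consumes events one at a time, nothing waits for a batch to accumulate; an event is queryable moments after it is collected.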




What Can You Do With Stream Processing?
Augment
   • Enhance your sales leads - IP addresses of visitors to your website are augmented with the company name associated with that visitor if they are coming from an enterprise. Email addresses get linked to Twitter and Facebook handles to help your sales team leverage social selling.
   • Real-time social media analytics - Tweets that mention the brands you are tracking are augmented with a sentiment score (how positive or negative the comment was) and an influencer score (such as Klout). Know instantly if positive news breaks or a PR crisis arises. Instantly gain insight into how influential people are and on what topics.
Process and Transform
   • On-the-fly analytics reporting - Reformat a tweet on the fly to fit an agency's data model so that the data is visible in our reporting application immediately upon landing in the database.
   • SQL-like data queries - Implement a denormalization policy to allow complex JOIN-like queries in real time in downstream analytics applications.
   • Stock price algorithms - Implement your stock analyzer algorithm mid-stream. The instant an updated stock price is received, the data is processed through the algorithm and placed in your reporting database.
Calculate
   • Usage monitoring - Track the number of social media posts mentioning your client company's brand. See at any given moment how much a brand is buzzing, and even set up tiered pricing based on how many social posts you are collecting on a client's behalf.
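As a toy illustration of the augmentation pattern, here is a mid-stream sentiment scorer. The word lists and scoring rule are invented placeholders, not a real sentiment model:

```python
# Illustrative sketch of mid-stream augmentation: attach a sentiment score
# to each brand mention as it flows past. Word lists are toy placeholders.

POSITIVE = {"love", "great", "awesome"}
NEGATIVE = {"hate", "terrible", "broken"}

def augment_with_sentiment(tweet):
    """Return a copy of the tweet with a naive word-count sentiment score."""
    words = set(tweet["text"].lower().split())
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    return {**tweet, "sentiment": score}

stream = [
    {"user": "@fan", "text": "love this brand and its great service"},
    {"user": "@critic", "text": "terrible support and broken product"},
]
augmented = [augment_with_sentiment(t) for t in stream]
```

The original event is untouched; the augmented copy carries the extra field downstream, so every report built on the stream sees the score without re-processing.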




Real-time analytics with the Infochimps Platform
Apache Flume
While initially built for log collection and routing, Flume has evolved to confidently serve the roles of general data transport and streaming data processing. Flume does more than reliably deliver data from a source to a destination: with the right optimizations, a single Flume system can ingest many terabytes of data per day from thousands of data sources. As data flows in, you can act on that data -- add additional data, do calculations, run algorithms, split data, merge data, and so on. In Flume lingo, these actions are powered by scripts called decorators, which perform the stream processing required for real-time analytics.
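Flume's actual decorators are Java components; as a language-neutral sketch of the idea, each decorator wraps the event stream and transforms or forks events as they pass through. All names below are illustrative, not part of Flume's API:

```python
# Language-neutral sketch of the decorator idea: each stage wraps the
# stream and transforms or forks events in flight. Names are invented.

def geo_decorator(events):
    """Attach a (placeholder) geo lookup to each event."""
    for e in events:
        region = "US" if e.get("ip", "").startswith("12.") else "unknown"
        yield {**e, "geo": region}

def split_decorator(events):
    """Fork each event to two destinations by emitting tagged copies."""
    for e in events:
        yield {**e, "dest": "archive"}
        yield {**e, "dest": "dashboard"}

raw = [{"ip": "12.0.0.1", "msg": "page view"}]
out = list(split_decorator(geo_decorator(raw)))
# Each raw event becomes two decorated events, one per destination.
```

Chaining works because each decorator consumes the previous one's output stream, so new processing steps compose without touching existing ones.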


Infochimps Data Delivery Service
Infochimps uses Apache Flume for the Data Delivery Service (DDS), our reliable data transport and real-time analytics engine for the Infochimps Platform. Infochimps DDS adds important enhancements to open-source Flume, including:

     •	 Seamless integrations with your existing environment
        and data sources
     •	 Optimizations for highly scalable data collection and
        distributed ETL (extract, transform, load)
     •	 Tool set for rapid development of decorators which
        perform the stream processing
     •	 Flexible delivery framework to send data to any type
        and quantity of databases or file systems
     •	 Rapid solution development and deployment, along with
        our expert Big Data methodology and best practices


Infochimps has extensive experience implementing the DDS, both for clients and for our internal data flows, including massive Twitter scrapes, the Foursquare firehose, customer purchase data, product pricing data, and much more.

Single-purpose ETL solutions are rapidly being replaced by multi-node, multi-purpose data integration platforms -- the universal glue that connects systems together and makes Big Data analytics feasible. Today, companies are taking advantage of Amazon Web Services for some processes, on-premise or outsourced data centers for others, NoSQL databases, relational databases, cloud storage -- the list goes on. The Data Delivery Service is compatible with all of these environments, making your data transport needs an implementation detail, not an analytics bottleneck.


About Infochimps
Our mission is to make the world's data more accessible. Infochimps helps companies understand their data. We provide tools and services that connect their internal data, leverage the power of cloud computing and new technologies such as Hadoop, and provide a wealth of external datasets, which organizations can connect to their own data.


Contact Us
Infochimps, Inc.
1214 W 6th St. Suite 202
Austin, TX 78703

1-855-DATA-FUN (1-855-328-2386)

www.infochimps.com
info@infochimps.com

Twitter: @infochimps




Get a free Big Data consultation
Let's talk Big Data in the enterprise!

Get a free conference with leading Big Data experts regarding your enterprise Big Data project. Meet with leading data scientists Flip Kromer and/or Dhruv Bansal to talk shop about your project objectives, design, infrastructure, and tools. Find out how other companies are solving similar problems. Learn best practices and get recommendations -- free.




© 2012 Infochimps, Inc. All rights reserved.                                                        8

More Related Content

More from Infochimps, a CSC Big Data Business

[Webinar] Measure Twice, Build Once: Real-Time Predictive Analytics
[Webinar] Measure Twice, Build Once: Real-Time Predictive Analytics[Webinar] Measure Twice, Build Once: Real-Time Predictive Analytics
[Webinar] Measure Twice, Build Once: Real-Time Predictive AnalyticsInfochimps, a CSC Big Data Business
 
Case Study: Digital Agency Turbocharges Social Listening and Insights with t...
Case Study: Digital  Agency Turbocharges Social Listening and Insights with t...Case Study: Digital  Agency Turbocharges Social Listening and Insights with t...
Case Study: Digital Agency Turbocharges Social Listening and Insights with t...Infochimps, a CSC Big Data Business
 

More from Infochimps, a CSC Big Data Business (15)

[Webinar] Measure Twice, Build Once: Real-Time Predictive Analytics
[Webinar] Measure Twice, Build Once: Real-Time Predictive Analytics[Webinar] Measure Twice, Build Once: Real-Time Predictive Analytics
[Webinar] Measure Twice, Build Once: Real-Time Predictive Analytics
 
AHUG Presentation: Fun with Hadoop File Systems
AHUG Presentation: Fun with Hadoop File SystemsAHUG Presentation: Fun with Hadoop File Systems
AHUG Presentation: Fun with Hadoop File Systems
 
Report: CIOs & Big Data
Report: CIOs & Big DataReport: CIOs & Big Data
Report: CIOs & Big Data
 
Infographic: CIOs & Big Data
Infographic: CIOs & Big DataInfographic: CIOs & Big Data
Infographic: CIOs & Big Data
 
5 Big Data Use Cases for 2013
5 Big Data Use Cases for 20135 Big Data Use Cases for 2013
5 Big Data Use Cases for 2013
 
451 Research Impact Report
451 Research Impact Report451 Research Impact Report
451 Research Impact Report
 
[Webinar] High Speed Retail Analytics
[Webinar] High Speed Retail Analytics[Webinar] High Speed Retail Analytics
[Webinar] High Speed Retail Analytics
 
Infochimps + CloudCon: Infinite Monkey Theorem
Infochimps + CloudCon: Infinite Monkey TheoremInfochimps + CloudCon: Infinite Monkey Theorem
Infochimps + CloudCon: Infinite Monkey Theorem
 
Taming the Big Data Tsunami using Intel Architecture
Taming the Big Data Tsunami using Intel ArchitectureTaming the Big Data Tsunami using Intel Architecture
Taming the Big Data Tsunami using Intel Architecture
 
The Other Way of Doing Big Data
The Other Way of Doing Big DataThe Other Way of Doing Big Data
The Other Way of Doing Big Data
 
Real-Time Analytics: The Future of Big Data in the Agency
Real-Time Analytics: The Future of Big Data in the AgencyReal-Time Analytics: The Future of Big Data in the Agency
Real-Time Analytics: The Future of Big Data in the Agency
 
Ironfan: Your Foundation for Flexible Big Data Infrastructure
Ironfan: Your Foundation for Flexible Big Data InfrastructureIronfan: Your Foundation for Flexible Big Data Infrastructure
Ironfan: Your Foundation for Flexible Big Data Infrastructure
 
The Power of Elasticsearch
The Power of ElasticsearchThe Power of Elasticsearch
The Power of Elasticsearch
 
Case Study: Digital Agency Turbocharges Social Listening and Insights with t...
Case Study: Digital  Agency Turbocharges Social Listening and Insights with t...Case Study: Digital  Agency Turbocharges Social Listening and Insights with t...
Case Study: Digital Agency Turbocharges Social Listening and Insights with t...
 
Meet the Infochimps Platform
Meet the Infochimps PlatformMeet the Infochimps Platform
Meet the Infochimps Platform
 

Recently uploaded

Designing A Time bound resource download URL
Designing A Time bound resource download URLDesigning A Time bound resource download URL
Designing A Time bound resource download URLRuncy Oommen
 
AI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity WebinarAI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity WebinarPrecisely
 
How Accurate are Carbon Emissions Projections?
How Accurate are Carbon Emissions Projections?How Accurate are Carbon Emissions Projections?
How Accurate are Carbon Emissions Projections?IES VE
 
VoIP Service and Marketing using Odoo and Asterisk PBX
VoIP Service and Marketing using Odoo and Asterisk PBXVoIP Service and Marketing using Odoo and Asterisk PBX
VoIP Service and Marketing using Odoo and Asterisk PBXTarek Kalaji
 
Machine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdfMachine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdfAijun Zhang
 
Computer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsComputer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsSeth Reyes
 
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...UbiTrack UK
 
Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™Adtran
 
Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.YounusS2
 
UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6DianaGray10
 
Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )Brian Pichman
 
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve DecarbonizationUsing IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve DecarbonizationIES VE
 
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdfUiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdfDianaGray10
 
UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1DianaGray10
 
Nanopower In Semiconductor Industry.pdf
Nanopower  In Semiconductor Industry.pdfNanopower  In Semiconductor Industry.pdf
Nanopower In Semiconductor Industry.pdfPedro Manuel
 
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...Will Schroeder
 
Videogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfVideogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfinfogdgmi
 
Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024D Cloud Solutions
 
Empowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintEmpowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintMahmoud Rabie
 
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAAnypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAshyamraj55
 

Recently uploaded (20)

Designing A Time bound resource download URL
Designing A Time bound resource download URLDesigning A Time bound resource download URL
Designing A Time bound resource download URL
 
AI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity WebinarAI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity Webinar
 
How Accurate are Carbon Emissions Projections?
How Accurate are Carbon Emissions Projections?How Accurate are Carbon Emissions Projections?
How Accurate are Carbon Emissions Projections?
 
VoIP Service and Marketing using Odoo and Asterisk PBX
VoIP Service and Marketing using Odoo and Asterisk PBXVoIP Service and Marketing using Odoo and Asterisk PBX
VoIP Service and Marketing using Odoo and Asterisk PBX
 
Machine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdfMachine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdf
 
Computer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsComputer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and Hazards
 
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
 
Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™
 
Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.
 
UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6
 
Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )
 
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve DecarbonizationUsing IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
 
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdfUiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
 
UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1
 
Nanopower In Semiconductor Industry.pdf
Nanopower  In Semiconductor Industry.pdfNanopower  In Semiconductor Industry.pdf
Nanopower In Semiconductor Industry.pdf
 
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
 
Videogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfVideogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdf
 
Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024
 
Empowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintEmpowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership Blueprint
 
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAAnypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
 

Why Real-Time Analytics?

  • 1. Why Real-Time Analytics? The Chimp Way: Using the right tool for each job Explore the At Infochimps, we abide by the philosophy that you should use the right tool for each job. Why lock in to one set of technologies technology that or techniques? Depending on what you are trying to accomplish - the questions you want to ask of your data, or the applications enables real-time and visualizations you build on top of that data - different tech- analytics and nologies are best suited for each unique task. You should have all the best tools at your fingertips for each task. Infochimps excels at streaming data systems and technology integration -- we can take your existing tools, add powerful new ones from our kit, and glue them together processing, and into a unified whole. how it differs from We also strongly embrace open source technologies as part of the world of a complete data solution. Not only do you benefit from the active participation of the open source community -- you aren’t limited Hadoop and to a proprietary vendor’s finite feature set and integration connec- batch analytics. tors. We use Hadoop, Elasticsearch, Flume, Ironfan, and Wu- kong, among other world-class open source tools that work flex- ibly with each other and the rest of the tools in your enterprise. © 2012 Infochimps, Inc. All rights reserved. 1
  • 2. The Hadoop & NoSQL conundrum Hadoop is a powerful framework for Big Data analytics. It simplifies the analysis of massive sets of data by distributing the computation load across many processes and machines. Hadoop embraces a map/reduce framework, which means analytics are performed as batch processes. Depending on the quantity of data and the complexity of the computation, running a set of Hadoop jobs could take anywhere from a few minutes to many days. Batch analytics tool sets like Hadoop are great for doing one-off reports, a recurring schedule of periodic runs, or setting up dedicated data exploration envi- ronments. However, waiting hours for the analysis you need means you aren’t able to get real-time answers from your data. Hadoop analysis ends up being a rear view mirror instead of a pulse on the moment. NoSQL databases are extremely powerful, but come with certain challenges of their own At Infochimps we use Hadoop to run map/reduce jobs against scalable, NoSQL data stores like HBase, Cassandra, or Elasticsearch. These databases are extremely good at enabling fast queries against many terabytes of data, but each makes certain tradeoffs to enable this ability. One major tradeoff, common across all three of these examples, is the inability to do SQL-like joins -- the ability to combine data from one database table with data from another table. The usual way we work around this tradeoff is to practice denormalization. Imagine we’re asking a question such as “Find all posts that contain the phrase ‘Cola-Cola’ from all authors based in Spo- kane, Washington”. In a traditional relational database like SQL, a table of “posts” would join against a table of “authors” using a shared key like an author’s ID number. In NoSQL databases, denormal- ization consists of inserting a copy of the author into each row of their posts. 
Rather than joining the posts table with the authors table during the query a la SQL, all the authors’ data is already contained within the posts table before the query. The question then becomes when should the denormalization of our NoSQL database occur? One option is to use Hadoop to “backfill” denormalized data from normalized tables before running these kinds of queries. This approach is perfectly workable but it suffers from the same “rear-view mirror” problem of doing Hadoop-based batch analytics -- we still cannot perform complex queries of real- time data. What if we could write denormalized data on the fly: write each incoming Twitter post into a row in the posts table, and augment that row with information on the author in real-time. This would keep all data denormalized at all times, always ready for downstream applications to run complex queries and generate the rich, real-time business insights. Real-time analytics and stream processing make this possible. © 2012 Infochimps, Inc. All rights reserved. 2
Real-time + Big Data = Stream Processing

In situations where you need to make well-informed, real-time decisions, good data isn't enough. It must be timely and actionable. As a mutual fund operator, you can't wait hours to analyze whether or not it's the right moment to sell 200,000 shares of stock. As CMO, you can't wait days to see whether a PR crisis is unfolding around your brand. The time window for data analysis is shrinking, and you need a different set of tools to get these on-the-fly answers.

Batch Versus Streaming

Consider two hypothetical sandwich makers. Each company makes great sandwiches, but one delivers them to its customers in batches and the other in near real time.
The Batch Sub Shop can provide large quantities of sandwiches by leveraging many people to accomplish the overall project. Similarly, batch analytics can leverage multiple machines to accomplish a set of analytics jobs. By adding more resources, we can increase the speed with which the tasks are accomplished, but at a higher cost.

Contrast that with the Streaming Sub Shop, which doesn't deliver a huge set of sandwiches all at once, but quickly creates sandwiches on the fly. The process aims to get a sandwich into the customer's hands as soon as possible. Real-time analytics works the same way, processing data the moment it is collected. If the data is coming in too quickly, we can flexibly increase the resources that support our real-time workflow. Is the toasting process the bottleneck of our production line? We simply add a couple of additional toasters.

As you can imagine, the ideal sandwich company probably combines the ability to cater large orders ahead of time with an in-store, made-to-order business. Likewise, your organization can leverage both batch analytics and real-time analytics depending on your business needs. Batch analytics is the most efficient way to process a large quantity of data in a non-time-sensitive manner. Real-time analytics and stream processing are the answer when the timeliness of your insights matters, when you need to scalably process a very large influx of live data, or when NoSQL databases cannot answer the questions you are asking.
How Does Real-Time Analytics Work?

1. Collect real-time data. Real-time data is being generated all the time. If you are a mutual fund operator, it's real-time stock price data. If you are a CMO, it's real-time social media posts and Google search results. Typically this is live streaming data: the moment the stock price changes, we can grab that data point -- like a faucet of running water. We collect live data by "hooking a hose up" to the faucet to capture that information in real time. A lot of different vocabulary exists to describe these "hoses," including scrapers, collectors, agents, and listeners.

2. Process the data as it flows in. The key to real-time analytics is that we cannot wait until later to do things to our data; we must analyze it instantly. Stream processing (also known as streaming data processing) is the term for acting on data instantly as it is collected. Actions you can perform in real time include splitting data, merging it, doing calculations, connecting it with outside data sources, forking it to multiple destinations, and more.

3. Reports and dashboards access processed data. Once the data has been processed, it is reliably delivered to the databases that power your reports, dashboards, and ad-hoc queries. Just seconds after being collected, the data is visible in your charts and tables. Since real-time analytics and stream processing are flexible frameworks, you can use whatever tools you prefer, whether that's Tableau, Pentaho, GoodData, a custom application, or something else. Integration is Infochimps' forte.
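The three steps above can be sketched end to end in a few lines. This is a toy sketch only: the tick values, the moving-average calculation, and the in-memory "reporting database" are invented stand-ins for a real transport and a real data store:

```python
# Toy sketch of collect -> process-as-it-flows -> deliver. All names are
# illustrative; production systems use a transport (e.g. Flume) and a real
# database, not in-memory lists.

def collect():
    """Step 1: a stand-in for a live 'hose' of stock-price ticks."""
    for price in [101.0, 101.5, 99.8, 100.2]:
        yield {"symbol": "ACME", "price": price}

REPORTING_DB = []  # Step 3: the store your dashboards would read from

def process(stream):
    """Step 2: act on each event the instant it arrives."""
    window = []
    for tick in stream:
        window.append(tick["price"])
        # Example on-the-fly calculation: 3-tick moving average.
        recent = window[-3:]
        tick["moving_avg"] = round(sum(recent) / len(recent), 2)
        REPORTING_DB.append(tick)  # delivered moments after collection

process(collect())
```

Each tick is enriched and delivered before the next one is even read -- the essence of stream processing, as opposed to accumulating the whole batch first.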
What Can You Do With Stream Processing?

Augment

• Enhance your sales leads - IP addresses of visitors to your website are augmented with the company name associated with each visitor if they are coming from an enterprise. Email addresses get linked to Twitter and Facebook handles to help your sales team leverage social selling.

• Real-time social media analytics - Tweets that mention the brands you are tracking are augmented with a sentiment score (how positive or negative the comment was) and an influencer score (such as Klout). Know instantly if positive news breaks or a PR crisis arises. Instantly gain insight into how influential people are and on what topics.

Process and Transform

• On-the-fly analytics reporting - Reformat a tweet on the fly to fit an agency's data model, so the data is visible in the reporting application the moment it lands in the database.

• SQL-like data queries - Implement a denormalization policy that allows downstream analytics applications to run complex, JOIN-like queries in real time.

• Stock price algorithms - Run your stock analyzer algorithm mid-stream. The instant an updated stock price is received, it is processed through the algorithm and placed in your reporting database.

Calculate

• Usage monitoring - Track the number of social media posts mentioning your client company's brand. See at any given moment how much a brand is buzzing, and even set up tiered pricing based on how many social posts you are collecting on a client's behalf.
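The "augment" pattern can be illustrated with a deliberately naive sketch. The keyword-based scorer below is an invented stand-in for a real sentiment or influencer service (such as Klout); all function and field names are hypothetical:

```python
# Hedged sketch: augmenting tweets mid-stream with a sentiment score.
# A real system would call a sentiment/influencer API; this crude
# word-count scorer only illustrates where augmentation happens.

POSITIVE = {"love", "great", "awesome"}
NEGATIVE = {"hate", "awful", "broken"}

def sentiment_score(text):
    """Crude sentiment: positive word count minus negative word count."""
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

def augment(tweet):
    """Attach a score to each tweet the moment it is collected."""
    tweet["sentiment"] = sentiment_score(tweet["text"])
    return tweet

tweet = augment({"user": "@fan", "text": "I love this brand great service"})
```

The key point is where the work happens: the score is attached as the tweet flows through the stream, so every row landing in the reporting database already carries it.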
Real-time analytics with the Infochimps Platform

Apache Flume

While initially built for log collection and routing, Flume has evolved to serve confidently in the roles of general data transport and streaming data processing. Flume reliably delivers data from a source to a destination, and with the right optimizations, a single Flume system can ingest many terabytes of data per day from thousands of data sources. As data flows in, you can act on that data: add additional data, do calculations, run algorithms, split data, merge data, and so on. In Flume lingo, these actions are powered by scripts called decorators, which perform the stream processing required for real-time analytics.

Infochimps Data Delivery Service

Infochimps uses Apache Flume for the Data Delivery Service (DDS), our reliable data transport and real-time analytics engine for the Infochimps Platform. Infochimps DDS adds important enhancements to the open-source Flume tool, including:

• Seamless integration with your existing environment and data sources
• Optimizations for highly scalable data collection and distributed ETL (extract, transform, load)
• A tool set for rapid development of the decorators that perform the stream processing
• A flexible delivery framework to send data to any type and quantity of databases or file systems
• Rapid solution development and deployment, along with our expert Big Data methodology and best practices

Infochimps has extensive experience implementing the DDS, both for clients and for our internal data flows, including massive Twitter scrapes, the Foursquare firehose, customer purchase data, product pricing data, and much more.

Single-purpose ETL solutions are rapidly being replaced by multi-node, multi-purpose data integration platforms -- the universal glue that connects systems together and makes Big Data analytics feasible.
Today, companies take advantage of Amazon Web Services for some processes, on-premise or outsourced data centers for others, plus NoSQL databases, relational databases, cloud storage -- the list goes on. Data Delivery Service is compatible with all of these environments, making your data transport needs an implementation detail, not an analytics bottleneck.
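The decorator idea -- small, single-purpose transforms chained in order over each event -- can be sketched as follows. This mimics the pattern only, not Flume's actual decorator API; every name here is invented for illustration:

```python
# Sketch of the decorator pattern: each decorator is a small function that
# takes an event and returns it, modified. Not Flume's real API.

def add_timestamp(event):
    """A real decorator would stamp the current clock time."""
    event["ts"] = "2012-01-01T00:00:00Z"
    return event

def uppercase_symbol(event):
    """Normalize the ticker symbol before it reaches the database."""
    event["symbol"] = event["symbol"].upper()
    return event

# The chain runs in order over every event that flows through.
DECORATORS = [add_timestamp, uppercase_symbol]

def run_chain(event, decorators=DECORATORS):
    """Apply each decorator in turn, like a chain of stream processors."""
    for decorate in decorators:
        event = decorate(event)
    return event

result = run_chain({"symbol": "acme", "price": 100.0})
```

Keeping each transform tiny and composable is what makes the chain easy to extend: adding a new enrichment step means appending one function, not rewriting the pipeline.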
About Infochimps

Our mission is to make the world's data more accessible. Infochimps helps companies understand their data. We provide tools and services that connect their internal data, leverage the power of cloud computing and new technologies such as Hadoop, and provide a wealth of external datasets that organizations can connect to their own data.

Contact Us

Infochimps, Inc.
1214 W 6th St. Suite 202
Austin, TX 78703
1-855-DATA-FUN (1-855-328-2386)
www.infochimps.com
info@infochimps.com
Twitter: @infochimps

Get a free Big Data consultation

Let's talk Big Data in the enterprise! Get a free conference with leading Big Data experts about your enterprise Big Data project. Meet with leading data scientists Flip Kromer and/or Dhruv Bansal to talk shop about your project objectives, design, infrastructure, tools, and more. Find out how other companies are solving similar problems. Learn best practices and get recommendations -- free.