Inside Flume

                            Henry Robinson
                          henry@cloudera.com
                               @henryr




Tuesday, 17 August 2010
Who am I?

  • Distributed systems guy

  • Apache ZooKeeper committer

  • I work at Cloudera on Flume, ZooKeeper, Hue, more...

  • p.s. Cloudera is hiring!




                          Copyright 2010 Cloudera Inc. All rights reserved


About Cloudera

  • Software, services and support for Hadoop
  • Built around an open core
        • All our patches get contributed upstream
        • Flume and Hue are open-source
        • We just started the Whirr project
  • We maintain, package and support Cloudera’s Distribution
    for Hadoop
        • Smoothing off a lot of the rough edges around Hadoop
        • Includes MapReduce, HDFS, HBase, ZooKeeper, Oozie, Hive,
          Pig, Hue, Flume and more.


What’s the problem?

  • Data collection is currently a priori and ad hoc

  • A priori - decide what you want to collect ahead of time

  • Ad hoc - Each kind of data source goes through its own
    collection path
        • Usually a collection of fragile, custom scripts




What is Flume? (and how can it help?)

  • Flume is:
        •   A distributed data collection service
        •   Scalable
        •   Configurable
        •   Extensible
        •   Manageable
        •   Open source
  • How can it help?
        • One-stop solution for data collection of all formats
        • Flexible reliability guarantees allow careful performance tuning
        • Enables quick iteration on new collection strategies


The Flume Model

  • Built around the concept of flows
  • A single flow corresponds to a type of data source
        • Like web server logs
        • Or machine monitoring metrics
  • Different flows might have different compression,
    batching or reliability setups
        • Flume multiplexes many flows onto one service instance
  • Flows are composed of nodes chained together
        • Each Flume process can run many nodes, so resources are
          shared
        • Each node receives data at its source, and sends it to its sink
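A minimal sketch of the chaining idea, assuming a node can be modelled as a function from an input stream to an output stream. The names `make_node` and `run_flow` are hypothetical and are not part of Flume; the sketch only shows how one node's sink output becomes the next node's source input.

```python
from typing import Callable, Iterable, Iterator, List

# Hypothetical model: a node maps an input stream to an output stream.
Node = Callable[[Iterable[str]], Iterator[str]]


def make_node(name: str) -> Node:
    """Return a node that appends its own name to each record it forwards."""
    def node(records: Iterable[str]) -> Iterator[str]:
        for record in records:
            yield f"{record}|{name}"
    return node


def run_flow(nodes: List[Node], data: Iterable[str]) -> List[str]:
    """Chain nodes so each node's output feeds the next node's input."""
    stream: Iterable[str] = data
    for node in nodes:
        stream = node(stream)
    return list(stream)


# One flow made of two chained nodes (illustrative names only).
web_flow = [make_node("agent"), make_node("collector")]
print(run_flow(web_flow, ["GET /index.html"]))  # ['GET /index.html|agent|collector']
```

Because each flow is just a list of nodes, several flows with different settings could live side by side in one process, which is the multiplexing idea on this slide.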


Flume Flows

  • Three typical flows, all on the same Flume service
        • Flow 1: Web-clicks (reliable delivery, compressed, batched)
        • Flow 2: Process monitoring (best-effort delivery)
        • Flow 3: Advert impressions (reliable delivery)

  [Figure: for each flow, DATA enters on the left and EVENTS leave on the right]




Anatomy of a Flume node

  • Data come in through a source...
  • ... are optionally processed by one or more decorators...
  • ... and then are transmitted out via a sink
  • Each of these components is (re-)configurable at runtime
  • Each has a very simple API, and a plugin interface that
    makes customizing Flume very easy
  • These simple abstractions are sufficient to build more
    complex features like acknowledged delivery, filtering,
    compression
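The source, decorator and sink pipeline can be sketched as follows. This is an illustration under stated assumptions, not Flume's plugin API: the `Node` class and its field names are hypothetical, the source is any iterable of byte strings, each decorator is a function from bytes to bytes, and the sink is any callable.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List


@dataclass
class Node:
    """Hypothetical node: pull from a source, decorate, push to a sink."""
    source: Iterable[bytes]
    sink: Callable[[bytes], None]
    decorators: List[Callable[[bytes], bytes]] = field(default_factory=list)

    def run(self) -> None:
        for item in self.source:
            # Apply the optional decorators in order before handing off.
            for decorate in self.decorators:
                item = decorate(item)
            self.sink(item)


received: List[bytes] = []
node = Node(source=[b"hello", b"flume"], sink=received.append,
            decorators=[bytes.upper])
node.run()
print(received)  # [b'HELLO', b'FLUME']
```

Swapping the `decorators` list or the `sink` attribute between runs changes the node's behaviour without touching the other parts, which mirrors the reconfigurability point above.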

Agents and Collectors

  • Nodes that receive data from an application are called
    agents
  • Flume supports many sources for agents, including:
        •   Syslog
        •   Tailing a file
        •   Unix processes
        •   Scribe API
        •   Twitter
  • Nodes that write data to permanent storage are called
    collectors
        • Most often they write to HDFS


Flume Nodes

  • Each role may be played by many different nodes
  • Usually require substantially fewer collectors than agents

  [Figure: three node roles chained from HTTPD to HDFS]
  • Agent
        • Source: tail Apache HTTPD logs
        • Sink: downstream processor node
  • Processor
        • Source: upstream agent node
        • Decorator: extract browser name from log string and attach it to event
        • Sink: downstream collector node
  • Collector
        • Source: upstream processor node
        • Sink: HDFS://namenode/weblogs/%{browser}/



Flume Events

  • All data are transformed into a series of events

  • Events are a pair (body, metadata)

  • Body is a string of bytes

  • Metadata is a table mapping keys to values
        • Flume can use this to inform processing
        • Or simply write it with the event


The Flume Configuration Language

  • Node configurations are written in a simple language
        • my-flume-node : src | { decorator => sink }
  • For example: a configuration to read HTTP log data from
    a file and send it to a collector:
        • web-log-agent : tail("/var/log/httpd.log") | agentBESink
  • On the collector, receive data and bucket it according to
    browser:
        • web-log-collector : autoCollectorSource
          | { regex("(Firefox|Internet Explorer)", "browser") =>
          collectorSink("hdfs://namenode/flume-logs/%{browser}") }
  • Two lines to set up an entire flow
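To make the collector line concrete, here is a sketch of what a regex decorator followed by `%{browser}` path expansion might do to one event. The helpers `regex_decorator` and `expand_path` are hypothetical stand-ins written for this example, the sample log line is invented, and falling back to `unknown` for a missing key is an assumption rather than documented Flume behaviour.

```python
import re
from typing import Dict


def regex_decorator(pattern: str, key: str, body: str,
                    metadata: Dict[str, str]) -> Dict[str, str]:
    """Store the first capture group of `pattern` in the metadata under `key`."""
    updated = dict(metadata)
    match = re.search(pattern, body)
    if match:
        updated[key] = match.group(1)
    return updated


def expand_path(template: str, metadata: Dict[str, str]) -> str:
    """Replace each %{name} in the template with the matching metadata value."""
    return re.sub(r"%\{(\w+)\}",
                  lambda m: metadata.get(m.group(1), "unknown"),  # assumed fallback
                  template)


line = 'GET / HTTP/1.1 "Mozilla/5.0 Firefox/3.6"'  # invented sample log line
meta = regex_decorator("(Firefox|Internet Explorer)", "browser", line, {})
print(expand_path("hdfs://namenode/flume-logs/%{browser}", meta))
# hdfs://namenode/flume-logs/Firefox
```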


Keeping Track of Nodes

  • The master service monitors all Flume nodes
        • A single port-of-call for checking on the health of your Flume
          service
  • Send commands to the master, and it will forward them
    to the nodes
  • The Flume Shell is a convenient, scriptable command-line
    tool
  • Web-based UIs are also available



Flume as a Distributed System

  • Fundamental principle: Keep state out of the data path
    where possible
        •   Replication is costly
        •   Consistency is problematic
        •   Global knowledge is impractical
        •   Follow the end-to-end principle - put smarts at the edges
  • Advantages
        • Failures become much cheaper
        • Performance is better
  • Disadvantages
        • Have to weaken some delivery guarantees
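One way to picture "smarts at the edges" is a sender that holds the only delivery state and retries until the far end acknowledges, while the hops in between stay stateless. This is a generic sketch of that idea, not Flume's actual protocol; `deliver_end_to_end`, `max_attempts` and the `flaky_hop` test double are all hypothetical.

```python
from typing import Callable, Dict, List, Set


def deliver_end_to_end(events: List[bytes], hop: Callable[[bytes], bool],
                       max_attempts: int = 3) -> Dict[bytes, int]:
    """Retry each event at the sending edge until the hop reports an ack.

    Returns the number of attempts each event needed.
    """
    attempts: Dict[bytes, int] = {}
    for event in events:
        for attempt in range(1, max_attempts + 1):
            attempts[event] = attempt
            if hop(event):  # True means the far end acknowledged the event
                break
        else:
            raise RuntimeError(f"gave up on {event!r}")
    return attempts


# Test double that drops the first attempt for every event.
dropped_once: Set[bytes] = set()


def flaky_hop(event: bytes) -> bool:
    if event in dropped_once:
        return True
    dropped_once.add(event)
    return False


print(deliver_end_to_end([b"a", b"b"], flaky_hop))  # {b'a': 2, b'b': 2}
```

A retry scheme like this can deliver the same event more than once if only the acknowledgement is lost, which is one plausible sense in which guarantees get weaker.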


Scalability and reliability in Flume

  • The data path is ‘horizontally scalable’
        • Add more machines, get more performance
        • Typically the bottleneck is write performance at the collector
        • If machines fail, others automatically take their place
  • The master only requires a few machines
        • Consistency and replication handled by ZooKeeper + gossip
        • A cluster of five or seven machines can handle thousands of
          nodes
        • Can add more if you manage to hit the limit
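The "others automatically take their place" point can be sketched as a sender that walks a list of collectors until one accepts the event. This is a simplified illustration under assumptions: `send_with_failover` and `CollectorDown` are hypothetical names, and how real Flume nodes discover replacement collectors is not shown.

```python
from typing import Callable, List, Sequence


class CollectorDown(Exception):
    """Raised by a (simulated) collector that cannot accept events."""


def send_with_failover(event: bytes,
                       collectors: Sequence[Callable[[bytes], None]]) -> int:
    """Try collectors in order; return the index of the one that accepted."""
    for index, collector in enumerate(collectors):
        try:
            collector(event)
            return index
        except CollectorDown:
            continue  # this collector failed, fall through to the next one
    raise CollectorDown("no collector accepted the event")


def broken_collector(event: bytes) -> None:
    raise CollectorDown("simulated machine failure")


stored: List[bytes] = []
used = send_with_failover(b"click", [broken_collector, stored.append])
print(used, stored)  # 1 [b'click']
```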



Flume as Open Source

  • http://github.com/cloudera/flume
  • Already vibrant contributor community
  • Flume 0.9.1 is at release candidate 0 right now

  • Cloudera provides
        • Packages
        • Standardisation
        • Support





 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxBkGupta21
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 

Recently uploaded (20)

Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 

Inside Flume

  • 1. Inside Flume
      Henry Robinson, henry@cloudera.com, @henryr
      Tuesday, 17 August 2010
  • 2. Who am I?
      • Distributed systems guy
      • Apache ZooKeeper committer
      • I work at Cloudera on Flume, ZooKeeper, Hue, more...
      • p.s. Cloudera is hiring!
  • 3. About Cloudera
      • Software, services and support for Hadoop
      • Built around an open core
          • All our patches get contributed upstream
          • Flume and Hue are open-source
          • We just started the Whirr project
      • We maintain, package and support Cloudera’s Distribution for Hadoop
          • Smoothing off a lot of the rough edges around Hadoop
          • Includes MapReduce, HDFS, HBase, ZooKeeper, Oozie, Hive, Pig, Hue, Flume and more.
  • 4. What’s the problem?
      • Data collection is currently a priori and ad hoc
      • A priori - decide what you want to collect ahead of time
      • Ad hoc - Each kind of data source goes through its own collection path
          • Usually a collection of fragile, custom scripts
  • 5. What is Flume? (and how can it help?)
      • Flume is:
          • A distributed data collection service
          • Scalable
          • Configurable
          • Extensible
          • Manageable
          • Open source
      • How can it help?
          • One-stop solution for data collection of all formats
          • Flexible reliability guarantees allow careful performance tuning
          • Enables quick iteration on new collection strategies
  • 6. The Flume Model
      • Built around the concept of flows
      • A single flow corresponds to a type of data source
          • Like web server logs
          • Or machine monitoring metrics
      • Different flows might have different compression, batching or reliability setups
      • Flume multiplexes many flows onto one service instance
      • Flows are comprised of nodes chained together
          • Each Flume process can run many nodes, so resources are shared
          • Each node receives data at its source, and sends it to its sink
  • 7. Flume Flows
      • Three typical flows, all on the same Flume service
      [Figure: three flows carrying data events]
          Flow 1: Web-clicks (Reliable Delivery, Compressed, Batched)
          Flow 2: Process monitoring (Best Effort Delivery)
          Flow 3: Advert Impressions (Reliable Delivery)
  • 8. Anatomy of a Flume node
      • Data come in through a source...
      • ... are optionally processed by one or more decorators...
      • ... and then are transmitted out via a sink
      • Each of these components is (re-)configurable at run-time
      • Each has a very simple API, and a plugin interface that makes customizing Flume very easy
      • These simple abstractions are sufficient to build more complex features like acknowledged delivery, filtering, compression
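The source, decorator and sink chain on slide 8 can be pictured with a small Python sketch. This is only an illustration of the shape of the idea, not Flume's actual API (Flume itself is written in Java): the names `run_node` and `add_host`, the dict-based event, and the `web01` host value are all hypothetical.

```python
from typing import Any, Callable, Dict, Iterable, List

# Hypothetical stand-in for a Flume event: a dict with "body" and "metadata".
Event = Dict[str, Any]
Decorator = Callable[[Event], Event]
Sink = Callable[[Event], None]


def run_node(source: Iterable[Event], decorators: List[Decorator], sink: Sink) -> int:
    """Pull each event from the source, apply the decorators in order,
    then hand the result to the sink. Returns the number of events delivered."""
    delivered = 0
    for event in source:
        for decorate in decorators:
            event = decorate(event)
        sink(event)
        delivered += 1
    return delivered


def add_host(event: Event) -> Event:
    """Example decorator: attach a made-up 'host' key without mutating the input."""
    metadata = dict(event["metadata"])
    metadata["host"] = "web01"  # assumed value, for illustration only
    return {"body": event["body"], "metadata": metadata}


# Usage: a list plays the source, list.append plays the sink.
collected: List[Event] = []
source = [{"body": b"GET /index.html", "metadata": {}}]
count = run_node(source, [add_host], collected.append)
```

Because each stage only sees an event and returns an event, swapping a source, decorator or sink does not affect the other two, which is roughly why the slide says these abstractions are enough to layer filtering or compression on top.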
  • 9. Agents and Collectors
      • Nodes that receive data from an application are called agents
      • Flume supports many sources for agents, including:
          • Syslog
          • Tailing a file
          • Unix processes
          • Scribe API
          • Twitter
      • Nodes that write data to permanent storage are called collectors
          • Most often they write to HDFS
  • 10. Flume Nodes
      • Each role may be played by many different nodes
      • Usually require substantially fewer collectors than agents
      [Figure: the three node roles in a flow]
          Agent: source tails Apache HTTPD logs; sink is the downstream processor node
          Processor: source is the upstream agent node; decorator extracts the browser name from the log string and attaches it to the event; sink is the downstream collector node
          Collector: source is the upstream processor node; sink is HDFS://namenode/weblogs/%{browser}/
  • 11. Flume Events
      • All data are transformed into a series of events
      • Events are a pair (body, metadata)
          • Body is a string of bytes
          • Metadata is a table mapping keys to values
              • Flume can use this to inform processing
              • Or simply write it with the event
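As a rough illustration of the (body, metadata) pair on slide 11, here is a minimal Python sketch. The class and function names (`Event`, `with_tag`, `route`, `render`) and the string-valued metadata are assumptions made for this example; they are not taken from Flume's code.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass(frozen=True)
class Event:
    """Hypothetical model of an event: raw bytes plus a key/value table."""
    body: bytes
    metadata: Dict[str, str] = field(default_factory=dict)

    def with_tag(self, key: str, value: str) -> "Event":
        """Return a copy of the event with one extra metadata entry."""
        return Event(self.body, {**self.metadata, key: value})


def route(event: Event) -> str:
    """Metadata informing processing: choose an output bucket by 'browser'."""
    return event.metadata.get("browser", "other")


def render(event: Event) -> bytes:
    """Metadata simply written with the event: prefix the body with its tags."""
    tags = ",".join(f"{k}={v}" for k, v in sorted(event.metadata.items()))
    return f"[{tags}] ".encode() + event.body


event = Event(b"hello").with_tag("browser", "Firefox")
```

Here `route(event)` picks the `Firefox` bucket, while `render(event)` keeps the tag alongside the untouched body bytes.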
  • 12. The Flume Configuration Language
      • Node configurations are written in a simple language
          • my-flume-node : src | { decorator => sink }
      • For example: a configuration to read HTTP log data from a file and send it to a collector:
          • web-log-agent : tail("/var/log/httpd.log") | agentBESink
      • On the collector, receive data and bucket it according to browser:
          • web-log-collector : autoCollectorSource | { regex("(Firefox|Internet Explorer)", "browser") => collectorSink("hdfs://namenode/flume-logs/%{browser}") }
      • Two lines to set up an entire flow
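To make the collector line on slide 12 more concrete, the following Python sketch imitates what it describes: a regex decorator stores the matched browser under the `browser` key, and the sink path has `%{browser}` filled in from that metadata. It is a sketch under stated assumptions, not Flume's implementation; `regex_tagger`, `expand_path`, and the `unknown` fallback for a missing key are invented for illustration.

```python
import re
from typing import Callable, Dict

Metadata = Dict[str, str]


def regex_tagger(pattern: str, attr: str) -> Callable[[str, Metadata], Metadata]:
    """Build a decorator that stores the first capture group under `attr`."""
    compiled = re.compile(pattern)

    def decorate(body: str, metadata: Metadata) -> Metadata:
        tagged = dict(metadata)  # copy so the caller's table is untouched
        match = compiled.search(body)
        if match:
            tagged[attr] = match.group(1)
        return tagged

    return decorate


def expand_path(template: str, metadata: Metadata) -> str:
    """Replace each %{key} in the template with the metadata value.
    Falling back to 'unknown' is an assumption, not documented Flume behaviour."""
    return re.sub(r"%\{(\w+)\}", lambda m: metadata.get(m.group(1), "unknown"), template)


tag_browser = regex_tagger("(Firefox|Internet Explorer)", "browser")
log_line = '1.2.3.4 "GET /" "Mozilla/5.0 Firefox/3.6"'
metadata = tag_browser(log_line, {})
path = expand_path("hdfs://namenode/flume-logs/%{browser}", metadata)
```

With this log line the event would land under `hdfs://namenode/flume-logs/Firefox`, which is the bucketing by browser that the slide describes.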
  • 13. Keeping Track of Nodes
      • The master service monitors all Flume nodes
          • A single port-of-call for checking on the health of your Flume service
          • Send commands to the master, and it will forward them to the nodes
      • The Flume Shell is a convenient, scriptable command-line tool
      • Web-based UIs are also available
  • 14. Flume as a Distributed System
      • Fundamental principle: Keep state out of the data path where possible
          • Replication is costly
          • Consistency is problematic
          • Global knowledge is impractical
      • Follow the end-to-end principle - put smarts at the edges
      • Advantages
          • Failures become much cheaper
          • Performance is better
      • Disadvantages
          • Have to weaken some delivery guarantees
  • 15. Scalability and reliability in Flume
      • The data path is ‘horizontally scalable’
          • Add more machines, get more performance
          • Typically the bottleneck is write performance at the collector
          • If machines fail, others automatically take their place
      • The master only requires a few machines
          • Consistency and replication handled by ZooKeeper + gossip
          • A cluster of five or seven machines can handle thousands of nodes
          • Can add more if you manage to hit the limit
  • 16. Flume as Open Source
      • http://github.com/cloudera/flume
      • Already vibrant contributor community
      • Flume 0.9.1 is at release candidate 0 right now
      • Cloudera provides
          • Packages
          • Standardisation
          • Support
  • 17. Copyright 2010 Cloudera Inc. All rights reserved Tuesday, 17 August 2010