Inside Flume

Henry Robinson
henry@cloudera.com
@henryr

Tuesday, 17 August 2010

Who am I?

• Distributed systems guy

• Apache ZooKeeper committer

• I work at Cloudera on Flume, ZooKeeper, Hue, more...

• p.s. Cloudera is hiring!

Copyright 2010 Cloudera Inc. All rights reserved


About Cloudera

• Software, services and support for Hadoop
• Built around an open core
• All our patches get contributed upstream
• Flume and Hue are open-source
• We just started the Whirr project
• We maintain, package and support Cloudera’s Distribution
for Hadoop
• Smoothing off a lot of the rough edges around Hadoop
• Includes MapReduce, HDFS, HBase, ZooKeeper, Oozie, Hive,
Pig, Hue, Flume and more.



What’s the problem?

• Data collection is currently a priori and ad hoc

• A priori - decide what you want to collect ahead of time

• Ad hoc - Each kind of data source goes through its own
collection path
• Usually a collection of fragile, custom scripts



What is Flume? (and how can it help?)

• Flume is:
• A distributed data collection service
• Scalable
• Configurable
• Extensible
• Manageable
• Open source
• How can it help?
• One-stop solution for data collection of all formats
• Flexible reliability guarantees allow careful performance tuning
• Enables quick iteration on new collection strategies


The Flume Model

• Built around the concept of flows
• A single flow corresponds to a type of data source
• Like web server logs
• Or machine monitoring metrics
• Different flows might have different compression,
batching or reliability setups
• Flume multiplexes many flows onto one service instance
• Flows are composed of nodes chained together
• Each Flume process can run many nodes, so resources are
shared
• Each node receives data at its source, and sends it to its sink


Flume Flows

• Three typical flows, all on the same Flume service

[Diagram: three flows carrying data events through one Flume service]
Flow 1: Web-clicks (Reliable Delivery, Compressed, Batched)
Flow 2: Process monitoring (Best Effort Delivery)
Flow 3: Advert Impressions (Reliable Delivery)



Anatomy of a Flume node

• Data come in through a source...
• ... are optionally processed by one or more decorators...
• ... and then are transmitted out via a sink
• Each of these components is (re-)configurable at run time
• Each has a very simple API, and a plugin interface that
makes customizing Flume very easy
• These simple abstractions are sufficient to build more
complex features like acknowledged delivery, filtering,
compression
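
The source, decorator and sink chain described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Flume's actual plugin API: the names `list_source`, `tag_decorator` and `run_node`, and the dict-shaped event, are all assumptions made for this example.

```python
from typing import Callable, Dict, Iterable, Iterator, List

# Hypothetical event shape for this sketch: {"body": bytes, "metadata": dict}
Event = Dict[str, object]
Decorator = Callable[[Iterator[Event]], Iterator[Event]]


def list_source(lines: Iterable[bytes]) -> Iterator[Event]:
    """Hypothetical source: turns each input line into one event."""
    for line in lines:
        yield {"body": line, "metadata": {}}


def tag_decorator(key: str, value: str) -> Decorator:
    """Hypothetical decorator: attaches one metadata entry to every event."""
    def decorate(events: Iterator[Event]) -> Iterator[Event]:
        for event in events:
            event["metadata"][key] = value
            yield event
    return decorate


def run_node(source: Iterable[Event],
             decorators: List[Decorator],
             sink: Callable[[Event], None]) -> None:
    """Pull events from the source, pass them through each decorator in order,
    and hand every resulting event to the sink."""
    stream = iter(source)
    for decorator in decorators:
        stream = decorator(stream)
    for event in stream:
        sink(event)


# Usage: a list stands in for the downstream sink.
collected: List[Event] = []
run_node(list_source([b"line one", b"line two"]),
         [tag_decorator("flow", "web-clicks")],
         collected.append)
```

Because decorators only wrap the event stream, features such as filtering or compression could be added by chaining further decorators without touching the source or the sink.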



Agents and Collectors

• Nodes that receive data from an application are called
agents
• Flume supports many sources for agents, including:
• Syslog
• Tailing a file
• Unix processes
• Scribe API
• Twitter
• Nodes that write data to permanent storage are called
collectors
• Most often they write to HDFS


Flume Nodes

• Each role may be played by many different nodes
• Usually require substantially fewer collectors than agents

[Diagram: an example chain of three nodes, fed by an HTTPD server]
Agent
  Source: Tail Apache HTTPD logs
  Sink: Downstream processor node
Processor
  Source: Upstream agent node
  Decorator: Extract browser name from log string and attach it to event
  Sink: Downstream collector node
Collector
  Source: Upstream processor node
  Sink: HDFS://namenode/weblogs/%{browser}/ (HDFS)



Flume Events

• All data are transformed into a series of events

• Each event is a pair (body, metadata)

• Body is a string of bytes

• Metadata is a table mapping keys to values
• Flume can use this to inform processing
• Or simply write it with the event
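
As a rough illustration of the (body, metadata) pair, here is a minimal Python sketch. The `Event` class and its field names are hypothetical stand-ins chosen for this example, not Flume's own event class.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Event:
    """Hypothetical stand-in for a Flume event: a (body, metadata) pair."""
    body: bytes                                              # opaque string of bytes
    metadata: Dict[str, bytes] = field(default_factory=dict)  # table of key -> value


# Usage: the body carries the raw data; metadata can be added along the way
# and later used to inform processing or simply written out with the event.
event = Event(body=b"GET /index.html 200", metadata={"host": b"web01"})
event.metadata["browser"] = b"Firefox"
```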



The Flume Configuration Language

• Node configurations are written in a simple language
• my-flume-node : src | { decorator => sink }
• For example: a configuration to read HTTP log data from
a file and send it to a collector:
• web-log-agent : tail("/var/log/httpd.log") | agentBESink
• On the collector, receive data and bucket it according to
browser:
• web-log-collector : autoCollectorSource
| { regex("(Firefox|Internet Explorer)", "browser") =>
collectorSink("hdfs://namenode/flume-logs/%{browser}") }
• Two lines to set up an entire flow
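
To make the bucketing step concrete, the following Python sketch mimics what the collector configuration above describes: a regex match is attached to the event as the "browser" attribute, and `%{browser}` in the sink path is replaced by that value. The functions `regex_decorator` and `expand_path` are hypothetical, and the "unknown" fallback for a missing attribute is an assumption of this sketch; Flume's real behaviour may differ.

```python
import re
from typing import Callable, Dict


def regex_decorator(pattern: str, attr: str) -> Callable[[dict], dict]:
    """Hypothetical decorator: store the first capture group under `attr`."""
    compiled = re.compile(pattern)

    def decorate(event: dict) -> dict:
        match = compiled.search(event["body"])
        if match:
            event["metadata"][attr] = match.group(1)
        return event

    return decorate


def expand_path(template: str, metadata: Dict[str, str]) -> str:
    """Replace each %{key} in the template with the event's metadata value.
    Falling back to "unknown" is an assumption made for this sketch."""
    return re.sub(r"%\{(\w+)\}",
                  lambda m: metadata.get(m.group(1), "unknown"),
                  template)


# Usage: tag one log line, then work out which bucket it would land in.
tag_browser = regex_decorator("(Firefox|Internet Explorer)", "browser")
event = tag_browser({"body": "GET / HTTP/1.1 Mozilla/5.0 Firefox/3.6",
                     "metadata": {}})
path = expand_path("hdfs://namenode/flume-logs/%{browser}", event["metadata"])
```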


Keeping Track of Nodes

• The master service monitors all Flume nodes
• A single port-of-call for checking on the health of your Flume
service
• Send commands to the master, and it will forward them
to the nodes
• The Flume Shell is a convenient, scriptable command-line
tool
• Web-based UIs are also available



Flume as a Distributed System

• Fundamental principle: Keep state out of the data path
where possible
• Replication is costly
• Consistency is problematic
• Global knowledge is impractical
• Follow the end-to-end principle - put smarts at the edges
• Advantages
• Failures become much cheaper
• Performance is better
• Disadvantages
• Have to weaken some delivery guarantees


Scalability and reliability in Flume

• The data path is ‘horizontally scalable’
• Add more machines, get more performance
• Typically the bottleneck is write performance at the collector
• If machines fail, others automatically take their place
• The master only requires a few machines
• Consistency and replication handled by ZooKeeper + gossip
• A cluster of five or seven machines can handle thousands of
nodes
• Can add more if you manage to hit the limit



Flume as Open Source

• http://github.com/cloudera/flume
• Already vibrant contributor community
• Flume 0.9.1 is at release candidate 0 right now

• Cloudera provides
• Packages
• Standardisation
• Support


