Generative AI for Technical Writer or Information Developers
Inside Flume
1. Inside Flume
Henry Robinson
henry@cloudera.com
@henryr
Tuesday, 17 August 2010
2. Who am I?
• Distributed systems guy
• Apache ZooKeeper committer
• I work at Cloudera on Flume, ZooKeeper, Hue, more...
• p.s. Cloudera is hiring!
Copyright 2010 Cloudera Inc. All rights reserved
Tuesday, 17 August 2010
3. About Cloudera
• Software, services and support for Hadoop
• Built around an open core
• All our patches get contributed upstream
• Flume and Hue are open-source
• We just started the Whirr project
• We maintain, package and support Cloudera’s Distribution
for Hadoop
• Smoothing off a lot of the rough edges around Hadoop
• Includes MapReduce, HDFS, HBase, ZooKeeper, Oozie, Hive,
Pig, Hue, Flume and more.
Copyright 2010 Cloudera Inc. All rights reserved
Tuesday, 17 August 2010
4. What’s the problem?
• Data collection is currently a priori and ad hoc
• A priori - decide what you want to collect ahead of time
• Ad hoc - Each kind of data source goes through its own
collection path
• Usually a collection of fragile, custom scripts
Copyright 2010 Cloudera Inc. All rights reserved
Tuesday, 17 August 2010
5. What is Flume? (and how can it help?)
• Flume is:
• A distributed data collection service
• Scalable
• Configurable
• Extensible
• Manageable
• Open source
• How can it help?
• One-stop solution for data collection of all formats
• Flexible reliability guarantees allow careful performance tuning
• Enables quick iteration on new collection strategies
Copyright 2010 Cloudera Inc. All rights reserved
Tuesday, 17 August 2010
6. The Flume Model
• Built around the concept of flows
• A single flow corresponds to a type of data source
• Like web server logs
• Or machine monitoring metrics
• Different flows might have different compression,
batching or reliability setups
• Flume multiplexes many flows onto one service instance
• Flows are comprised of nodes chained together
• Each Flume process can run many nodes, so resources are
shared
• Each node receives data at its source, and sends it to its sink
Copyright 2010 Cloudera Inc. All rights reserved
Tuesday, 17 August 2010
7. Flume Flows
• Three typical flows, all on the same Flume service
Flow 1: Web-clicks
Reliable Delivery, Compressed, Batched
EV
A EN
D AT TS
DATA Flow 2: Process monitoring EVENTS
Best Effort Delivery
DA
TA N TS
E VE
Flow 3: Advert Impressions
Reliable Delivery
Copyright 2010 Cloudera Inc. All rights reserved
Tuesday, 17 August 2010
8. Anatomy of a Flume node
• Data come in through a source...
• ... are optionally processed by one or more decorators...
• ... and then are transmitted out via a sink
• Each of these components is (re-)configurable at run-
time
• Each has a very simple API, and a plugin interface that
makes customizing Flume very easy
• These simple abstractions are sufficient to build more
complex features like acknowledged delivery, filtering,
compression
Copyright 2010 Cloudera Inc. All rights reserved
Tuesday, 17 August 2010
9. Agents and Collectors
• Nodes that receive data from an application are called
agents
• Flume supports many sources for agents, including:
• Syslog
• Tailing a file
• Unix processes
• Scribe API
• Twitter
• Nodes that write data to permanent storage are called
collectors
• Most often they write to HDFS
Copyright 2010 Cloudera Inc. All rights reserved
Tuesday, 17 August 2010
10. Flume Nodes Source
Agent
Sink
HTTPD Tail Apache Downstream
HTTPD logs processor node
• Each role may be
played by many
Processor
different nodes Source Decorator Sink
Extract browser
Upstream agent name from log string Downstream
node and attach it to event collector node
• Usually require
substantially fewer
collectors than agents Collector
Source Sink
HDFS://
Upstream namenode/ S
HDF
processor node /weblogs/
%{browser}/
Copyright 2010 Cloudera Inc. All rights reserved
Tuesday, 17 August 2010
11. Flume Events
• All data are transformed into a series of events
• Events are a pair (body, metadata)
• Body is a string of bytes
• Metadata is a table mapping keys to values
• Flume can use this to inform processing
• Or simply write it with the event
Copyright 2010 Cloudera Inc. All rights reserved
Tuesday, 17 August 2010
12. The Flume Configuration Language
• Node configurations are written in a simple language
• my-flume-node : src | { decorator => sink }
• For example: a configuration to read HTTP log data from
a file and send it to a collector:
• web-log-agent : tail(“/var/log/httpd.log”) | agentBESink
• On the collector, receive data and bucket it according to
browser:
• web-log-collector : autoCollectorSource
| { regex(“(Firefox|Internet Explorer)”, “browser”) =>
collectorSink(“hdfs://namenode/flume-logs/%{browser}”) }
• Two lines to set-up an entire flow
Copyright 2010 Cloudera Inc. All rights reserved
Tuesday, 17 August 2010
13. Keeping Track of Nodes
• The master service monitors all Flume nodes
• A single port-of-call for checking on the health of your Flume
service
• Send commands to the master, and it will forward them
to the nodes
• The Flume Shell is a convenient, scriptable command-line
tool
• Web-based UIs are also available
Copyright 2010 Cloudera Inc. All rights reserved
Tuesday, 17 August 2010
14. Flume as a Distributed System
• Fundamental principle: Keep state out of the data path
where possible
• Replication is costly
• Consistency is problematic
• Global knowledge is impractical
• Follow the end-to-end principle - put smarts at the edges
• Advantages
• Failures become much cheaper
• Performance is better
• Disadvantages
• Have to weaken some delivery guarantees
Copyright 2010 Cloudera Inc. All rights reserved
Tuesday, 17 August 2010
15. Scalability and reliability in Flume
• The data path is ‘horizontally scalable’
• Add more machines, get more performance
• Typically the bottleneck is write performance at the collector
• If machines fail, others automatically take their place
• The master only requires a few machines
• Consistency and replication handled by ZooKeeper + gossip
• A cluster of five or seven machines can handle thousands of
nodes
• Can add more if you manage to hit the limit
Copyright 2010 Cloudera Inc. All rights reserved
Tuesday, 17 August 2010
16. Flume as Open Source
• http://github.com/cloudera/flume
• Already vibrant contributor community
• Flume 0.9.1 is at release candidate 0 right now
• Cloudera provides
• Packages
• Standardisation
• Support
Copyright 2010 Cloudera Inc. All rights reserved
Tuesday, 17 August 2010