This document discusses using Apache Falcon's Pipeline Designer for big data ETL. It provides an overview of the key concepts in Pipeline Designer including feeds, processes, actions, transforms, and deployment. Pipeline Designer aims to simplify authoring of ETL workflows for big data by providing a visual interface and compiling transformations into Pig scripts to be executed by Falcon.
10. Agenda
! ETL & Challenges with Big Data
! Apache Falcon – Background
! Pipeline Designer – Overview
! Pipeline Designer – Internals
11. Apache Falcon
! Off the shelf, Falcon provides standard data management functions through declarative constructs
! Data movement recipes
! Cross data center replication
! Cross cluster data synchronization
! Data retention recipes
! Eviction
! Archival
12. Apache Falcon
! However, ETL-related functions are still largely left to the developer to implement. Today Falcon manages only:
! Orchestration
! Late data handling / Change data capture
! Retries
! Monitoring
15. Pipeline Designer – Basics
! Feed
! Is a data entity that Falcon manages and that is physically present in a cluster.
! Data in a feed conforms to a schema, and the feed's partitions are registered with HCatalog.
! Data management functions such as eviction and archival are specified declaratively through Falcon feed definitions.
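A minimal sketch of what "declaratively specified" eviction could mean for a feed. The `Feed` class, its fields, and `partitions_to_evict` are hypothetical names for illustration only; they are not Falcon's entity format or API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Feed:
    """Hypothetical in-memory model of a feed (not Falcon's real entity format)."""
    name: str
    schema: dict          # column name -> type name
    retention: timedelta  # declared eviction limit

    def partitions_to_evict(self, partitions, now):
        """Return the partition timestamps older than the retention limit."""
        cutoff = now - self.retention
        return [p for p in partitions if p < cutoff]


clicks = Feed(
    name="clicks",
    schema={"user_id": "chararray", "url": "chararray", "ts": "long"},
    retention=timedelta(days=2),
)
now = datetime(2014, 6, 4)
partitions = [datetime(2014, 6, day) for day in (1, 2, 3)]
expired = clicks.partitions_to_evict(partitions, now)
```

The author of the feed states only the retention limit; deciding which partitions fall outside it is left to the system.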
17. Pipeline Designer – Basics
! Process
! Workflow that defines the actions that need to be performed, along with their control flow
! Executes at a specified frequency on one or more
clusters
! Pipelines
! Logical grouping of Falcon processes owned and
operated together
19. Pipeline Designer – Basics
! Actions
! Actions in the designer are the building blocks of process workflows.
! Actions can read output variables emitted earlier in the flow and can emit output variables of their own
! Actions can transition to other actions
! Default / Success Transition
! Failure Transition
! Conditional Transition
! A transformation action is a special action that is itself a collection of transforms
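The action model above (shared output variables plus success, failure, and conditional transitions) can be sketched as a small interpreter. All names here (`Action`, `execute`, the sample flow) are assumptions made for illustration, not the designer's actual classes.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

Variables = Dict[str, object]


@dataclass
class Action:
    name: str
    run: Callable[[Variables], Variables]  # reads earlier variables, returns new ones
    on_success: Optional[str] = None       # default transition
    on_failure: Optional[str] = None       # taken when run() raises
    conditional: List[Tuple[Callable[[Variables], bool], str]] = field(default_factory=list)


def execute(actions: Dict[str, Action], start: str) -> Tuple[List[str], Variables]:
    """Walk the flow from `start`; assumes the flow has no cycles."""
    variables: Variables = {}
    visited: List[str] = []
    current: Optional[str] = start
    while current is not None:
        action = actions[current]
        visited.append(current)
        try:
            variables.update(action.run(variables))
        except Exception:
            current = action.on_failure
            continue
        # The first matching conditional transition wins; otherwise take the default.
        current = next(
            (target for predicate, target in action.conditional if predicate(variables)),
            action.on_success,
        )
    return visited, variables


flow = {
    "count": Action("count", lambda v: {"rows": 0}, on_success="load",
                    conditional=[(lambda v: v["rows"] == 0, "notify")]),
    "load": Action("load", lambda v: {"loaded": True}),
    "notify": Action("notify", lambda v: {"notified": True}),
}
visited, variables = execute(flow, "count")
```

In this run the `count` action emits `rows`, the conditional transition reads it, and the flow goes to `notify` instead of the default `load`.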
21. Pipeline Designer – Basics
! Transforms
! Is a data manipulation function that accepts one or more inputs with a well-defined schema and produces one or more outputs
! Multiple transform elements can be stitched together
to compose a single transformation action which can
further be used to build a flow
! Composite Transformations
! Transforms that are built through a combination of
multiple primitive transforms
! New transforms can be added to extend the system
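Stitching primitive transforms into a composite one can be sketched as function composition. This sketch simplifies to one input and one output per transform, whereas the slides allow several; `filter_rows`, `project`, and `compose` are hypothetical helpers, not part of the designer.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]
Transform = Callable[[List[Row]], List[Row]]


def filter_rows(predicate: Callable[[Row], bool]) -> Transform:
    """Primitive transform: keep only the rows that satisfy the predicate."""
    return lambda rows: [r for r in rows if predicate(r)]


def project(*columns: str) -> Transform:
    """Primitive transform: keep only the named columns."""
    return lambda rows: [{c: r[c] for c in columns} for r in rows]


def compose(*transforms: Transform) -> Transform:
    """Composite transform: apply the given transforms left to right."""
    def composite(rows: List[Row]) -> List[Row]:
        for transform in transforms:
            rows = transform(rows)
        return rows
    return composite


adult_names = compose(filter_rows(lambda r: r["age"] >= 18), project("name"))
result = adult_names([{"name": "a", "age": 30}, {"name": "b", "age": 12}])
```

Because a composite has the same shape as a primitive, it can be reused inside further compositions, which is one way a system like this stays extensible.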
22. Pipeline Designer – Basics
! Deployment & Monitoring
! Once a process and its pipeline are composed, they are deployed in Falcon as a standard process
24. Pipeline Designer Service
[Architecture diagram. Labels: Designer UI; Falcon Dashboard; Pipeline Designer Service (REST API, Versioned Storage, Flow / Action / Transforms, Compiler + Optimizer); Falcon Server; HCatalog Service; Process; Feed; Schema.]
25. Pipeline Designer – Internals
! Transformation actions are compiled into Pig scripts
! Actions and flows are compiled into Falcon process definitions
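To illustrate the idea of compiling a transformation action into a Pig script, here is a minimal sketch that emits Pig Latin text for a linear chain of transforms. The `compile_to_pig` function, the step encoding, and the paths are assumptions for illustration; the real compiler and optimizer are not described in the slides and likely differ (for example, in how data is loaded through HCatalog).

```python
def compile_to_pig(source, schema, steps, target):
    """Emit Pig Latin text for a linear chain of filter/project steps."""
    fields = ", ".join(f"{name}:{type_}" for name, type_ in schema)
    lines = [f"r0 = LOAD '{source}' AS ({fields});"]
    for i, (kind, arg) in enumerate(steps, start=1):
        prev, cur = f"r{i - 1}", f"r{i}"
        if kind == "filter":
            # arg is a Pig boolean expression, passed through unchanged
            lines.append(f"{cur} = FILTER {prev} BY {arg};")
        elif kind == "project":
            # arg is a list of column names to keep
            lines.append(f"{cur} = FOREACH {prev} GENERATE {', '.join(arg)};")
        else:
            raise ValueError(f"unsupported transform: {kind}")
    lines.append(f"STORE r{len(steps)} INTO '{target}';")
    return "\n".join(lines)


script = compile_to_pig(
    source="/data/clicks",
    schema=[("user_id", "chararray"), ("ts", "long")],
    steps=[("filter", "ts > 0L"), ("project", ["user_id"])],
    target="/data/clicks_clean",
)
print(script)
```

Each transform becomes one Pig relation that refers to the previous one, so the visual chain maps directly onto a sequence of statements.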