Apache Crunch
SIMPLE AND EFFICIENT MAPREDUCE PIPELINES
THIN VENEER ON TOP OF MAPREDUCE
Aim of Crunch
The main goal of Crunch is to provide a high-level API for writing and testing
complex MapReduce jobs that require multiple processing stages
In other words
Make pipelines that are composed of many user-defined functions simple
to write, easy to test, and efficient to run
Why Crunch?
 A framework for writing, testing and running MapReduce pipelines.
 Crunch does not impose a single data type that all of its inputs must conform to. This is useful
when processing time-series data, serialized object formats, HBase rows and columns, etc.
 Crunch provides a library of patterns for implementing common tasks such as joining data,
performing aggregations and sorting records.
 Type safety makes mistakes in your code much less likely.
 Simple, powerful testing using the supplied MemPipeline for fast in-memory unit tests.
 Pluggable execution engines, such as MapReduce and Spark, let us keep up with new
technology advancements in the big data space without having to rewrite all of our
pipelines at each step.
 Manages pipeline execution.
Metadata about Crunch
 Modeled after ‘FlumeJava’ by Google.
 Initial coding of Crunch was done by Josh Wills at Cloudera in 2011.
 Licensed under the Apache License, Version 2.0
 DoFns are used by Crunch in the same way that MapReduce uses the
Mapper or Reducer classes.
 Runs over Hadoop MapReduce and Apache Spark
Crunch APIs
 Centered around 3 interfaces that represent immutable distributed
datasets
1. PCollection
2. PTable
3. PGroupedTable
PCollection – Lazily evaluated parallel collection
 PCollection<T> represents a distributed, unsorted and immutable
collection of elements of type T.
 E.g.: PCollection<String>
 PCollection<T> provides a parallelDo method, which applies a DoFn to each
element of the PCollection<T> in parallel and returns a new PCollection<T>
as its result.
parallelDo
It applies an element-wise computation to an input PCollection<T>
Signature: collection.parallelDo(doFn, pType), or with an optional stage name: collection.parallelDo(name, doFn, pType)
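A minimal sketch of parallelDo, assuming the Apache Crunch library is on the classpath; the in-memory MemPipeline and the uppercasing logic are illustrative choices, not part of the original deck:

```java
import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;
import org.apache.crunch.PCollection;
import org.apache.crunch.impl.mem.MemPipeline;
import org.apache.crunch.types.writable.Writables;

public class ParallelDoSketch {
  public static void main(String[] args) {
    // In-memory PCollection for illustration
    PCollection<String> lines = MemPipeline.collectionOf("hello crunch", "hello world");

    // parallelDo applies the DoFn to every element and yields a new PCollection
    PCollection<String> upper = lines.parallelDo(new DoFn<String, String>() {
      @Override
      public void process(String input, Emitter<String> emitter) {
        emitter.emit(input.toUpperCase());
      }
    }, Writables.strings());
  }
}
```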
Pipeline – Source > PType > Target
 Crunch composes processing into pipelines.
 A pipeline is a programmatic description of a DAG.
 Different pipelines available are:
 MapReduce pipeline
 Memory pipeline
 Spark pipeline
 A pipeline starts with a ‘Source’, which supplies its input (at least one source per
pipeline).
 Input sources available are Avro, Parquet, sequence files, HBase, HFiles, CSV, JDBC, text
 The data from the ‘Source’ is read into a ‘PType’.
 PType hides the serialization and exposes data in native Java forms.
 The data is persisted into a ‘Target’ (at least one target per pipeline).
 Output targets available are Avro, Parquet, sequence files, HBase, HFiles, CSV, JDBC, text
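The Source-to-Target flow above can be sketched as follows; the input and output paths are hypothetical, and the Crunch library is assumed on the classpath:

```java
import org.apache.crunch.PCollection;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;

public class PipelineSketch {
  public static void main(String[] args) {
    // MapReduce-backed pipeline (MemPipeline or SparkPipeline are alternatives)
    Pipeline pipeline = new MRPipeline(PipelineSketch.class);

    // Source: read text into a PCollection (the PType here is strings)
    PCollection<String> lines = pipeline.readTextFile("/in/logs");

    // ... transform with parallelDo, groupByKey, etc. ...

    // Target: persist the result
    pipeline.writeTextFile(lines, "/out/processed");
    pipeline.done();  // triggers planning and execution of the DAG
  }
}
```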
DoFn – The data processor
 A simple API to implement
 Used to transform PCollections from one form to another
 DoFn is where custom logic lives
Example:
class Example extends DoFn<String, String> {
….
}
The class needs to define a method called ‘process()’:
public void process(String s, Emitter<String> emitter) {
String data = ..;
emitter.emit(data);
}
This is where we write our custom logic
DoFn runtime processing steps
1. The DoFn is given access to the ‘TaskInputOutputContext’ implementation for the
current task. This allows the DoFn to access any configuration and runtime
information needed before or during processing.
2. The DoFn’s ‘initialize’ method is called, similar to ‘setup’ in the Mapper/Reducer classes.
3. Data processing begins. The map or reduce phase passes the input to the
‘process’ method of the DoFn. The output is captured by an ‘Emitter<T>’, which
can either hand it to another DoFn or serialize it as the output
of the current stage.
4. Cleaning up is performed by the ‘cleanup’ method. It has two purposes: emitting
any remaining state of the DoFn to the ‘Emitter<T>’, and releasing any resources
the DoFn holds.
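The lifecycle steps above can be sketched with the DoFn hooks; the buffering logic is a hypothetical example, and Crunch is assumed on the classpath:

```java
import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;

public class BufferingFn extends DoFn<String, String> {
  private StringBuilder buffer;

  @Override
  public void initialize() {           // step 2: like Mapper/Reducer setup()
    buffer = new StringBuilder();
  }

  @Override
  public void process(String input, Emitter<String> emitter) {  // step 3
    buffer.append(input).append('\n'); // accumulate state instead of emitting
  }

  @Override
  public void cleanup(Emitter<String> emitter) {  // step 4: emit remaining state
    emitter.emit(buffer.toString());
  }
}
```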
Accessing runtime mapreduce APIs
 DoFn provides access to ‘TaskInputOutputContext’ object
 getConfiguration()
 progress()
 setStatus()/getStatus()
 getTaskAttemptID()
 DoFn provides a helper method, ‘increment’, for working with Hadoop counters.
The final value of a counter can be retrieved from the ‘StageResult’ object.
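A sketch of the ‘increment’ helper; the counter group and name strings are hypothetical, and Crunch is assumed on the classpath:

```java
import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;

// Counts malformed records while passing valid ones through.
public class ValidatingFn extends DoFn<String, String> {
  @Override
  public void process(String input, Emitter<String> emitter) {
    if (input.isEmpty()) {
      // Hadoop counter; its final value is readable from StageResult
      increment("quality", "emptyRecords");
    } else {
      emitter.emit(input);
    }
  }
}
```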
Common DoFn patterns
Following are specialized flavors of DoFn:
 FilterFn – used to keep only those elements of a PCollection<T> that satisfy a
filter condition.
 MapFn – used in transformations where each input has exactly one
output.
 CombineFn – used in conjunction with the ‘combineValues’ method defined
on a PGroupedTable instance. This is used to perform associative
functions that run in the combiner phase of a MapReduce job.
 The associative patterns supported include sums, counts and unions, via the
‘Aggregator’ interface.
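Minimal sketches of the FilterFn and MapFn flavors; the class names and logic are illustrative, and Crunch is assumed on the classpath:

```java
import org.apache.crunch.FilterFn;
import org.apache.crunch.MapFn;

public class PatternSketches {
  // FilterFn: keep only elements for which accept() returns true
  static class NonEmptyFn extends FilterFn<String> {
    @Override
    public boolean accept(String input) {
      return !input.isEmpty();
    }
  }

  // MapFn: exactly one output per input
  static class TrimFn extends MapFn<String, String> {
    @Override
    public String map(String input) {
      return input.trim();
    }
  }
}
```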
PTable<K,V>
 A sub-interface of PCollection<Pair<K,V>>
 Represents a distributed, immutable and unordered multimap of key type
K and value type V
 PTable<K,V> provides parallelDo, groupByKey, join and cogroup operations
 The groupByKey operation brings together all values in the PTable that share
the same key. (It triggers the sort phase of a MapReduce job)
 Map-side, Bloom filter and sharded joins are available.
 The number of reducers and the partitioning, grouping and sorting strategies
used in the shuffle phase can be specified in an instance of the GroupingOptions
class, which is then passed to the groupByKey function.
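A sketch of building a PTable and grouping it by key; the in-memory table and its contents are illustrative, and Crunch is assumed on the classpath:

```java
import org.apache.crunch.PGroupedTable;
import org.apache.crunch.PTable;
import org.apache.crunch.impl.mem.MemPipeline;
import org.apache.crunch.types.writable.Writables;

public class PTableSketch {
  public static void main(String[] args) {
    // In-memory PTable<String, Long> for illustration
    PTable<String, Long> hits = MemPipeline.typedTableOf(
        Writables.tableOf(Writables.strings(), Writables.longs()),
        "/index", 1L, "/index", 1L, "/about", 1L);

    // groupByKey brings all values with the same key together (the shuffle)
    PGroupedTable<String, Long> grouped = hits.groupByKey();
  }
}
```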
PGroupedTable<K,V>
 The result of groupByKey function is a PGroupedTable<K,V> object, which
is a distributed sorted map of keys of type K to an iterable that may be
iterated once.
 PGroupedTable<K,V> has parallelDo and combineValues operations
 combineValues applies a commutative and associative ‘Aggregator’
to the values of the PGroupedTable instance on both the map
and reduce sides of the shuffle
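A sketch of combineValues with a built-in Aggregator; the in-memory table is illustrative, and Crunch is assumed on the classpath:

```java
import org.apache.crunch.PTable;
import org.apache.crunch.fn.Aggregators;
import org.apache.crunch.impl.mem.MemPipeline;
import org.apache.crunch.types.writable.Writables;

public class CombineSketch {
  public static void main(String[] args) {
    PTable<String, Long> hits = MemPipeline.typedTableOf(
        Writables.tableOf(Writables.strings(), Writables.longs()),
        "/index", 1L, "/index", 1L, "/about", 1L);

    // The Aggregator runs on both the map side (combiner) and the
    // reduce side of the shuffle, summing the values per key
    PTable<String, Long> counts = hits.groupByKey()
        .combineValues(Aggregators.SUM_LONGS());
  }
}
```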
Across various technologies

Concept | Apache Hadoop MapReduce | Apache Crunch | Apache Pig | Apache Spark | Cascading | Apache Hive | Apache Tez
Input Data | InputFormat | Source | LoadFunc | InputFormat | Tap (Source) | SerDe | Tez Input
Output Data | OutputFormat | Target | StoreFunc | OutputFormat | Tap (Sink) | SerDe | Tez Output
Data Container Abstraction | N/A | PCollection | Relation | RDD | Pipe | Table | Vertex
Data Format and Serialization | Writables | POJOs and PTypes | Pig Tuples and Schemas | POJOs and Java/Kryo Serialization | Cascading Tuples and Schemes | List<Object> and ObjectInspectors | Events
Data Transformation | Mapper, Reducer, and Combiner | DoFn | Pig Latin and UDFs | Functions (Java API) | Operations | HiveQL and UDFs | Processor
Miscellaneous
 Two different serialization frameworks with a number of convenience methods for
defining PTypes:
 Hadoop's ’Writable’ interface
 Apache ‘Avro’ serialization
 Crunch can execute an individual DoFn in either the map or reduce phase of a
MapReduce job, and can also execute multiple DoFns in a single phase.
 Apache Hive and Apache Pig define domain-specific languages (DSLs) that are
intended to make it easy for data analysts to work with data stored in Hadoop,
while Cascading and Apache Crunch provide Java libraries aimed at
developers who are building pipelines and applications with a focus on
performance and testability.
Use Case – Log Data Processor
Let’s see how the simple log data processor below can be implemented in Crunch
Use Case – Log Data Processor
Crunch implementation of the above use case
Crunch vs. Cascading, Pig, Hive
 Developers who tend to think about problems as data-flow patterns prefer Crunch and
Pig, while those who think in an SQL style prefer Cascading and Hive.
 Crunch supports an in-memory execution engine that can be used to test and debug
pipelines on local data.
 Pig and Cascading use a ‘tuple model’, whereas Crunch uses arbitrary objects.
 Trade-off:
 Simple data types requiring only basic built-in functions – use Cascading
 Complex data types requiring more user-defined functions – use Crunch
 Crunch’s compile-time type checking is highly useful.
QUERIES?