This document provides an overview and comparison of Pig, Hive, and Cascading, three batch-processing tools for Hadoop. It begins with a brief history of each tool: Pig was created at Yahoo Research in 2006 to enable log analytics; Hive was developed by Facebook in 2007 to provide SQL-like queries over Hadoop data; and Cascading, authored by Chris Wensel in 2008, is associated with the Scalding and Cascalog projects. The document then compares the tools' procedural versus declarative programming models, data typing approaches, integration capabilities, and performance/optimization characteristics, to help users choose the right technology.
Hadoop Batch Tools: Pig, Hive, Cascading
1. Hadoop Is A Batch
Pig, Hive, Cascading …
Paris JUG, May 2013
Florian Douetteau
2. Florian Douetteau <florian.douetteau@dataiku.com>
CEO at Dataiku
Freelance at Criteo (Online Ads)
CTO at IsCool Ent. (#1 French Social Gamer)
VP R&D Exalead (Search Engine Technology)
About me
15/05/2013 Dataiku Training – Hadoop for Data Science
3. Hadoop and Context (->0:03)
Pig, Hive, Cascading, … (->0:09)
How they work (->0:15)
Comparing the tools (->0:35)
Make them work together (->0:40)
Wrap-up and questions (->Beer)
Agenda
Dataiku - Pig, Hive and Cascading
4. CHOOSE TECHNOLOGY
Hadoop
Ceph
Sphere
Cassandra
Spark
Scikit-Learn
Mahout
WEKA
MLBase LibSVM
SAS
RapidMiner
SPSS
Pandas
QlikView
Tableau
SpotFire
HTML5/D3
InfiniDB
Vertica
GreenPlum
Impala
Netezza
Elastic Search
SOLR
MongoDB
Riak
Membase
Pig
Cascading
Talend
Machine Learning
Mystery Land
Scalability Central
NoSQL-Slavia
SQL Columnar Republic
Visualization County
Data Clean Wasteland
Statistician Old House
R
5. How do I (pre)process data?
Implicit User Data
(Views, Searches…)
Content Data
(Title, Categories, Price, …)
Explicit User Data
(Click, Buy, …)
User Information
(Location, Graph…)
500TB
50TB
1TB
200GB
Transformation
Matrix
Transformation
Predictor
Per User Stats
Per Content Stats
User Similarity
Rank Predictor
Content Similarity
A/B Test Data
Predictor Runtime
Online User Information
6. Analyse Raw Logs
(Trackers, Web Logs)
Extract IP, Page, …
Detect and remove robots
Build Statistics
◦ Number of page views, per product
◦ Best Referrers
◦ Traffic Analysis
◦ Funnel
◦ SEO Analysis
◦ …
Typical Use Case 1
Web Analytics Processing
7. Extract Query Logs
Perform query normalization
Compute Ngrams
Compute Search “Sessions”
Compute Log-Likelihood Ratio for ngrams across sessions
Typical Use Case 2
Mining Search Logs for Synonyms
8. Compute User-Product Association Matrix
Compute different similarity ratios (Ochiai, Cosine, …)
Filter out bad predictions
For each user, select the best recommendable products
Typical Use Case 3
Product Recommender
9. Hadoop and Context (->0:03)
Pig, Hive, Cascading, … (->0:09)
How they work (->0:15)
Comparing the tools (->0:35)
Make them work together (->0:40)
Wrap-up and questions (->Beer)
Agenda
10. Yahoo Research in 2006
Inspired by Sawzall, a Google paper from 2003
2007 as an Apache Project
Initial motivation
◦ Search Log Analytics: how long is the average user session? how many links does a user click on before leaving a website? how do click patterns vary in the course of a day/week/month? …
Pig History
words = LOAD '/training/hadoop-wordcount/output'
        USING PigStorage('\t')
        AS (word:chararray, count:int);
sorted_words = ORDER words BY count DESC;
first_words = LIMIT sorted_words 10;
DUMP first_words;
11. Developed by Facebook in January 2007
Open source in August 2008
Initial Motivation
◦ Provide a SQL-like abstraction to perform statistics on status updates
Hive History
create external table wordcounts (
  word string,
  count int
) row format delimited fields terminated by '\t'
location '/training/hadoop-wordcount/output';

select * from wordcounts order by count desc limit 10;

select SUM(count) from wordcounts where word like 'th%';
12. Authored by Chris Wensel in 2008
Associated Projects
◦ Cascalog: Cascading in Clojure
◦ Scalding: Cascading in Scala (Twitter, 2012)
◦ Lingual (to be released soon): SQL layer on top of Cascading
Cascading History
13. Hadoop and Context (->0:03)
Pig, Hive, Cascading, … (->0:09)
How they work (->0:15)
Comparing the tools (->0:35)
Make them work together (->0:40)
Wrap-up and questions (->Beer)
Agenda
15. Pig & Hive
Mapping to Mapreduce jobs
5/15/2013 Dataiku - Innovation Services
events = LOAD '/events' USING PigStorage('\t') AS
    (type:chararray, user:chararray, price:int, timestamp:int);
events_filtered = FILTER events BY type == 'buy'; -- filter condition restored for a valid script
by_user = GROUP events_filtered BY user;
price_by_user = FOREACH by_user GENERATE group AS user,
    SUM(events_filtered.price) AS total_price,
    MAX(events_filtered.timestamp) AS max_ts;
high_pbu = FILTER price_by_user BY total_price > 1000;
Job 1 Mapper: LOAD, FILTER
Shuffle and sort by user
Job 1 Reducer: GROUP, FOREACH, FILTER
16. Pig & Hive
Mapping to Mapreduce jobs
events = LOAD '/events' USING PigStorage('\t') AS
    (type:chararray, user:chararray, price:int, timestamp:int);
events_filtered = FILTER events BY type == 'buy'; -- filter condition restored for a valid script
by_user = GROUP events_filtered BY user;
price_by_user = FOREACH by_user GENERATE group AS user,
    SUM(events_filtered.price) AS total_price,
    MAX(events_filtered.timestamp) AS max_ts;
high_pbu = FILTER price_by_user BY total_price > 1000;
recent_high = ORDER high_pbu BY max_ts DESC;
STORE recent_high INTO '/output';
Job 1 Mapper: LOAD, FILTER
Shuffle and sort by user
Job 1 Reducer: GROUP, FOREACH, FILTER
Job 2 Mapper: LOAD (from tmp)
Shuffle and sort by max_ts
Job 2 Reducer: STORE
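As a sanity check, the two-job plan above can be simulated outside Hadoop. This is an illustrative Python sketch: the data and the 'buy' filter condition are invented for the example, not taken from the deck.

```python
from collections import defaultdict

# Illustrative simulation of the two-job Pig plan above (invented data).
events = [
    ("buy", "alice", 800, 10),
    ("buy", "alice", 900, 12),
    ("buy", "bob", 300, 11),
    ("view", "bob", 0, 13),
]

# Job 1 mapper: LOAD + FILTER
filtered = [e for e in events if e[0] == "buy"]

# Job 1 shuffle: GROUP BY user
by_user = defaultdict(list)
for _type, user, price, ts in filtered:
    by_user[user].append((price, ts))

# Job 1 reducer: FOREACH ... GENERATE SUM(price), MAX(timestamp), then FILTER
price_by_user = {
    user: (sum(p for p, _ in rows), max(t for _, t in rows))
    for user, rows in by_user.items()
}
high_pbu = {u: v for u, v in price_by_user.items() if v[0] > 1000}

# Job 2: ORDER BY max_ts DESC forces a second shuffle/sort pass
recent_high = sorted(high_pbu.items(), key=lambda kv: -kv[1][1])
print(recent_high)  # [('alice', (1700, 12))]
```

The final `sorted` call is exactly why Pig needs a second MapReduce job: a global ORDER BY cannot reuse the grouping shuffle of job 1.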
17. Pig
How does it work
Data execution plan compiled into 10 MapReduce jobs, executed in parallel (or not)
TResolution = LOAD '$PREFIX/dwh_dim_external_tracking_resolution/dt=$DAY' USING PigStorage('\u0001');
TResolution = FOREACH TResolution GENERATE $0 AS SKResolutionId, $1 AS ResolutionId;

TSiteMap = LOAD '$PREFIX/dwh_dim_sitemapnode/dt=$DAY' USING PigStorage('\u0001');
TSiteMap = FOREACH TSiteMap GENERATE $0 AS SKSimteMapNodeId, $2 AS SiteMapNodeId;

TCustomer = LOAD '$PREFIX/customer_relation/dt=$DAY' USING PigStorage('\u0001')
    AS (SKCustomerId:chararray,
        CustomerId:chararray);

F1 = FOREACH F1 GENERATE *, (date_time IS NOT NULL ? CustomFormatToISO(date_time, 'yyyy-MM-dd HH:mm:ss'

F2 = FOREACH F1 GENERATE *,
    CONCAT(CONCAT(CONCAT(CONCAT(visid_high, '-'), visid_low), '-'), visit_num) AS VisitId,
    (referrer matches '.*cdiscount.com.*' OR referrer matches 'cdscdn.com' ? NULL : referrer) AS Referrer,
    (iso IS NOT NULL ? ISODaysBetween(iso, '1899-12-31T00:00:00') : NULL) AS SkDateId,
    (iso IS NOT NULL ? ISOSecondsBetween(iso, ISOToDay(iso)) : NULL) AS SkTimeId,
    ((event_list is not null and event_list matches '.*\\b202\\b.*') ? 'Y' : 'N') AS is_202,
    ((event_list is not null and event_list matches '.*\\b10\\b.*') ? 'Y' : 'N') AS is_10,
    ((event_list is not null and event_list matches '.*\\b12\\b.*') ? 'Y' : 'N') AS is_12,
    ((event_list is not null and event_list matches '.*\\b13\\b.*') ? 'Y' : 'N') AS is_13,
    ((event_list is not null and event_list matches '.*\\b14\\b.*') ? 'Y' : 'N') AS is_14,
    ((event_list is not null and event_list matches '.*\\b11\\b.*') ? 'Y' : 'N') AS is_11,
    ((event_list is not null and event_list matches '.*\\b1\\b.*') ? 'Y' : 'N') AS is_1,
    REGEX_EXTRACT(pagename, 'F-(.*):.*', 1) AS ProductReferenceId,
    NULL AS OriginFile;

SET DEFAULT_PARALLEL 24;

F3 = JOIN F2 BY post_search_engine LEFT, TSearchEngine BY SearchEngineId USING 'replicated' PARALLEL 20;
F3 = FOREACH F3 GENERATE *, (SKSearchEngineId IS NULL ? '-1' : SKSearchEngineId) AS SKSearchEngineId;
--F3 = FOREACH F2 GENERATE *, NULL AS SKSearchEngineId, NULL AS SearchEngineId;

F4 = JOIN F3 BY browser LEFT, TBrowser BY BrowserId USING 'replicated' PARALLEL 20;
F4 = FOREACH F4 GENERATE *, (SKBrowserId IS NULL ? '-1' : SKBrowserId) AS SKBrowserId;
--F4 = FOREACH F3 GENERATE *, NULL AS SKBrowserId, NULL AS BrowserId;

F5 = JOIN F4 BY os LEFT, TOperatingSystem BY OperatingSystemId USING 'replicated' PARALLEL 20;
F5 = FOREACH F5 GENERATE *, (SKOperatingSystemId IS NULL ? '-1' : SKOperatingSystemId) AS SKOperatingSystemId;
--F5 = FOREACH F4 GENERATE *, NULL AS SKOperatingSystemId, NULL AS OperatingSystemId;

F6 = JOIN F5 BY resolution LEFT, TResolution BY ResolutionId USING 'replicated' PARALLEL 20;
F6 = FOREACH F6 GENERATE *, (SKResolutionId IS NULL ? '-1' : SKResolutionId) AS SKResolutionId;
--F6 = FOREACH F5 GENERATE *, NULL AS SKResolutionId, NULL AS ResolutionId;

F7 = JOIN F6 BY post_evar4 LEFT, TSiteMap BY SiteMapNodeId USING 'replicated' PARALLEL 20;
F7 = FOREACH F7 GENERATE *, (SKSimteMapNodeId IS NULL ? '-1' : SKSimteMapNodeId) AS SKSimteMapNodeId;
--F7 = FOREACH F6 GENERATE *, NULL AS SKSimteMapNodeId, NULL AS SiteMapNodeId;

SPLIT F7 INTO WITHOUT_CUSTOMER IF post_evar30 IS NULL, WITH_CUSTOMER IF post_evar30 IS NOT NULL;

F8 = JOIN WITH_CUSTOMER BY post_evar30 LEFT, TCustomer BY CustomerId USING 'skewed' PARALLEL 20;
WITHOUT_CUSTOMER = FOREACH WITHOUT_CUSTOMER GENERATE *, NULL AS SKCustomerId, NULL AS CustomerId;

--F8_UNION = FOREACH F7 GENERATE *, NULL AS SKCustomerId, NULL AS CustomerId;
F8_UNION = UNION F8, WITHOUT_CUSTOMER;
--DESCRIBE F8;
--DESCRIBE WITHOUT_CUSTOMER;
--DESCRIBE F8_UNION;

F9 = FOREACH F8_UNION GENERATE
    visid_high,
    visid_low,
    VisitId,
    post_evar30,
    SKCustomerId,
    visit_num,
    SkDateId,
    SkTimeId,
    post_evar16,
    post_evar52,
    visit_page_num,
    is_202,
    is_10,
    is_12,
18. Hive Joins
How to join with MapReduce?

Mapper output, table 1 (tbl_idx=1):
  uid  name
  1    Dupont
  2    Durand

Mapper output, table 2 (tbl_idx=2):
  uid  type
  1    Type1
  1    Type2
  2    Type1

Shuffle by uid
Sort by (uid, tbl_idx)

Reducer 1 input (uid=1):
  uid  tbl_idx  name    type
  1    1        Dupont
  1    2                Type1
  1    2                Type2
Reducer 1 output:
  uid  name    type
  1    Dupont  Type1
  1    Dupont  Type2

Reducer 2 input (uid=2):
  uid  tbl_idx  name    type
  2    1        Durand
  2    2                Type1
Reducer 2 output:
  uid  name    type
  2    Durand  Type1
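The shuffle-and-sort join above can be reproduced in a few lines. This is an illustrative Python sketch using the same Dupont/Durand rows; the tagging scheme mirrors the slide, but the code itself belongs to none of the three tools.

```python
from collections import defaultdict

# Illustrative reduce-side join: records are tagged with a table index,
# shuffled by uid and sorted by (uid, tbl_idx), so the "name" record
# reaches the reducer before the "type" records for the same uid.
names = [(1, "Dupont"), (2, "Durand")]              # tbl_idx = 1
types = [(1, "Type1"), (1, "Type2"), (2, "Type1")]  # tbl_idx = 2

# Map phase: emit (uid, (tbl_idx, value))
mapped = [(uid, (1, n)) for uid, n in names]
mapped += [(uid, (2, t)) for uid, t in types]

# Shuffle phase: partition by uid
partitions = defaultdict(list)
for uid, tagged in mapped:
    partitions[uid].append(tagged)

# Reduce phase: for each uid, the tbl_idx=1 record arrives first (sort
# order), so its name can be joined against every tbl_idx=2 record.
joined = []
for uid in sorted(partitions):
    name = None
    for tbl_idx, value in sorted(partitions[uid]):
        if tbl_idx == 1:
            name = value
        else:
            joined.append((uid, name, value))
print(joined)
# [(1, 'Dupont', 'Type1'), (1, 'Dupont', 'Type2'), (2, 'Durand', 'Type1')]
```

Sorting by (uid, tbl_idx) is the trick: the reducer only has to buffer the small "name" side, never the whole uid group.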
19. Hadoop and Context (->0:03)
Pig, Hive, Cascading, … (->0:09)
How they work (->0:15)
Comparing the tools (->0:35)
Make them work together (->0:40)
Wrap-up and questions (->Beer)
Agenda
20. Philosophy
◦ Procedural Vs Declarative
◦ Data Model and Schema
Productivity
◦ Headachability
◦ Checkpointing
◦ Testing and environment
Integration
◦ Partitioning
◦ Formats Integration
◦ External Code Integration
Performance and optimization
Comparing without Comparable
21. Transformation as a sequence of operations (procedural)
vs. transformation as a set of formulas (declarative)
Procedural vs. Declarative
insert into ValuableClicksPerDMA
select dma, count(*)
from geoinfo join (
    select name, ipaddr
    from users join clicks on (users.name = clicks.user)
    where value > 0
) using ipaddr
group by dma;
Users = load 'users' as (name, age, ipaddr);
Clicks = load 'clicks' as (user, url, value);
ValuableClicks = filter Clicks by value > 0;
UserClicks = join Users by name, ValuableClicks by user;
Geoinfo = load 'geoinfo' as (ipaddr, dma);
UserGeo = join UserClicks by ipaddr, Geoinfo by ipaddr;
ByDMA = group UserGeo by dma;
ValuableClicksPerDMA = foreach ByDMA generate group, COUNT(UserGeo);
store ValuableClicksPerDMA into 'ValuableClicksPerDMA';
22. All three extend the basic data model with extended data types
◦ array-like: [event1, event2, event3]
◦ map-like: {type1:value1, type2:value2, …}
Different approaches
◦ Resilient schema
◦ Static typing
◦ No static typing
Data Type and Model
Rationale
23. Hive
Data Type and Schema
Simple type                      Details
TINYINT, SMALLINT, INT, BIGINT   1, 2, 4 and 8 bytes
FLOAT, DOUBLE                    4 and 8 bytes
BOOLEAN
STRING                           Arbitrary-length, replaces VARCHAR
TIMESTAMP

Complex type   Details
ARRAY          Array of typed items (0-indexed)
MAP            Associative map
STRUCT         Complex class-like objects
CREATE TABLE visit (
user_name STRING,
user_id INT,
user_details STRUCT<age:INT, zipcode:INT>
);
24. Data Types and Schema
Pig

rel = LOAD '/folder/path/'
      USING PigStorage('\t')
      AS (col:type, col:type, col:type);

Simple type              Details
int, long, float, double 32 and 64 bits, signed
chararray                A string
bytearray                An array of bytes
boolean                  A boolean

Complex type   Details
tuple          an ordered fieldname:value map
bag            a set of tuples
25. Support for any Java type, provided it can be serialized in Hadoop
No support for typing
Data Type and Schema
Cascading

Simple type              Details
Int, Long, Float, Double 32 and 64 bits, signed
String                   A string
byte[]                   An array of bytes
Boolean                  A boolean

Complex type   Details
Object         Object must be "Hadoop serializable"
26. Style Summary

           Style        Typing                      Data Model             Metadata store
Pig        Procedural   Static + dynamic            scalar + tuple + bag   No (HCatalog)
                                                    (fully recursive)
Hive       Declarative  Static + dynamic,           scalar + list + map    Integrated
                        enforced at execution time
Cascading  Procedural   Weak                        scalar + Java objects  No
27. Philosophy
◦ Procedural Vs Declarative
◦ Data Model and Schema
Productivity
◦ Headachability
◦ Checkpointing
◦ Testing, error management and environment
Integration
◦ Partitioning
◦ Formats Integration
◦ External Code Integration
Performance and optimization
Comparing without Comparable
28. Does debugging the tool lead to bad headaches?
Headachability
Motivation
29. Out of Memory Error (Reducer)
Exceptions when building extended functions (handling of null)
Null vs ""
Nested FOREACH and scoping
Date management (Pig 0.10)
Field implicit ordering
Headaches
Pig
31. Out of Memory Errors in reducers
Few debugging options
Null vs ""
No builtin "first"
Headaches
Hive
32. Weak typing errors (comparing Int and String …)
Illegal operation sequences (group after group …)
Field implicit ordering
Headaches
Cascading
33. How to perform unit tests?
How to have different versions of the same script (parameters)?
Testing
Motivation
34. System variables
Comment out code to test
No metaprogramming
pig -x local to execute on local files
Testing
Pig
35. JUnit tests are possible
Ability to use code to comment out some variables
Testing / Environment
Cascading
36. Lots of iteration while developing on Hadoop
Sometimes jobs fail
Sometimes you need to restart from the start …
Checkpointing
Motivation

Parse Logs -> Filtering -> Per Page Stats -> Page User Correlation -> Output
FIX and relaunch
37. STORE command to manually store files
Pig
Manual Checkpointing

Parse Logs -> Filtering -> Per Page Stats -> Page User Correlation -> Output
// COMMENT beginning of script and relaunch
38. Ability to re-run a flow automatically from the last saved checkpoint
Cascading
Automated Checkpointing
addCheckpoint(…)
39. Check each intermediate file's timestamp
Execute a step only if its inputs are more recent
Cascading
Topological Scheduler

Parse Logs -> Filtering -> Per Page Stats -> Page User Correlation -> Output
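The timestamp rule above is essentially make-style scheduling. A minimal Python sketch, where the `needs_rerun` helper is hypothetical (it is not a Cascading API):

```python
import os
import tempfile

def needs_rerun(inputs, output):
    """Re-run a step iff its output is missing or older than any input."""
    if not os.path.exists(output):
        return True
    out_ts = os.path.getmtime(output)
    return any(os.path.getmtime(p) > out_ts for p in inputs)

with tempfile.TemporaryDirectory() as d:
    logs = os.path.join(d, "parsed_logs")
    stats = os.path.join(d, "per_page_stats")
    open(logs, "w").close()
    rerun_before = needs_rerun([logs], stats)  # True: no output yet
    open(stats, "w").close()
    os.utime(logs, (1000, 1000))               # input older ...
    os.utime(stats, (2000, 2000))              # ... than output
    rerun_after = needs_rerun([logs], stats)   # False: output up to date
print(rerun_before, rerun_after)  # True False
```

Applied along a topologically sorted flow, this check is what lets a failed pipeline resume from the last valid intermediate file instead of from the start.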
40. Productivity Summary

           Headaches                 Checkpointing / Replay          Testing / Metaprogramming
Pig        Lots                      Manual save                     Difficult metaprogramming,
                                                                     easy local testing
Hive       Few, but without          None (that's SQL)               None (that's SQL)
           debugging options
Cascading  Weak typing complexity    Checkpointing, partial updates  Possible
41. Philosophy
◦ Procedural Vs Declarative
◦ Data Model and Schema
Productivity
◦ Headachability
◦ Checkpointing
◦ Testing and environment
Integration
◦ Formats Integration
◦ Partitioning
◦ External Code Integration
Performance and optimization
Comparing without Comparable
42. Ability to integrate different file formats
◦ Text delimited
◦ Sequence File (binary Hadoop format)
◦ Avro, Thrift …
Ability to integrate with external data sources or sinks (MongoDB, ElasticSearch, databases, …)
Formats Integration
Motivation

Format impact on size and performance

Format                        Size on disk (GB)  Hive processing time (24 cores)
Text file, uncompressed       18.7               1m32s
1 text file, gzipped          3.89               6m23s (no parallelization)
JSON, compressed              7.89               2m42s
Multiple text files, gzipped  4.02               43s
Sequence file, block, gzip    5.32               1m18s
Text file, LZO indexed        7.03               1m22s
43. Hive: SerDe (Serializer-Deserializer)
Pig: Storage
Cascading: Tap
Format Integration
44. No support for "UPDATE" patterns; any increment is performed by adding or deleting a partition
Common partition schemes on Hadoop
◦ By date: /apache_logs/dt=2013-01-23
◦ By data center: /apache_logs/dc=redbus01/…
◦ By country
◦ …
◦ Or any combination of the above
Partitions
Motivation
45. Hive Partitioning
Partitioned tables

CREATE TABLE event (
  user_id INT,
  type STRING,
  message STRING)
PARTITIONED BY (day STRING, server_id STRING);

Disk structure
/hive/event/day=2013-01-27/server_id=s1/file0
/hive/event/day=2013-01-27/server_id=s1/file1
/hive/event/day=2013-01-27/server_id=s2/file0
/hive/event/day=2013-01-27/server_id=s2/file1
…
/hive/event/day=2013-01-28/server_id=s2/file0
/hive/event/day=2013-01-28/server_id=s2/file1

INSERT OVERWRITE TABLE event PARTITION(day='2013-01-27', server_id='s1')
SELECT * FROM event_tmp;
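The day=/server_id= directory layout above is easy to generate mechanically. A small Python sketch; the `partition_path` helper is hypothetical, not part of Hive:

```python
# Hypothetical helper mirroring Hive's on-disk partition layout: one
# directory level per partition column, encoded as key=value (the names
# day/server_id follow the CREATE TABLE above).
def partition_path(base, **parts):
    segments = ["%s=%s" % (k, v) for k, v in parts.items()]
    return "/".join([base] + segments)

p = partition_path("/hive/event", day="2013-01-27", server_id="s1")
print(p)  # /hive/event/day=2013-01-27/server_id=s1
```

Because the partition values live in the path, dropping a day of data is a directory delete, which is exactly why "increments" on Hadoop are done by adding or removing partitions rather than UPDATEs.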
46. No direct support for partitions
Support for "Glob" Taps, to read from files using patterns
You can code your own custom or virtual partition schemes
Cascading Partition
50. Allows calling a Cascading flow from a Spring Batch job
Spring Batch
Cascading Integration
No full integration with Spring MessageSource or MessageHandler yet (only for local flows)
51. Integration Summary

           Partition / Incremental Updates  External Code                        Format Integration
Pig        No direct support                Simple                               Doable, rich community
Hive       Fully integrated, SQL-like       Very simple, but complex dev setup   Doable, existing community
Cascading  With coding                      Complex UDFs, but regular and        Doable, growing community
                                            Java-expression embeddable
52. Philosophy
◦ Procedural Vs Declarative
◦ Data Model and Schema
Productivity
◦ Headachability
◦ Checkpointing
◦ Testing and environment
Integration
◦ Formats Integration
◦ Partitioning
◦ External Code Integration
Performance and optimization
Comparing without Comparable
53. Several common MapReduce optimization patterns
◦ Combiners
◦ MapJoin
◦ Job Fusion
◦ Job Parallelism
◦ Reducer Parallelism
Different support per framework
◦ Fully Automatic
◦ Pragma / Directives / Options
◦ Coding style / Code to write
Optimization
54. Combiner
Perform partial aggregates at the mapper stage
SELECT date, COUNT(*) FROM product GROUP BY date

Without a combiner, each mapper ships its raw rows to the reducer:
Map output: 2012-02-14 4354 … 2012-02-15 21we2 / 2012-02-14 qa334 … 2012-02-15 23aq2
Reduce output: 2012-02-14 20 / 2012-02-15 35 / 2012-02-16 1

55. With a combiner, each mapper ships partial counts instead:
Mapper 1 partials: 2012-02-14 12 / 2012-02-15 23 / 2012-02-16 1
Mapper 2 partials: 2012-02-14 8 / 2012-02-15 12
Reduce output: 2012-02-14 20 / 2012-02-15 35 / 2012-02-16 1

Reduced network bandwidth. Better parallelism
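The partial-aggregation arithmetic above can be checked with a short simulation. An illustrative Python sketch: the per-date row counts follow the slide, everything else is invented.

```python
from collections import Counter

# Illustrative combiner simulation for
#   SELECT date, COUNT(*) FROM product GROUP BY date
# Each mapper pre-aggregates its own rows so that only partial counts
# cross the network; the reducer then merges the partials.
mapper1 = ["2012-02-14"] * 12 + ["2012-02-15"] * 23 + ["2012-02-16"]
mapper2 = ["2012-02-14"] * 8 + ["2012-02-15"] * 12

# Combiner: partial counts computed locally on each mapper
partial1, partial2 = Counter(mapper1), Counter(mapper2)

# Reducer: merge the partial counts into final totals
totals = partial1 + partial2
print(sorted(totals.items()))
# [('2012-02-14', 20), ('2012-02-15', 35), ('2012-02-16', 1)]

# Shuffle volume: 5 partial counts instead of 56 raw rows
raw_rows = len(mapper1) + len(mapper2)    # 56
shuffled = len(partial1) + len(partial2)  # 5
```

This only works because COUNT (like SUM, MIN, MAX) is associative and commutative; a combiner cannot be applied blindly to aggregates such as AVG without first rewriting them as (sum, count) pairs.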
56. Join Optimization
Map Join
Hive: set hive.auto.convert.join = true;
Pig: JOIN … USING 'replicated'
Cascading: HashJoin (no aggregation support after HashJoin)
57. Number of Reducers
Critical for performance
Estimated from the size of the input files
◦ Hive: divide input size by hive.exec.reducers.bytes.per.reducer (default 1GB)
◦ Pig: divide input size by pig.exec.reducers.bytes.per.reducer (default 1GB)
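The sizing rule can be sketched as simple arithmetic; the `estimated_reducers` helper is illustrative, not an API of either tool.

```python
import math

# Hedged sketch of the sizing rule above: reducer count is roughly the
# input size divided by *.exec.reducers.bytes.per.reducer (1 GB default).
def estimated_reducers(input_bytes, bytes_per_reducer=1 << 30):
    return max(1, math.ceil(input_bytes / bytes_per_reducer))

print(estimated_reducers(18_700_000_000))  # 18: e.g. the 18.7 GB text file above
print(estimated_reducers(200 * 1024**2))   # 1: small inputs get a single reducer
```

Note that the estimate is driven by input size, not output size: a highly selective query over a huge input can end up with many nearly idle reducers unless the setting is tuned.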
59. Hadoop and Context (->0:03)
Pig, Hive, Cascading, … (->0:09)
How they work (->0:15)
Comparing the tools (->0:35)
Make them work together (->0:40)
Wrap-up and questions (->Beer)
Agenda
60. Follow the Flow
Tracker Log
MongoDB
MySQL
MySQL
Syslog
Product
Catalog
Order
Apache Logs
Session
Product Transformation
Category Affinity
Category Targeting
Customer Profile
Product Recommender
S3
Search Logs (External) Search Engine
Optimization
(Internal) Search
Ranking
MongoDB
MySQL
Partner FTP
Sync In Sync Out
Pig
Pig
Hive
Hive
ElasticSearch
61. E.g. Product Recommender
Page Views
Orders
Catalog
Bots, Special Users
Filtered Page Views
User Affinity
Product Popularity
User Similarity (Per Category)
Recommendation Graph
Recommendation
Order Summary
User Similarity (Per Brand)
Machine Learning
62. Schema maintenance between tools
Proper incremental and efficient synchronization between tools, NoSQL stores and log systems
Proper partition management (daily jobs, …)
Job sequencing and management
◦ How to properly handle a new field? missing data? recompute everything?
Pain Points
On Large Projects
63. HCatalog provides interoperability between Hive and Pig in terms of schema
Integration Option
HCatalog
Hive Pig
HCatalog
65. Hadoop and Context (->0:03)
Pig, Hive, Cascading, … (->0:09)
How they work (->0:15)
Comparing the tools (->0:35)
Make them work together (->0:40)
Wrap-up and questions (->Beer)
Agenda
66. Want to keep close to SQL?
◦ Hive
Want to write large flows?
◦ Pig
Want to integrate into large-scale programming projects?
◦ Cascading (Cascalog / Scalding)
Presentation Available On
http://www.slideshare.net/Dataiku