SlideShare a Scribd company logo
1 of 20
Download to read offline
Introduction to Hadoop and MapReduce
Colin Su, Tagtoo
Advertisement System Architecture (now)
Advertisement System Architecture (future)
• Grid
• Ad Server
• Data Highway
• Steaming Computing
Grid
• Core:
• Data mining
• Machine Learning
• Collecting data from users, logs and calculate out the strategy
• Sort our data in a proper form, them we could use it anytime

Data -> Information
Ad Server
• Ranking
• According the “information” in Grid, decide which AD should be advertised
• show proper ads to website visitors
Data Highway
• Transfer your data to the proper place
Stream Computing
• Core:
• logging
• feedback
• anti-cheating
• pricing
• post-process everything thrown out from Ad Server, and feedback useful information to Grid
• be the entrance of advertisement system
Hadoop
• an open-source software framework for data scientists
• derives from Google’s MapReduce and Google File System (GFS) papers
• written in Java
• could be divided in to 2 components:
• MapReduce
• HDFS (Hadoop distributed file system)
• a yellow elephant
Why Hadoop?
• moving computation is much cheaper and easier than moving data
• “Big Data”, the amount of data becomes too large, need a effective way to manage it
• so does computation
• high fault-tolerance
• developed by Yahoo!
MapReduce
• a programming model for processing “large data sets” with a “parallel, distributed” algorithm on a cluster
• different from map/reduce, the conception of functional programming, but actually they have the same idea,
“divide and conquer”
• proposed by Google
Functional “map/reduce”
• map()/reduce() in Python
• map(function(elem), list) -> list
• reduce(function(elem1, elem2), list) -> single result
• e.g.
• map(lambda x: x*2, [1,2,3,4]) => [2,4,6,8]
• reduce(lambda x,y: x+y, [1,2,3,4]) => 10
Parallel “MapReduce” 5 Steps
•

prepare the map() input for mappers

•

mappers run the map() code -> generated intermediate pairs

•

dispatch intermediate pairs to reducers

•

reducers run the reduce() code, aggregate the results

•

prepare output from the result of reduce()
Example of “MapReduce” Word Count

map()

reduce()
Example of “MapReduce” Word Count
• Original Input

Apple Orange Mongo
Orange Grapes Plum
...
Example of “MapReduce” Word Count
• Prepare data for mappers

Apple Orange Mongo

Orange Grapes Plum

...
Example of “MapReduce” Word Count
• map() to useful record

(Apple, 1)

Apple Orange Mongo

(Orange, 1)

(Mongo, 1)

Intermediate key/value pair
Example of “MapReduce” Word Count
• sort and shuffle
(Apple, 1)
(Mongo, 1)

(Apple, 1)

(Orange, 1)

Reducer

(Apple, 1)

(Mongo, 1)

(Apple, 1)

(Orange, 1)

(Orange, 1)
Shuffle to Reducers

(Orange, 1)

(Orange, 1)

(Apple, 1)

(Mongo, 1)

(Apple, 1)

(Mongo, 1)

unsorted

Sorted

(Orange, 1)
Reducer

(Mongo, 1)
(Mongo, 1)
Reducer
Example of “MapReduce” Word Count
• Reduce()

(Apple, 1)
(Apple, 1)

(Apple, 2)

Reducer

(Orange, 1)
(Orange, 1)
(Orange, 1)
Reducer

(Orange, 3)
Example of “MapReduce” Word Count
• Generate Output

(Apple, 2)

(Orange, 3)

(Grapes, 1)

(Plum, 5)

Apple 2
Orange 3
Grapes 1
Plum 5
WordCount.txt
Hadoop Infrastructure
• Pig: Programming Language for MapReduce
• Thrift: cross-language communication, just like Google’s ProtoBuffer
• Zookeeper: cluster management

Hadoop

Hadoop

Other Services

Thrift

MapReduce
Pig

Hadoop

HDFS

Hadoop

Hadoop
ZooKeeper

More Related Content

What's hot

Hadoop & Hive Change the Data Warehousing Game Forever
Hadoop & Hive Change the Data Warehousing Game ForeverHadoop & Hive Change the Data Warehousing Game Forever
Hadoop & Hive Change the Data Warehousing Game Forever
DataWorks Summit
 
Netflix running Presto in the AWS Cloud
Netflix running Presto in the AWS CloudNetflix running Presto in the AWS Cloud
Netflix running Presto in the AWS Cloud
Zhenxiao Luo
 

What's hot (20)

Facebook Retrospective - Big data-world-europe-2012
Facebook Retrospective - Big data-world-europe-2012Facebook Retrospective - Big data-world-europe-2012
Facebook Retrospective - Big data-world-europe-2012
 
Hadoop & Hive Change the Data Warehousing Game Forever
Hadoop & Hive Change the Data Warehousing Game ForeverHadoop & Hive Change the Data Warehousing Game Forever
Hadoop & Hive Change the Data Warehousing Game Forever
 
Spark's Role in the Big Data Ecosystem (Spark Summit 2014)
Spark's Role in the Big Data Ecosystem (Spark Summit 2014)Spark's Role in the Big Data Ecosystem (Spark Summit 2014)
Spark's Role in the Big Data Ecosystem (Spark Summit 2014)
 
Hd insight essentials quick view
Hd insight essentials quick viewHd insight essentials quick view
Hd insight essentials quick view
 
Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...
Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...
Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...
 
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a ServiceZeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
 
The Meta of Hadoop - COMAD 2012
The Meta of Hadoop - COMAD 2012The Meta of Hadoop - COMAD 2012
The Meta of Hadoop - COMAD 2012
 
The Evolution of Apache Kylin
The Evolution of Apache KylinThe Evolution of Apache Kylin
The Evolution of Apache Kylin
 
Introduction to MapReduce
Introduction to MapReduceIntroduction to MapReduce
Introduction to MapReduce
 
EMR AWS Demo
EMR AWS DemoEMR AWS Demo
EMR AWS Demo
 
Putting Lipstick on Apache Pig at Netflix
Putting Lipstick on Apache Pig at NetflixPutting Lipstick on Apache Pig at Netflix
Putting Lipstick on Apache Pig at Netflix
 
Big data & hadoop
Big data & hadoopBig data & hadoop
Big data & hadoop
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Advance Map reduce - Apache hadoop Bigdata training by Design Pathshala
Advance Map reduce - Apache hadoop Bigdata training by Design PathshalaAdvance Map reduce - Apache hadoop Bigdata training by Design Pathshala
Advance Map reduce - Apache hadoop Bigdata training by Design Pathshala
 
How Adobe Does 2 Million Records Per Second Using Apache Spark!
How Adobe Does 2 Million Records Per Second Using Apache Spark!How Adobe Does 2 Million Records Per Second Using Apache Spark!
How Adobe Does 2 Million Records Per Second Using Apache Spark!
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
Cost effective BigData Processing on Amazon EC2
Cost effective BigData Processing on Amazon EC2Cost effective BigData Processing on Amazon EC2
Cost effective BigData Processing on Amazon EC2
 
Map Reduce introduction
Map Reduce introductionMap Reduce introduction
Map Reduce introduction
 
HBaseCon 2015: Apache Kylin - Extreme OLAP Engine for Hadoop
HBaseCon 2015: Apache Kylin - Extreme OLAP  Engine for HadoopHBaseCon 2015: Apache Kylin - Extreme OLAP  Engine for Hadoop
HBaseCon 2015: Apache Kylin - Extreme OLAP Engine for Hadoop
 
Netflix running Presto in the AWS Cloud
Netflix running Presto in the AWS CloudNetflix running Presto in the AWS Cloud
Netflix running Presto in the AWS Cloud
 

Viewers also liked

Introduction to Hadoop Administration
Introduction to Hadoop AdministrationIntroduction to Hadoop Administration
Introduction to Hadoop Administration
Edureka!
 
An Introduction to MapReduce
An Introduction to MapReduceAn Introduction to MapReduce
An Introduction to MapReduce
Frane Bandov
 
Introduction to MapReduce | MapReduce Architecture | MapReduce Fundamentals
Introduction to MapReduce | MapReduce Architecture | MapReduce FundamentalsIntroduction to MapReduce | MapReduce Architecture | MapReduce Fundamentals
Introduction to MapReduce | MapReduce Architecture | MapReduce Fundamentals
Skillspeed
 
Big data, map reduce and beyond
Big data, map reduce and beyondBig data, map reduce and beyond
Big data, map reduce and beyond
datasalt
 
Introduction To Map Reduce
Introduction To Map ReduceIntroduction To Map Reduce
Introduction To Map Reduce
rantav
 

Viewers also liked (18)

Introduction to MapReduce
Introduction to MapReduceIntroduction to MapReduce
Introduction to MapReduce
 
Stock Analyzer Hadoop MapReduce Implementation
Stock Analyzer Hadoop MapReduce ImplementationStock Analyzer Hadoop MapReduce Implementation
Stock Analyzer Hadoop MapReduce Implementation
 
Introduction to Hadoop Administration
Introduction to Hadoop AdministrationIntroduction to Hadoop Administration
Introduction to Hadoop Administration
 
Hadoop 2 - Going beyond MapReduce
Hadoop 2 - Going beyond MapReduceHadoop 2 - Going beyond MapReduce
Hadoop 2 - Going beyond MapReduce
 
Hadoop hbase mapreduce
Hadoop hbase mapreduceHadoop hbase mapreduce
Hadoop hbase mapreduce
 
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
 
An Introduction to MapReduce
An Introduction to MapReduceAn Introduction to MapReduce
An Introduction to MapReduce
 
MapReduce Design Patterns
MapReduce Design PatternsMapReduce Design Patterns
MapReduce Design Patterns
 
Map reduce: beyond word count
Map reduce: beyond word countMap reduce: beyond word count
Map reduce: beyond word count
 
Mapreduce Algorithms
Mapreduce AlgorithmsMapreduce Algorithms
Mapreduce Algorithms
 
Intro to HDFS and MapReduce
Intro to HDFS and MapReduceIntro to HDFS and MapReduce
Intro to HDFS and MapReduce
 
Introduction to MapReduce | MapReduce Architecture | MapReduce Fundamentals
Introduction to MapReduce | MapReduce Architecture | MapReduce FundamentalsIntroduction to MapReduce | MapReduce Architecture | MapReduce Fundamentals
Introduction to MapReduce | MapReduce Architecture | MapReduce Fundamentals
 
Hadoop Real Life Use Case & MapReduce Details
Hadoop Real Life Use Case & MapReduce DetailsHadoop Real Life Use Case & MapReduce Details
Hadoop Real Life Use Case & MapReduce Details
 
Hadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsHadoop MapReduce Fundamentals
Hadoop MapReduce Fundamentals
 
MapReduce in Simple Terms
MapReduce in Simple TermsMapReduce in Simple Terms
MapReduce in Simple Terms
 
Big data, map reduce and beyond
Big data, map reduce and beyondBig data, map reduce and beyond
Big data, map reduce and beyond
 
Introduction To Map Reduce
Introduction To Map ReduceIntroduction To Map Reduce
Introduction To Map Reduce
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture
 

Similar to Introduction to MapReduce & hadoop

Social Data and Log Analysis Using MongoDB
Social Data and Log Analysis Using MongoDBSocial Data and Log Analysis Using MongoDB
Social Data and Log Analysis Using MongoDB
Takahiro Inoue
 
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Deanna Kosaraju
 
AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)
Paul Chao
 
Behm Shah Pagerank
Behm Shah PagerankBehm Shah Pagerank
Behm Shah Pagerank
gothicane
 

Similar to Introduction to MapReduce & hadoop (20)

Social Data and Log Analysis Using MongoDB
Social Data and Log Analysis Using MongoDBSocial Data and Log Analysis Using MongoDB
Social Data and Log Analysis Using MongoDB
 
Mapreduce
MapreduceMapreduce
Mapreduce
 
Hadoop
HadoopHadoop
Hadoop
 
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
 
Hadoop
HadoopHadoop
Hadoop
 
Pig Experience
Pig ExperiencePig Experience
Pig Experience
 
Big data week presentation
Big data week presentationBig data week presentation
Big data week presentation
 
AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)
 
L19CloudMapReduce introduction for cloud computing .ppt
L19CloudMapReduce introduction for cloud computing .pptL19CloudMapReduce introduction for cloud computing .ppt
L19CloudMapReduce introduction for cloud computing .ppt
 
Lecture 2 part 3
Lecture 2 part 3Lecture 2 part 3
Lecture 2 part 3
 
MapReduce basics
MapReduce basicsMapReduce basics
MapReduce basics
 
Big Data Processing
Big Data ProcessingBig Data Processing
Big Data Processing
 
Hadoop Architecture
Hadoop ArchitectureHadoop Architecture
Hadoop Architecture
 
Hadoop trainting in hyderabad@kelly technologies
Hadoop trainting in hyderabad@kelly technologiesHadoop trainting in hyderabad@kelly technologies
Hadoop trainting in hyderabad@kelly technologies
 
writing Hadoop Map Reduce programs
writing Hadoop Map Reduce programswriting Hadoop Map Reduce programs
writing Hadoop Map Reduce programs
 
Map reduce
Map reduceMap reduce
Map reduce
 
Hadoop trainting-in-hyderabad@kelly technologies
Hadoop trainting-in-hyderabad@kelly technologiesHadoop trainting-in-hyderabad@kelly technologies
Hadoop trainting-in-hyderabad@kelly technologies
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
A performance analysis of OpenStack Cloud vs Real System on Hadoop Clusters
A performance analysis of OpenStack Cloud vs Real System on Hadoop ClustersA performance analysis of OpenStack Cloud vs Real System on Hadoop Clusters
A performance analysis of OpenStack Cloud vs Real System on Hadoop Clusters
 
Behm Shah Pagerank
Behm Shah PagerankBehm Shah Pagerank
Behm Shah Pagerank
 

More from Colin Su

Introduction to Google Compute Engine
Introduction to Google Compute EngineIntroduction to Google Compute Engine
Introduction to Google Compute Engine
Colin Su
 
Functional programming in Python
Functional programming in PythonFunctional programming in Python
Functional programming in Python
Colin Su
 
Web2py Code Lab
Web2py Code LabWeb2py Code Lab
Web2py Code Lab
Colin Su
 
How to Speak Charms Like a Wizard
How to Speak Charms Like a WizardHow to Speak Charms Like a Wizard
How to Speak Charms Like a Wizard
Colin Su
 
房地產報告
房地產報告房地產報告
房地產報告
Colin Su
 
Facebook Python SDK - Introduction
Facebook Python SDK - IntroductionFacebook Python SDK - Introduction
Facebook Python SDK - Introduction
Colin Su
 

More from Colin Su (20)

Introduction to Google Compute Engine
Introduction to Google Compute EngineIntroduction to Google Compute Engine
Introduction to Google Compute Engine
 
Introduction to Google Cloud Endpoints: Speed Up Your API Development
Introduction to Google Cloud Endpoints: Speed Up Your API DevelopmentIntroduction to Google Cloud Endpoints: Speed Up Your API Development
Introduction to Google Cloud Endpoints: Speed Up Your API Development
 
Functional programming in Python
Functional programming in PythonFunctional programming in Python
Functional programming in Python
 
Web2py Code Lab
Web2py Code LabWeb2py Code Lab
Web2py Code Lab
 
A Tour of Google Cloud Platform
A Tour of Google Cloud PlatformA Tour of Google Cloud Platform
A Tour of Google Cloud Platform
 
Introduction to Facebook JavaScript & Python SDK
Introduction to Facebook JavaScript & Python SDKIntroduction to Facebook JavaScript & Python SDK
Introduction to Facebook JavaScript & Python SDK
 
Introduction to Google App Engine
Introduction to Google App EngineIntroduction to Google App Engine
Introduction to Google App Engine
 
Django Deployer
Django DeployerDjango Deployer
Django Deployer
 
Introduction to Google - the most natural way to learn English (English Speech)
Introduction to Google - the most natural way to learn English (English Speech)Introduction to Google - the most natural way to learn English (English Speech)
Introduction to Google - the most natural way to learn English (English Speech)
 
How to Speak Charms Like a Wizard
How to Speak Charms Like a WizardHow to Speak Charms Like a Wizard
How to Speak Charms Like a Wizard
 
房地產報告
房地產報告房地產報告
房地產報告
 
Introduction to Git
Introduction to GitIntroduction to Git
Introduction to Git
 
Introduction to Facebook Javascript SDK (NEW)
Introduction to Facebook Javascript SDK (NEW)Introduction to Facebook Javascript SDK (NEW)
Introduction to Facebook Javascript SDK (NEW)
 
Introduction to Facebook Python API
Introduction to Facebook Python APIIntroduction to Facebook Python API
Introduction to Facebook Python API
 
Facebook Python SDK - Introduction
Facebook Python SDK - IntroductionFacebook Python SDK - Introduction
Facebook Python SDK - Introduction
 
Web Programming - 1st TA Session
Web Programming - 1st TA SessionWeb Programming - 1st TA Session
Web Programming - 1st TA Session
 
Nested List Comprehension and Binary Search
Nested List Comprehension and Binary SearchNested List Comprehension and Binary Search
Nested List Comprehension and Binary Search
 
Python-List comprehension
Python-List comprehensionPython-List comprehension
Python-List comprehension
 
Python-FileIO
Python-FileIOPython-FileIO
Python-FileIO
 
Python Dictionary
Python DictionaryPython Dictionary
Python Dictionary
 

Recently uploaded

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 

Recently uploaded (20)

2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 

Introduction to MapReduce & hadoop

  • 1. Introduction to Hadoop and MapReduce Colin Su, Tagtoo
  • 3. Advertisement System Architecture (future) • Grid • Ad Server • Data Highway • Steaming Computing
  • 4. Grid • Core: • Data mining • Machine Learning • Collecting data from users, logs and calculate out the strategy • Sort our data in a proper form, them we could use it anytime Data -> Information
  • 5. Ad Server • Ranking • According the “information” in Grid, decide which AD should be advertised • show proper ads to website visitors
  • 6. Data Highway • Transfer your data to the proper place
  • 7. Stream Computing • Core: • logging • feedback • anti-cheating • pricing • post-process everything thrown out from Ad Server, and feedback useful information to Grid • be the entrance of advertisement system
  • 8. Hadoop • an open-source software framework for data scientists • derives from Google’s MapReduce and Google File System (GFS) papers • written in Java • could be divided in to 2 components: • MapReduce • HDFS (Hadoop distributed file system) • a yellow elephant
  • 9. Why Hadoop? • moving computation is much cheaper and easier than moving data • “Big Data”, the amount of data becomes too large, need a effective way to manage it • so does computation • high fault-tolerance • developed by Yahoo!
  • 10. MapReduce • a programming model for processing “large data sets” with a “parallel, distributed” algorithm on a cluster • different from map/reduce, the conception of functional programming, but actually they have the same idea, “divide and conquer” • proposed by Google
  • 11. Functional “map/reduce” • map()/reduce() in Python • map(function(elem), list) -> list • reduce(function(elem1, elem2), list) -> single result • e.g. • map(lambda x: x*2, [1,2,3,4]) => [2,4,6,8] • reduce(lambda x,y: x+y, [1,2,3,4]) => 10
  • 12. Parallel “MapReduce” 5 Steps • prepare the map() input for mappers • mappers run the map() code -> generated intermediate pairs • dispatch intermediate pairs to reducers • reducers run the reduce() code, aggregate the results • prepare output from the result of reduce()
  • 13. Example of “MapReduce” Word Count map() reduce()
  • 14. Example of “MapReduce” Word Count • Original Input Apple Orange Mongo Orange Grapes Plum ...
  • 15. Example of “MapReduce” Word Count • Prepare data for mappers Apple Orange Mongo Orange Grapes Plum ...
  • 16. Example of “MapReduce” Word Count • map() to useful record (Apple, 1) Apple Orange Mongo (Orange, 1) (Mongo, 1) Intermediate key/value pair
  • 17. Example of “MapReduce” Word Count • sort and shuffle (Apple, 1) (Mongo, 1) (Apple, 1) (Orange, 1) Reducer (Apple, 1) (Mongo, 1) (Apple, 1) (Orange, 1) (Orange, 1) Shuffle to Reducers (Orange, 1) (Orange, 1) (Apple, 1) (Mongo, 1) (Apple, 1) (Mongo, 1) unsorted Sorted (Orange, 1) Reducer (Mongo, 1) (Mongo, 1) Reducer
  • 18. Example of “MapReduce” Word Count • Reduce() (Apple, 1) (Apple, 1) (Apple, 2) Reducer (Orange, 1) (Orange, 1) (Orange, 1) Reducer (Orange, 3)
  • 19. Example of “MapReduce” Word Count • Generate Output (Apple, 2) (Orange, 3) (Grapes, 1) (Plum, 5) Apple 2 Orange 3 Grapes 1 Plum 5 WordCount.txt
  • 20. Hadoop Infrastructure • Pig: Programming Language for MapReduce • Thrift: cross-language communication, just like Google’s ProtoBuffer • Zookeeper: cluster management Hadoop Hadoop Other Services Thrift MapReduce Pig Hadoop HDFS Hadoop Hadoop ZooKeeper