Slides of the course on big data by C. Levallois from EMLYON Business School.
For business students. Check the online video connected with these slides.
-> Basic definition of Hadoop in relation to cloud computing and big data.
What is Hadoop?
1. MK99 – Big Data 1
Big data
&
cross-platform analytics
MOOC lectures Pr. Clement Levallois
2. MK99 – Big Data 2
Focus on “Hadoop”
• Frequently mentioned in relation to big data
• Definitions on offer are often vague, and the talk around it inflated
• This short video will clarify it.
3. MK99 – Big Data 3
• Note on the terminology:
– “computers” are called “servers” when they are just used
for computing / processing / storing data
– They have no screen, no mouse and no keyboard because
that’s not needed.
– But they are basically computers!
4. MK99 – Big Data 4
“Hadoop”
• Created by Yahoo! engineers in ~ 2005. Named after the toy elephant of one of the
engineers’ children.
• Made open source and now developed by the main open source developer
community, the Apache Software Foundation. That is why you sometimes see “Apache Hadoop”.
• In simple words:
– Hadoop is free, open source software.
– It serves to connect several servers, so that a single task can be accomplished in parallel on
them.
– So, with Hadoop and 5 servers, a data crunching task can finish roughly 5 times sooner than
if you had used just one server.
– That’s it!
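The idea above can be sketched in a few lines of Python. This is only an illustration of the principle, not how Hadoop itself works: local threads stand in for the 5 servers, and the data (`sales_records`) is invented.

```python
# Minimal sketch of "one task split across 5 servers, run in parallel".
# Threads stand in for servers; sales_records is made-up data.
from concurrent.futures import ThreadPoolExecutor

def crunch(chunk):
    # Each "server" processes only its own slice of the data.
    return sum(chunk)

sales_records = list(range(1_000_000))

# Split the data into 5 chunks, one per "server".
chunks = [sales_records[i::5] for i in range(5)]

# Run the 5 partial tasks at the same time.
with ThreadPoolExecutor(max_workers=5) as pool:
    partial_results = list(pool.map(crunch, chunks))

# Combine the partial results into the final answer.
total = sum(partial_results)
```

The final `total` is identical to what one server computing `sum(sales_records)` alone would produce; the split only changes how long you wait, not the result.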
5. MK99 – Big Data 5
Why are Hadoop, cloud computing and big data
often discussed together?
– Imagine that you are Walmart and want to compute something on your CRM: say, what are
the clients who are most profitable for each store, based on their purchase history.
– You will need many servers to store the data, and many servers to do the computations.
– Instead of purchasing a farm of servers for this (expensive! time consuming!), you can pay for
a service of cloud computing (such as Amazon AWS EC2) to rent servers just for this task,
– And install Hadoop on these servers to divide the task among all servers and get it to run in
parallel, speeding up computation times.
– You will get the results in minutes or hours, instead of days.
6. MK99 – Big Data 6
And map/reduce?
– “Map/reduce” is also an expression often discussed in relation to cloud computing and
Hadoop.
– This is a principle of programming perfected by engineers at Google around 2004, and made
open source.
– It is a principle that solves this problem: when my data is spread over 500 different servers,
how do I search through all of it? Checking the servers one by one (a sequential
search) would take a very long time. MapReduce dispatches the search to all servers at once,
making it roughly 500 times quicker than a sequential search.
– Any software can use this programming principle. MapReduce is at the heart of Hadoop,
which is one of the most popular pieces of software using it.
7. MK99 – Big Data 7
What is the business relevance
of Hadoop?
• Hadoop made it possible to process large amounts of data quickly, using free
software.
• It enables business models where intensive data crunching is necessary to create
value.
• Examples:
– Amazon computing book recommendations for you,
– Walmart offering personalized coupons,
– NYT showing personalized display ads,
– Waze (driving app) showing the state of traffic on your road in real time,
– your electricity utility company computing how much electricity should be generated at peak
hours.