Rise of the Hadoop Ecosystem and Key Tools for Real-Time Analytics

This document discusses the rise of the Hadoop ecosystem. It outlines how the ecosystem has expanded from the original Hadoop components of HDFS for storage and MapReduce for distributed computation. New frameworks have emerged that allow for real-time queries, updates, and machine learning on big data. These include Spark, Storm, Drill, and streaming engines. The ecosystem is now a complex network of interoperable tools for storage, computation, analytics and machine learning on large datasets.

Florian Douetteau
CEO Dataiku

1. The Rise of the Hadoop Ecosystem

2. Florian Douetteau, CEO, Dataiku. Data Science Studio: all-in-one data preparation, modeling, statistics and visualization.

3. DRIVERS FOR THE NEW “REAL-TIME” HADOOP ECOSYSTEM / KEY TOOLS AND FRAMEWORKS TO BE AWARE OF

4. RAM - CPU - DISK

5. Price per GB, 2000 vs. 2013. Memory: 1000$ / GB to 6$ / GB (memory divided by 150). Disk: $10 / GB to $0.06 / GB (disk cost divided by 250). From “MAP REDUCE times” to “HACK REDUCE times”.
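
As a rough check of the price figures on this slide, the sketch below recomputes the drop in cost per GB. It assumes that the 1000$ and 6$ figures refer to memory and the $10 and $0.06 figures to disk; the dictionary layout and the function name `price_drop_factor` are illustrative only.

```python
# Prices per GB in US dollars, as read from the slide.
# Assumption: the first pair is memory, the second pair is disk.
PRICES = {
    "memory": {"2000": 1000.0, "2013": 6.0},
    "disk": {"2000": 10.0, "2013": 0.06},
}


def price_drop_factor(resource: str) -> float:
    """Return how many times cheaper one GB became between 2000 and 2013."""
    prices = PRICES[resource]
    return prices["2000"] / prices["2013"]


for resource in PRICES:
    print(f"{resource}: divided by about {price_drop_factor(resource):.0f}")
```

Under that assumption both resources come out roughly two orders of magnitude cheaper, which is the same order of magnitude as the factors quoted on the slide.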

7. WHOLE DATA / REFINED DATA

8. NEEDLE IN HAYSTACK?

9. REFINE BEFORE USE

10. Web site: $1B revenue per year; 10 million unique visitors per month; 100 million orders / actions per day. 10 TB raw data, 1 TB refined data.

11. FITS IN MEMORY 1TB
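
To make the "fits in memory" point concrete, here is a small sketch that prices 1 TB of RAM at the per-GB memory prices quoted on slide 5. It assumes a decimal terabyte (1000 GB) and that the 1000$ and 6$ figures are memory prices; `ram_cost` is a hypothetical helper name.

```python
GB_PER_TB = 1000  # assumption: decimal terabyte, the slides do not specify

# Memory price per GB in US dollars (assumed to be the slide 5 memory figures).
RAM_PRICE_PER_GB = {2000: 1000.0, 2013: 6.0}


def ram_cost(dataset_tb: float, year: int) -> float:
    """Cost in dollars of enough RAM to hold dataset_tb terabytes in a given year."""
    return dataset_tb * GB_PER_TB * RAM_PRICE_PER_GB[year]


for year in (2000, 2013):
    print(year, f"${ram_cost(1, year):,.0f}")
```

With these inputs, holding the 1 TB of refined data in memory goes from a seven-figure purchase in 2000 to a four-figure one in 2013.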

12. Google. 1st circle, open source: Yahoo, IBM, LinkedIn, Facebook. 2nd circle: Stanford, Berkeley, startups.

13. $1B per year invested in Big Data tech. [Chart: funding rounds by year, 2009 to 2013, ranging from 1.7m$ to 100m$; also labeled 223m$ and 301m$.]

15. Hadoop = HDFS + MapReduce: (1) safe large storage (HDFS); (2) distributed computation paradigm (MapReduce); (3) resilient long jobs; (4) disk-CPU locality-aware resource allocation.
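
The MapReduce paradigm named on this slide can be sketched on a single machine as three steps: map each record to key/value pairs, group the pairs by key, then reduce each group. This is a minimal illustration in plain Python, not Hadoop's API; the function names are made up for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit (word, 1) for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


print(word_count(["big data", "Big tools"]))
```

In a real cluster the map and reduce calls run on many machines and the shuffle moves data between them; the sketch only shows the shape of the computation.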

16. LOVELY TANGLED TOGETHER

18. THE NEW ECOSYSTEM: HDFS, YARN, MapReduce (provider 1), other cluster providers …

19. REALLY FASTER?

20. REAL-TIME QUERIES REAL-TIME UPDATES FAST MACHINE LEARNING

21. REAL-TIME QUERIES REAL-TIME UPDATES FAST MACHINE LEARNING

22. DEVELOPER CAN WAIT

23. BUSINESS WON’T WAIT

24. Not all queries are born equal

25. Impala: MPP-database-like performance for Hadoop. Created in 2012 by Cloudera; 100x the performance of Hive (for certain queries).

26. Apache Drill: extensible architecture for SQL querying. Started in 2013; Apache incubated project; Lucidworks; MapR; ElasticSearch; …; alpha status. Open architecture for supporting SQL-like queries to various data sources: Cassandra, MongoDB, HDFS, HBase.

27. REAL-TIME QUERIES REAL-TIME UPDATES FAST MACHINE LEARNING

29. Update the model once per week using the whole history. Apply the model for each user using the very last events. Real-time navigation, real-time recommendation.
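
The split described on this slide, a slow batch step over the whole history and a fast per-user step over the latest events, might look like the sketch below. The "viewed next" co-occurrence model is an assumption chosen for brevity, not the model used in the talk, and both function names are hypothetical.

```python
from collections import Counter, defaultdict


def train_weekly_model(history):
    """Batch step: over all past sessions, count which item follows which."""
    model = defaultdict(Counter)
    for session in history:
        for current_item, next_item in zip(session, session[1:]):
            model[current_item][next_item] += 1
    return model


def recommend(model, last_events, k=2):
    """Real-time step: score candidates using only the user's latest events."""
    scores = Counter()
    for item in last_events:
        scores.update(model.get(item, {}))
    for item in last_events:
        scores.pop(item, None)  # do not recommend what was just seen
    return [item for item, _ in scores.most_common(k)]


history = [["a", "b", "c"], ["a", "b", "d"], ["a", "c"], ["b", "c"]]
model = train_weekly_model(history)   # run once per week
print(recommend(model, ["a", "b"]))   # run on every page view
```

The expensive pass over the history happens rarely, while each recommendation only touches a handful of counters, which is what makes the real-time part cheap.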

30. STORM: reliable distributed real-time computations. Connects to a variety of data sources (HDFS, RabbitMQ, JMS, etc.); runs computations in Java (native) or Python, Ruby, Perl, …; guarantees that events are processed; distributes the workload.
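
The guarantee that events are processed can be illustrated with a toy acknowledge-and-replay loop: an event that fails is put back on the queue until it succeeds or runs out of attempts. This is a single-process sketch in plain Python and does not use Storm's API; `FlakyBolt`, `run_topology` and `max_attempts` are hypothetical names.

```python
from collections import deque


class FlakyBolt:
    """Hypothetical processing step that fails the first time it sees each event."""

    def __init__(self):
        self.seen = set()
        self.totals = {}

    def process(self, event):
        key, amount = event
        if event not in self.seen:
            self.seen.add(event)
            raise RuntimeError("transient failure")
        self.totals[key] = self.totals.get(key, 0) + amount


def run_topology(events, bolt, max_attempts=3):
    """Feed events to the bolt, replaying any event whose processing failed."""
    pending = deque((event, 1) for event in events)
    failed = []
    while pending:
        event, attempt = pending.popleft()
        try:
            bolt.process(event)  # returning normally counts as an acknowledgement
        except RuntimeError:
            if attempt < max_attempts:
                pending.append((event, attempt + 1))  # replay later
            else:
                failed.append(event)
    return failed


bolt = FlakyBolt()
lost = run_topology([("user1", 5), ("user2", 3), ("user1", 2)], bolt)
print(bolt.totals, lost)
```

Replaying means an event may be attempted more than once, so the processing step has to tolerate retries; here the failing attempt changes no totals, which keeps the sums correct.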

31. Summingbird: write MapReduce-like programs and execute them in batch, real-time, or hybrid batch / real-time mode. Open sourced by Twitter in 2013; built on top of Storm (and Cascading); programs are written in Scala.
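
One way to picture a MapReduce-like program that runs in batch, real-time or hybrid mode is to define the map step and an associative merge once, then reuse them in each mode. The sketch below does this in plain Python with `Counter` addition as the merge; it is an illustration of the idea only, not the Scala API of the tool on this slide, and all names are hypothetical.

```python
from collections import Counter


def to_counts(event):
    """Map step shared by every mode: one page view becomes a partial count."""
    return Counter({event["page"]: 1})


def run_batch(events):
    """Batch mode: fold the whole dataset at once."""
    total = Counter()
    for event in events:
        total += to_counts(event)
    return total


class RealTimeJob:
    """Real-time mode: keep a running state and merge one event at a time."""

    def __init__(self, initial=None):
        self.state = Counter(initial or {})

    def on_event(self, event):
        self.state += to_counts(event)
        return self.state


def run_hybrid(historical_events, live_events):
    """Hybrid mode: start from the batch result, then keep merging live events."""
    job = RealTimeJob(initial=run_batch(historical_events))
    for event in live_events:
        job.on_event(event)
    return job.state


history = [{"page": "/home"}, {"page": "/cart"}]
live = [{"page": "/home"}]
print(run_hybrid(history, live))
```

Because the merge is associative, the hybrid result equals a batch run over all events, which is what allows the same program to be executed in either mode.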

32. REAL-TIME QUERIES REAL-TIME UPDATES FAST MACHINE LEARNING

33. GOOD PUPILS ITERATE

34. … Stochastic Gradient Descent: ITERATE; K-Means: ITERATE; PageRank: ITERATE; …
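
To show what "iterate" means for one of the algorithms listed here, the sketch below runs PageRank by power iteration on a three-page graph. The graph, the damping factor of 0.85 and the iteration count are assumptions for the example, and the code assumes every page has at least one outgoing link to another listed page.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration: every pass re-reads the whole link structure."""
    pages = list(links)
    n = len(pages)
    ranks = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_ranks = {page: (1.0 - damping) / n for page in pages}
        for page, outgoing in links.items():
            share = damping * ranks[page] / len(outgoing)
            for target in outgoing:
                new_ranks[target] += share
        ranks = new_ranks
    return ranks


# Hypothetical graph: page -> pages it links to.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
for page, rank in sorted(pagerank(links).items()):
    print(page, round(rank, 3))
```

Each iteration scans the same data again, so a framework that rereads it from disk on every pass pays that cost dozens of times, whereas keeping it in memory pays it once.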

35. GraphLab: “Graph” analytics in memory. Created at Carnegie Mellon in 2009; generic graph traversal framework; packaged machine learning (recommender systems, graph analytics, clustering); easy Python integration.

36. H2O: in-memory distributed prediction engine. Machine learning: classification, regression, clustering; easy R / Python integration.

37. Spark: real-time resilient distributed memory framework. Abstraction with any DAG operation on data: filter, map, reduce, cache.
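
The four operations listed on this slide can be sketched as a lazy pipeline: filter and map only record what to do, cache marks a result to keep in memory, and reduce triggers the computation. The `LazyDataset` class below is a single-machine toy written for this illustration; it is not the API of the framework on the slide.

```python
from functools import reduce as fold


class LazyDataset:
    """Toy stand-in for an in-memory dataset built from chained operations."""

    def __init__(self, compute):
        self._compute = compute  # zero-argument function producing an iterator
        self._cache = None
        self._use_cache = False

    def _iterate(self):
        if self._cache is not None:
            return iter(self._cache)
        if self._use_cache:
            self._cache = list(self._compute())  # materialize once, keep in memory
            return iter(self._cache)
        return self._compute()

    def map(self, fn):
        return LazyDataset(lambda: (fn(x) for x in self._iterate()))

    def filter(self, predicate):
        return LazyDataset(lambda: (x for x in self._iterate() if predicate(x)))

    def cache(self):
        self._use_cache = True
        return self

    def reduce(self, fn):
        return fold(fn, self._iterate())  # the only step that runs the pipeline


reads = []


def read_source():
    reads.append(1)  # count how many times the source is scanned
    return iter(range(1, 11))


evens_squared = (
    LazyDataset(read_source).filter(lambda x: x % 2 == 0).map(lambda x: x * x).cache()
)
total = evens_squared.reduce(lambda a, b: a + b)  # first action scans the source
largest = evens_squared.reduce(max)               # second action uses the cache
print(total, largest, len(reads))
```

The second reduce never touches the source again, which is the property that makes repeated passes over the same data (as in iterative algorithms) cheap.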

38. SPARK: Shark (real-time queries), Streaming (real-time updates), MLbase (in-memory learning).

39. HDFS, YARN, MapReduce, Spark, GraphLab, H2O, Streaming, MLbase, Shark, Pig, Hive, Cascading, Storm, Drill, Impala, other storage.

40. dataiku.com. Dataiku, stand A4: demo of Data Science Studio. Questions now or later: florian.douetteau@dataiku.com

Editor's Notes

EVERYTHING IS ABOUT THE PRICE / PERFORMANCE RATIO OF MEMORY, CPU AND DISK
