Scaling Up Machine Learning: How to Benchmark GraphLab Create on Huge Datasets

Turi, Inc.

Presented by Guy Rapaport

Dato Confidential
GraphLab Create Benchmarks
April 21, 2016
Guy Rapaport, Data Scientist, Dato EMEA
guy@dato.com

Dato: We Intelligent Applications

Some of our Customers

Business must be intelligent
Machine learning applications
• Recommenders
• Fraud detection
• Ad targeting
• Financial models
• Personalized medicine
• Churn prediction
• Smart UX (video & text)
• Personal assistants
• IoT
• Social networks
• Log analysis
Last decade: Data management
Last 5 years: Traditional analytics
Now: Intelligent apps

Example Intelligent Applications
- images
- text
- graphs
- tabular data

Creating a model pipeline
exploration
data
modeling

Creating a model pipeline
Ingest Transform Model Deploy
Unstructured Data

GraphLab Create in a Line
“A general-purpose machine learning Python library that
scales on large datasets.”
• General purpose: classification, graph analytics…
• Python API on top, C++ open-source engine below.
• Scales vertically: more CPUs, RAM and faster disks.
• Large datasets: disk bound, not RAM bound.
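To make "disk bound, not RAM bound" concrete, here is a minimal Python sketch of out-of-core processing: a column mean computed by streaming a CSV file from disk one row at a time, so memory use stays constant however large the file is. This is only an illustration of the general idea, not GraphLab Create's engine; the file name, the column name and both helper functions are hypothetical.

```python
import csv
import os
import tempfile


def write_sample_csv(path, n_rows):
    """Write a small CSV with a single numeric column named 'value'."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["value"])
        for i in range(n_rows):
            writer.writerow([i])


def streaming_mean(path, column):
    """Compute a column mean while holding only one row in memory at a time."""
    total = 0.0
    count = 0
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            total += float(row[column])
            count += 1
    return total / count if count else float("nan")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        sample = os.path.join(tmp, "sample.csv")  # hypothetical file name
        write_sample_csv(sample, 1000)
        print(streaming_mean(sample, "value"))
```

With this access pattern the working set is one row, so throughput is limited by how fast the disk can deliver data rather than by how much RAM the machine has.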

What will we cover today?
1. Instantiate a machine in the Amazon EC2 cloud
• r3.8xlarge instance
• 32 cores, 244 GB of RAM, 2 SSDs of 320 GB each
2. Run PageRank on a large graph
• CommonCrawl 2012 dataset – the internet as a graph
• 3.5 billion nodes, 128 billion links
3. Run Gradient Boosted Trees on a large dataset
• Criteo 1TB Click Logs Dataset
• 4.3 billion rows, 39 features (13 numerical, 26 categorical)
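For readers who have not met PageRank before, the following is a toy power-iteration sketch in plain Python. It only shows what the algorithm computes on a four-page graph; it is not GraphLab Create's implementation and is nowhere near the scale of the benchmark. The function name, the damping value of 0.85 and the fixed iteration count are assumptions made for illustration.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over a list of (source, target) edges."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by pages without out-links is spread evenly over all pages.
        dangling = sum(rank[v] for v in nodes if not out_links[v])
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {v: base for v in nodes}
        # Every page passes an equal share of its rank along each out-link.
        for src in nodes:
            if out_links[src]:
                share = damping * rank[src] / len(out_links[src])
                for dst in out_links[src]:
                    new_rank[dst] += share
        rank = new_rank
    return rank


if __name__ == "__main__":
    toy_web = [("a", "b"), ("b", "c"), ("c", "a"), ("d", "a")]
    for page, score in sorted(pagerank(toy_web).items()):
        print(page, round(score, 4))
```

In this toy graph page "d" has no incoming links, so it ends up with the lowest score, while "a" collects rank from both "c" and "d".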

What will you be able to do afterwards?
Instantiate an EC2 instance, grab our benchmark
notebooks, and try it yourself!
Everything is publicly available on GitHub:
https://github.com/guy4261/glc_pagerank_benchmark

Screen Primer
Command                               Action
sudo apt-get install -y screen        Install screen
screen -S my_session                  Start a session named my_session
PS1='\u@\h(${STY}:${WINDOW}):\w$'     Change your screen prompt (helpful)
# CTRL+A, then D                      Key combination to detach
screen -ls                            List all open screens
screen -r my_session                  Reattach to your screen
exit                                  Exit the session and terminate the screen
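Before starting a benchmark that runs for hours, it can help to confirm that the process really is inside a screen session. The sketch below reads the same STY and WINDOW environment variables that the prompt above displays; GNU screen sets both for processes it starts. The function name and the messages are hypothetical.

```python
import os


def screen_session_label(environ=None):
    """Return 'STY:WINDOW' when running inside GNU screen, else None."""
    env = os.environ if environ is None else environ
    sty = env.get("STY")
    if not sty:
        return None
    return "{}:{}".format(sty, env.get("WINDOW", "?"))


if __name__ == "__main__":
    label = screen_session_label()
    if label is None:
        print("Not inside a screen session; a dropped SSH connection may end this job.")
    else:
        print("Running inside screen session " + label)
```

Calling this at the top of a benchmark notebook or script gives an early warning instead of a lost run.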

Confidential – Dato internal use only. ©2015 Dato, Inc.
Questions?
“For the purpose of learning the Answer to the
Ultimate Question of Life, The Universe, and Everything,
the supercomputer Deep Thought was specially built.
It takes Deep Thought 7½ million years to compute and check the
answer, which turns out to be 42. Deep Thought points out that
the answer seems meaningless because
the beings who instructed it
never actually knew what the Question was.”
- Douglas Adams, “The Hitchhiker’s Guide to the Galaxy”

Our Machine Learning Specialization
on Coursera
https://www.coursera.org/learn/ml-foundations

Thanks!
Install using pip: $ pip install -U graphlab-create
Dato Launcher Download:
https://dato.com/download/
The benchmarks on GitHub:
https://github.com/guy4261/glc_pagerank_benchmark
Coursera Course:
https://www.coursera.org/learn/ml-foundations
Reach out: guy@dato.com

Editor's Notes

• The team, the history of the product: the company began 7 years ago at Carnegie Mellon University as an open-source project. It is now a company with 50+ employees and a recently opened EMEA office here in Israel.
• Customers: yes, we are selling (100+ paying customers, brand names).
• Intelligent apps are predictive: from analytics (queries over known data) to predictive (discovering the unknown).
• Supported data types.
• Creating a model pipeline.
• Steps in the model pipeline creation.
• From inspiration to production. (End of corporate slides.)
• GLC in a line.
• We're going to see it all today. Really 3,443,082,324 vertices and 128,736,914,167 edges in CommonCrawl 2012. Really 4,373,472,329 rows in Criteo.
• Afterwards, run the benchmarks: switch to a screen share of creating the EC2 instance, followed by the benchmarks.
• Questions.
• PS1 compliments of http://unix.stackexchange.com/a/20991
• Check our Coursera course.
• Thanks.
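The exact counts quoted in the notes lend themselves to a quick back-of-the-envelope check. The sketch below takes those three figures and the 39-feature count from the slides as given and derives the average number of links per vertex and the total number of feature values in the Criteo table. The variable and function names are hypothetical.

```python
# Exact counts as quoted in the editor's notes.
VERTICES = 3443082324
EDGES = 128736914167
CRITEO_ROWS = 4373472329
FEATURES = 39  # 13 numerical + 26 categorical, from the slides


def in_billions(count):
    """Express a raw count in billions."""
    return count / 1e9


# Average number of links per vertex in the graph.
avg_links_per_vertex = float(EDGES) / VERTICES

# Total number of feature values across the whole Criteo table.
criteo_cells = CRITEO_ROWS * FEATURES

if __name__ == "__main__":
    print("vertices: %.2f billion" % in_billions(VERTICES))
    print("edges: %.2f billion" % in_billions(EDGES))
    print("average links per vertex: %.1f" % avg_links_per_vertex)
    print("Criteo feature values: %.1f billion" % in_billions(criteo_cells))
```

By these figures the graph averages about 37 links per vertex, and the Criteo table holds roughly 170 billion feature values.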
