4. Dato Confidential
Business must be intelligent
Machine learning applications
• Recommenders
• Fraud detection
• Ad targeting
• Financial models
• Personalized medicine
• Churn prediction
• Smart UX (video & text)
• Personal assistants
• IoT
• Social networks
• Log analysis
Last decade: Data management
Last 5 years: Traditional analytics
Now: Intelligent apps
?
9. GraphLab Create in a Line
“A general-purpose machine learning Python library that
scales on large datasets.”
General purpose: classification, graph analytics…
Python API on top, C++ open-source engine below.
Scales vertically: more CPUs, RAM and faster disks.
Large datasets: disk bound, not RAM bound.
10. What will we cover today?
1. Instantiate a machine in the Amazon EC2 cloud
• r3.8xlarge instance
• 32 cores, 244 GB of RAM, 2 SSDs of 320 GB each
2. Run PageRank on a large graph
• CommonCrawl 2012 dataset – the internet as a graph
• 3.5 billion nodes, 128 billion links
3. Run Gradient Boosted Trees on a large dataset
• Criteo 1TB Click Logs Dataset
• 4.3 billion rows, 39 features (13 numerical, 26 categorical)
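To make the PageRank item above concrete, here is a minimal pure-Python sketch of PageRank by power iteration on a toy graph. It is not GraphLab Create's implementation or API; the function name `pagerank`, the damping factor of 0.85, the iteration count, and the four-page toy graph are all assumptions chosen for illustration.

```python
def pagerank(edges, num_nodes, damping=0.85, iterations=50):
    """Power-iteration PageRank over a list of (source, target) pairs."""
    # Count outgoing links per node so each node's rank can be split evenly.
    out_degree = [0] * num_nodes
    for src, _ in edges:
        out_degree[src] += 1

    # Start from a uniform distribution.
    rank = [1.0 / num_nodes] * num_nodes
    for _ in range(iterations):
        # Rank held by nodes without outgoing links is spread over all nodes.
        dangling = sum(rank[n] for n in range(num_nodes) if out_degree[n] == 0)
        base = (1.0 - damping) / num_nodes + damping * dangling / num_nodes
        new_rank = [base] * num_nodes
        # Each node passes an equal share of its rank along every outgoing link.
        for src, dst in edges:
            new_rank[dst] += damping * rank[src] / out_degree[src]
        rank = new_rank
    return rank


# Hypothetical four-page web: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0, 3 -> 2
toy_edges = [(0, 1), (0, 2), (1, 2), (2, 0), (3, 2)]
toy_ranks = pagerank(toy_edges, num_nodes=4)
for node, score in enumerate(toy_ranks):
    print(node, round(score, 4))
```

Page 2 receives links from three pages and ends up with the highest score, while page 3, which nothing links to, ends up with the lowest. Each iteration is one pass over the edge list, so on a graph with billions of links the cost is dominated by streaming edges from storage.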
11. What will you be able to do afterwards?
Instantiate an EC2 instance, grab our benchmark
notebooks, and try it yourself!
Everything is publicly available on GitHub:
https://github.com/guy4261/glc_pagerank_benchmark
12. Screen Primer
Command                                 Action
sudo apt-get install -y screen          Install screen
screen -S my_session                    Start a session named my_session
PS1='\u@\h(${STY}:${WINDOW}):\w$'       Change your screen prompt (helpful)
# CTRL+A, then D                        Key combination to detach
screen -ls                              List all open screens
screen -r my_session                    Reattach to your screen
exit                                    Exit the session and terminate the screen
The company began 7 years ago at Carnegie Mellon University as an open-source project.
It now has 50+ employees and a recently opened EMEA office here in Israel.
Customers
Yes, we are selling
(100+ paying customers, brand names)
Intelligent apps are predictive
From analytics (queries over known data) to predictive (discovering the unknown).
Supported data types
Creating a model pipeline
Steps in the model pipeline creation
From inspiration to production
# end of corporate slides
GLC in a line
We’re gonna see it all today
Exact counts: 3,443,082,324 vertices and 128,736,914,167 edges in CommonCrawl 2012.
Exact count: 4,373,472,329 rows in Criteo.
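A back-of-the-envelope sketch of why these datasets are disk bound rather than RAM bound on the benchmark machine. The byte sizes are assumptions for illustration only (two 8-byte vertex IDs per edge, 8 bytes per feature value, and the instance's 244 GB of RAM read as GiB); the real on-disk formats are likely compressed and will differ.

```python
# Exact counts quoted in the notes.
EDGES = 128_736_914_167      # CommonCrawl 2012 links
ROWS = 4_373_472_329         # Criteo click-log rows
FEATURES = 39                # 13 numerical + 26 categorical

# Assumptions for illustration only, not the actual storage format.
BYTES_PER_EDGE = 16          # two 8-byte vertex IDs
BYTES_PER_VALUE = 8          # one 8-byte slot per feature value
RAM_BYTES = 244 * 2**30      # RAM on the r3.8xlarge instance

graph_bytes = EDGES * BYTES_PER_EDGE
criteo_bytes = ROWS * FEATURES * BYTES_PER_VALUE

for name, size in [("CommonCrawl edges", graph_bytes),
                   ("Criteo rows", criteo_bytes)]:
    print(f"{name}: {size / 1e12:.2f} TB raw, {size / RAM_BYTES:.1f}x RAM")
```

Under these assumptions both raw datasets come out several times larger than the machine's memory, so the work has to stream from disk rather than sit in RAM.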
Afterwards run the benchmarks
Switch to screen share of creating EC2 instance, followed by the benchmarks
Questions
PS1 compliments of http://unix.stackexchange.com/a/20991