This webinar showcases BigML's Summer Release, including a demonstration of how to quickly detect anomalous data with BigML.
BigML’s Summer release is headlined by Anomaly Detection, which can help automate a number of predictive tasks for fraud detection, security, quality control, diagnoses and more. Also included in the release (and demonstrated in the webinar) are support for model clusters, missing splits, client-side predictions and more.
For more information, visit: http://wp.me/p234d6-1RX
2. Today’s Webinar
• Speaker: Poul Petersen, CIO
• Moderator: Andrew Shikiar, VP Business Development
• Enter questions into the chat box; we’ll answer some via text and others at the end of the session
• For direct follow-up, email us at info@bigml.com
BigML Inc 2
4. Model Clusters
Use models to discover rules that describe clusters
[Figure: cluster diagram with clusters numbered 1 to 7]

Spicy  Body  Nutty
5.1    3.5   1.4
5.7    2.6   3.5
6.7    2.5   5.8
…      …     …

Spicy  Body  Nutty  In 5?
5.1    3.5   1.4    TRUE
5.7    2.6   3.5    FALSE
6.7    2.5   5.8    TRUE
…      …     …      …

In Cluster 5?
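The relabeling step shown in the two tables can be sketched in a few lines of Python. This is only an illustration of the idea, not BigML's implementation: the helper name `label_cluster_membership` is hypothetical, and the cluster id given to the second row (2) is an assumption, since the slide only shows that it is not in cluster 5.

```python
from typing import Dict, List


def label_cluster_membership(rows: List[Dict], cluster_ids: List[int], target: int) -> List[Dict]:
    """Return copies of the rows with a boolean field marking membership in `target`."""
    field = f"In {target}?"
    return [{**row, field: cid == target} for row, cid in zip(rows, cluster_ids)]


# The three rows shown on the slide.
rows = [
    {"Spicy": 5.1, "Body": 3.5, "Nutty": 1.4},
    {"Spicy": 5.7, "Body": 2.6, "Nutty": 3.5},
    {"Spicy": 6.7, "Body": 2.5, "Nutty": 5.8},
]
# Assumed cluster assignments; only "in 5" versus "not in 5" comes from the slide.
cluster_ids = [5, 2, 5]

labeled = label_cluster_membership(rows, cluster_ids, target=5)
for row in labeled:
    print(row)
```

The labeled rows can then be used as a supervised dataset whose objective field is "In 5?", so that a tree model yields rules describing the cluster.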
5. Model Clusters
• Dataset of 86 whiskies
• Each whisky is scored from 0 to 4 on each of 12 possible flavor characteristics.
GOAL: Cluster the whiskies by flavor profile, then discover rules that distinguish the clusters from each other.
6. Missing Splits
Real-world data … is messy
• Define missing tokens: N/A, Null, etc
• Filter out missing values
• Add a new feature to replace missing values
• Default numeric values in cluster
• Proportional prediction for missing input data
• Allow splits on missing values
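Two of the options above, defining missing tokens and allowing splits on missing values, can be sketched as follows. This is a minimal illustration under stated assumptions, not BigML's implementation: the token set, the function names `parse_value` and `goes_left`, and the `missing_goes_left` flag are all hypothetical.

```python
from typing import Optional

# Assumed set of tokens treated as missing (compared case-insensitively).
MISSING_TOKENS = {"", "N/A", "NULL"}


def parse_value(raw: str) -> Optional[float]:
    """Convert a raw field to a float, or None if it is a missing token."""
    token = raw.strip()
    if token.upper() in MISSING_TOKENS:
        return None
    return float(token)


def goes_left(value: Optional[float], threshold: float, missing_goes_left: bool) -> bool:
    """Split predicate 'value <= threshold' that also assigns a branch to missing values."""
    if value is None:
        return missing_goes_left
    return value <= threshold


print(goes_left(parse_value("N/A"), 2.0, missing_goes_left=True))
print(goes_left(parse_value("3.5"), 2.0, missing_goes_left=True))
```

With a predicate like this, "the value is missing" becomes information the tree can split on, instead of a reason to drop the row.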
7. Online Predictions
• Single predictions
• Computed in real-time using browser JS
• JS will be open sourced
• Available for models, ensembles, and clusters
8. Fast(er) Ensembles
Fetch Dataset (“F” secs) -> Transform Dataset (“T” secs) -> Model Dataset (“M” secs) -> Store Model (“S” secs)

Insight: if the dataset fits in memory, we can perform the fetch and transform steps once and model quickly in memory

Time for “n” models:
Old:      n * [ F + T + M + S ]
New:      F + T + n * [ M + S ]
Savings:  ( n - 1 ) * [ F + T ]
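The three timing formulas can be checked with a short calculation. The step durations below are made-up numbers chosen only to illustrate the arithmetic; they are not measurements from BigML.

```python
def old_time(n: int, F: float, T: float, M: float, S: float) -> float:
    """Every model repeats fetch, transform, model and store."""
    return n * (F + T + M + S)


def new_time(n: int, F: float, T: float, M: float, S: float) -> float:
    """Fetch and transform happen once; only model and store repeat."""
    return F + T + n * (M + S)


def savings(n: int, F: float, T: float) -> float:
    """Time saved: the (n - 1) fetch and transform passes that are skipped."""
    return (n - 1) * (F + T)


# Hypothetical timings in seconds for a 10-model ensemble.
n, F, T, M, S = 10, 30.0, 20.0, 5.0, 1.0
print(old_time(n, F, T, M, S))   # 560.0
print(new_time(n, F, T, M, S))   # 110.0
print(savings(n, F, T))          # 450.0
```

The savings grow with the number of models and depend only on F and T, so the benefit is largest when fetching and transforming dominate the per-model cost.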
9. Anomaly Detection
An unsupervised algorithm to find unusual data quickly and easily
10. Learning Tasks
Trees (Supervised Learning)
Provide: labeled data
Learning Task: be able to predict label

Cluster (Unsupervised Learning)
Provide: unlabeled data
Learning Task: group data by similarity

Anomalies (Unsupervised Learning)
Provide: unlabeled data
Learning Task: rank data by dissimilarity
11. Learning Tasks
sepal length  sepal width  petal length  petal width  species
5.1           3.5          1.4           0.2          setosa
5.7           2.6          3.5           1.0          versicolor
6.7           2.5          5.8           1.8          virginica
…             …            …             …            …
Inputs “X”: sepal length, sepal width, petal length, petal width; “Y”: species
Learning Task: Find function “f” such that f(X)≈Y

sepal length  sepal width  petal length  petal width
5.1           3.5          1.4           0.2
5.7           2.6          3.5           1.0
6.7           2.5          5.8           1.8
…             …            …             …
Learning Task: Find “k” clusters such that the data in each cluster is self-similar

sepal length  sepal width  petal length  petal width
5.1           3.5          1.4           0.2
5.7           2.6          3.5           1.0
6.7           2.5          5.8           1.8
…             …            …             …
Learning Task: Assign value from 0 (similar) to 1 (dissimilar) to each instance.
12. Anomalies
Isolation Forest:
Grow a random decision tree until each instance is in its own leaf
[Figure: random tree; instances that are “easy” to isolate end up at a shallow Depth, instances that are “hard” to isolate end up deeper]
Now repeat the process several times and use average Depth to compute anomaly score: 0 (similar) -> 1 (dissimilar)
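The procedure above can be sketched in plain Python. This is a simplified illustration, not BigML's implementation. It follows only the branch containing the point of interest, which gives the same depth as growing the whole random tree. To map average depth into the 0 to 1 range it uses the normalization from the original isolation forest paper, 2^(-average depth / c(n)); whether BigML uses that exact mapping is an assumption. The function names and the toy data are hypothetical.

```python
import random


def isolation_depth(point, data, rng, depth=0, max_depth=50):
    """Depth at which `point` is isolated by random splits over `data`."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    # Only features that still vary can separate the remaining rows.
    candidates = [f for f in range(len(point))
                  if min(r[f] for r in data) != max(r[f] for r in data)]
    if not candidates:
        return depth
    feature = rng.choice(candidates)
    lo = min(r[feature] for r in data)
    hi = max(r[feature] for r in data)
    split = rng.uniform(lo, hi)
    # Keep only the rows that fall on the same side of the split as the point.
    same_side = [r for r in data if (r[feature] < split) == (point[feature] < split)]
    return isolation_depth(point, same_side, rng, depth + 1, max_depth)


def average_path_length(n):
    """Normalizing constant c(n) = 2 * H(n - 1) - 2 * (n - 1) / n."""
    harmonic = sum(1.0 / i for i in range(1, n))
    return 2.0 * harmonic - 2.0 * (n - 1) / n


def anomaly_score(point, data, trees=100, seed=0):
    """Average isolation depth over many random trees, mapped into (0, 1]."""
    if len(data) < 2:
        raise ValueError("need at least two rows")
    rng = random.Random(seed)
    avg_depth = sum(isolation_depth(point, data, rng) for _ in range(trees)) / trees
    return 2.0 ** (-avg_depth / average_path_length(len(data)))


# Toy data: a tight 5 x 5 grid of points plus one obvious outlier.
grid = [(5.0 + 0.1 * i, 3.0 + 0.1 * j) for i in range(5) for j in range(5)]
outlier = (9.0, 0.5)
data = grid + [outlier]

print("outlier:", anomaly_score(outlier, data))
print("inlier: ", anomaly_score((5.2, 3.2), data))
```

The outlier is usually separated by the first or second random split, so its average depth is small and its score is closer to 1; the point in the middle of the grid needs many more splits and scores lower.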