Machine Learning with Python & Vertica using VerticaPy
Matteo Monaldi, Data Scientist
Bring Analytics to the next level
Data Science Challenges
Data Science Lifecycle
Stages: Business Understanding, Data Mining & Cleaning, Data Exploration, Data Preparation, Machine Learning, Model Evaluation, Data Visualization, Model Deployment & Maintenance, End of Project
Decision gates (each with True/False branches): Data Quality?, Enough Data?, Good Enough? (twice)
Roles: Business Analyst, Data Engineer, Data Scientist
Tracks: Data Visualization Process, Data Science Process
Vertica Machine Learning
Vertica Supports the Entire Data Science Process
Process stages: Business Understanding, Data Analysis/Exploration, Data Preparation, Model Training, Model Evaluation, Deployment/Management
In-Database Scoring, Speed at Scale, Security
Statistical Summary, Sessionization, Pattern Matching, Date/Time Algebra, Window/Partition, Date Type Handling, Sequences, and more…
Outlier Detection, Normalization, Imbalanced Data Processing, Sampling, Test/Validation Split, Time Series, Missing Value Imputation, and more…
K-Means, Support Vector Machines, Logistic, Linear, Ridge Regression, Naïve Bayes, Cross Validation, and more…
Model-level Stats, ROC Tables, Error Rate, Lift Table, Confusion Matrix, R-Squared, MSE
XGBoost
Filtering, Feature Selection, Correlation Matrices
Table-Like Management, Versioning, Authorization, and more…
Random Forests, and more…
Principal Component Analysis
PMML Import/Export
TensorFlow Import
Advantages of Vertica in-database Machine Learning
[Diagram: a Vertica cluster (NODE 1, NODE 2, ... NODE N) linked by a network; each node holds Schema, Tables, and Models]
Eliminating overhead of data transfer
Data Security and Provenance
Model Storage and Management
Serving concurrent users
Highly scalable ML functionalities
Avoiding maintenance cost of a separate system
Sampling vs. full dataset
Large data results in better generalization
Downsampling: lack of generalization & over-fitting
Full Dataset: generalization & data-driven
Anomaly Detection
Moving Windows
Sessionization
Interpolations
Missing Value
Normalization
Supervised Learning
Vertica Model Management

List Models
SELECT * FROM models;

schema_name | owner_name | model_name       | model_type          | model_size | create_date
------------+------------+------------------+---------------------+------------+------------
public      | dbadmin    | myKmeansModel    | kmeans              | 845        | 2020-07-14
public      | dbadmin    | logisticRegModel | logistic regression | 780        | 2021-05-07
public      | dbadmin    | linearRegModel   | linear regression   | 399        | 2021-02-04

Change Model Name, Owner and Schema
ALTER MODEL mymodel RENAME TO mykmeansmodel;
ALTER MODEL mykmeansmodel OWNER TO user1;
ALTER MODEL mykmeansmodel SET SCHEMA public;

Delete Models
DROP MODEL myLinearModel;

Summarize Models
SELECT Summarize_Model('LogisticRegModel');

predictor | coefficient | p_value
----------+-------------+--------
Intercept | 2.548       | 0.0000
x1        | -0.0345     | 0.0000
x2        | 0.0048      | 0.0126
x2        | -0.9769     | 0.0678
View Models, Alter Models, Summarize Models, Drop Models, Get Model Attributes, Manage Model Security
Get Model Attributes
SELECT GET_MODEL_ATTRIBUTES(USING PARAMETERS
model_name = 'LogisticRegModel');

======= details =======
predictor|coefficient|std_err |z_value |p_value
---------+-----------+--------+--------+--------
Intercept|  2.54895  | 0.39019| 6.53263| 0.00000
age      | -0.03453  | 0.00563|-6.13772| 0.00000
fare     |  0.00408  | 0.00174| 2.34950| 0.01880
pclass   | -0.97692  | 0.11551|-8.45772| 0.00000

============== regularization ==============
type| lambda
----+--------
none| 1.00000

=========== call_string ===========
logistic_reg('public.lr_titanic', 'titanic', '"survived"', '"age", "fare", "pclass"'
USING PARAMETERS optimizer='newton', epsilon=1e-06, max_iterations=100, regularization='none', lambda=1, alpha=0.5)

=============== Additional Info ===============
Name              |Value
------------------+-----
iteration_count   | 4
rejected_row_count| 238
accepted_row_count| 996

Model's Security
GRANT USAGE ON MODEL myLinearModel TO user2;
Multiple Ways of Solving Data Science Challenges
Vertica offers several complementary routes to machine learning: in-database machine learning in SQL, in-database data science in Python, user-defined functions in 4 programming languages (C++, R, Python & Java), and integration with TensorFlow & PMML for model import and export.
SQL Front End | VerticaPy - Python API | User Defined Functions | Import TensorFlow | Import & Export PMML
VerticaPy
Bring Analytics to the next level with VerticaPy
Python Front-end & SQL Back-end
https://www.vertica.com/python/
https://github.com/vertica/VerticaPy
VerticaPy
In DataBase Data Science
Database Connection
Jupyter Notebooks: Jupyter/Python is a popular tool of choice for data scientists & analysts.
Vertica DataBase: Much of the heavy computation is done by Vertica. Model storage & management in Vertica.
High Security: Data stays in Vertica (security, integrity, scalability).
Open-Source: Users can contribute. No added cost on software. Constantly updated roadmap.
Aggregates
Build Models
vModel
vDataFrame
DYNAMIC
Data Exploration
Data
Preparation
Model Evaluation
VerticaPy
In Database Data Science
SELECT
SUMMARIZE_NUMCOL("SepalLengthCm",
"PetalWidthCm",
"PetalLengthCm",
"SepalWidthCm")
OVER ()
FROM "public"."iris"
Example – Summary Statistics
from verticapy import vDataFrame
# Object Creation (the class name is vDataFrame, with a capital F)
vdf = vDataFrame("public.iris", dsn = "VerticaDSN")
# Describe: the summary statistics are computed by Vertica
vdf.describe()
Client
Output
Distributed
Execution
VerticaPy allows easy Data Preparation & Code Deployment
from verticapy.datasets import load_titanic
titanic = load_titanic()
# Doing some Data Preparation
titanic.fillna()
titanic["family_size"] = "parch + sibsp + 1"
titanic.normalize()
# Current vDataFrame relation
display(titanic.current_relation())
SELECT
("fare" - 33.9637936739659) / (52.6247198802501) AS "fare",
"sex",
("body" - 164.14406779661) / (29.7495704692386) AS "body",
"pclass",
("age" - 30.1524573721163) / (12.9740056939445) AS "age",
"name",
"cabin",
"parch",
"survived",
"boat",
"ticket",
"embarked",
"home.dest",
"sibsp",
("family_size" - 1.88249594813614) / (1.58407574155133) AS "family_size"
FROM
(SELECT
COALESCE("fare", 33.9637936739659) AS "fare",
"sex",
COALESCE("body", 164.14406779661) AS "body",
("pclass" - 2.28444084278768) / (0.842485636190292) AS "pclass",
COALESCE("age", 30.1524573721163) AS "age",
"name",
COALESCE("cabin", 'C23 C25 C27') AS "cabin",
("parch" - 0.378444084278768) / (0.868604707790392) AS "parch",
("survived" - 0.364667747163695) / (0.481532018641288) AS "survived",
COALESCE("boat", '13') AS "boat",
"ticket",
COALESCE("embarked", 'S') AS "embarked",
COALESCE("home.dest", 'New York, NY') AS "home.dest",
("sibsp" - 0.504051863857374) / (1.04111727241629) AS "sibsp",
"parch" + "sibsp" + 1 AS "family_size"
FROM "public"."titanic") VERTICAPY_SUBTABLE
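The generated SQL combines two steps: missing values are filled with COALESCE, then each numeric column is centered and scaled. The following is a minimal in-memory sketch of those two steps, not VerticaPy's implementation: the function names and the sample ages are hypothetical, and the use of the sample standard deviation is an assumption.

```python
from statistics import mean, stdev


def impute_mean(values):
    """Replace None with the mean of the observed values (the COALESCE step)."""
    observed = [v for v in values if v is not None]
    avg = mean(observed)
    return [avg if v is None else v for v in values]


def zscore(values):
    """Center on the mean and divide by the sample standard deviation."""
    avg, std = mean(values), stdev(values)
    return [(v - avg) / std for v in values]


# Hypothetical sample column with one missing value.
ages = [22.0, None, 38.0, 30.0]
filled = impute_mean(ages)
scaled = zscore(filled)
```

Imputing with the mean leaves the column mean unchanged, which is consistent with the same constant appearing in both the COALESCE and the normalization expression of a column in the query above.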
VerticaPy is a complete statistical software
Correlations
Heteroscedasticity
Trend
&
Stationarity
Normality
Cramer’s V
Biserial Point
Kendall
Spearman
Pearson
Breusch-Pagan
Goldfeld-Quandt
White’s Lagrange
Engle
Augmented Dickey-Fuller
Mann-Kendall
Normaltest
Regular Joins
Joins
XGBoost
Random Forest
Tree
Based
Models
Linear Regression
Logistic Regression
LinearSVC
Linear
Models
KMeans
Bisecting KMeans
Clustering
Time Series Joins
Spatial Joins
Everything you need to visualize your data
COMPARISON CHARTS DISTRIBUTION CHARTS RELATIONSHIP CHARTS
TREND CHARTS GEOSPATIAL CHARTS ANIMATED CHARTS
PROPORTION CHARTS
Export your Graphics in various formats
Print Chart
Download PNG image
Download JPEG image
Download PDF document
Download SVG vector image
'<!DOCTYPE html>\n<html lang="en">\n <head>\n <meta charset="utf-8" />\n <link href="https://www.highcharts.com/highslide/highslide.css"
rel="stylesheet" />\n <script type="text/javascript" src="https://ajax.googleapis.com/ajax/libs/jquery/1.9.1/jquery.min.js"></script>\n <script
type="text/javascript" src="https://code.highcharts.com/6/highcharts.js"></script>\n <script type="text/javascript"
src="https://code.highcharts.com/6/highcharts-more.js"></script>\n <script type="text/javascript"
src="https://code.highcharts.com/6/modules/heatmap.js"></script>\n <script type="text/javascript"
src="https://code.highcharts.com/6/modules/exporting.js"></script>\n </head>\n <body style="margin:0;padding:0">\n <div id="container"
style="width:600px;height:400px;">Loading....</div>\n\n\n <script>\n $(function(){\n\n\n\n\n\n Highcharts.setOptions({"global": {}, "lang": {}});\n var
option = {"chart": {"renderTo": "container", "width": 600, "height": 400, "inverted": true}, "colors": ["#263133", "#FE5016", "#0073E7", "#19A26B",
"#FCDB1F", "#000000", "#2A6A74", "#861889", "#00B4E0", "#90EE90", "#FF7F50", "#B03A89"], "credits": {"enabled": false}, "drilldown": {},
"exporting": {}, "labels": {}, "legend": {"enabled": false}, "loading": {}, "navigation": {}, "pane": {}, "plotOptions": {}, "series": {}, "subtitle": {}, "title":
{"text": ""}, "tooltip": {"headerFormat": "", "pointFormat": "{point.y}"}, "xAxis": {"reversed": false, "title": {"enabled": true, "text": "Contract"},
"maxPadding": 0.05, "showLastLabel": true, "categories": ["Month-to-month", "One year", "Two year"]}, "yAxis": {"title": {"text": "Churn_Rate
(%)"}}};\n\n\n \n\n var chart = new Highcharts.Chart(option);\n\n var data = [{"data": [42.7096774193548, 11.2695179904956, 2.83185840707965],
"type": "bar", "name": "Churn_Rate (%)", "colorByPoint": true}];\n var dataLen = data.length;\n for (var ix = 0; ix < dataLen; ix++) {\n
chart.addSeries(data[ix]);\n }\n \n\n\n\n\n\n\n\n \n });\n </script>\n\n </body>\n</html>'
Build HTML page
Exporting to HTML produces code that can be
embedded independently in many GUIs, as
long as they support JavaScript.
# High Charts Graphics are objects
# with numerous methods
class Highchart(builtins.object)
buildcontainer()
buildhtml()
buildhtmlheader()
save_file(filename = 'Chart')
VerticaPy Delphi: Automated Machine Learning
Logistic Regression XGBoost Random Forest SVM Naive Bayes
EFFICIENT
MODEST
EFFICIENT & PERFORMANT
PERFORMANT
SCORE
TIME
STD [PRECISION]
Auto Data Preparation
Uses One Hot Encoding, Label Encoding,
Missing Values Imputation and other data
preparation techniques to preprocess the data
Auto Grid Search CV
Tests many combinations from different parameter
grids to find an optimal one.
Auto Variables Selection
Once the algorithm is selected, a stepwise
algorithm searches for a suitable set of features.
Geospatial
Supported Spatial Objects
GEOMETRY (x,y)
Spatial object with coordinates expressed as (x,y) pairs, defined in
the Cartesian plane. All calculations use Cartesian coordinates. Because
the object is projected onto the Cartesian plane, computations are
simpler than with the GEOGRAPHY type. Use this type whenever you can.
GEOGRAPHY (longitude,latitude)
Spatial object defined as on the surface of a perfect sphere, or a
spatial object in the WGS84 coordinate system. Coordinates are
expressed in longitude/latitude angular values, measured in degrees.
All calculations are in metres.
The maximum size of a GEOMETRY or GEOGRAPHY data
type is 10 MB. You cannot modify the size or data type of
a GEOMETRY or GEOGRAPHY column after creation.
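To make the difference between the two types concrete, here is a small sketch that contrasts a planar distance with a great-circle distance in metres. It is only an illustration in plain Python: the function names are hypothetical, the haversine formula and the mean Earth radius of 6,371 km are assumptions chosen for the example, and nothing here claims to match Vertica's internal computation.

```python
import math

EARTH_RADIUS_M = 6_371_000  # assumed mean radius of a perfectly spherical Earth


def cartesian_distance(p, q):
    """Planar distance between two (x, y) points, as for a GEOMETRY-style object."""
    return math.hypot(q[0] - p[0], q[1] - p[1])


def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in metres between two (longitude, latitude) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


planar = cartesian_distance((0, 0), (3, 4))
one_degree_on_equator = haversine_m(0, 0, 1, 0)
```

The planar version needs one square root, while the spherical version needs several trigonometric calls per pair of points, which illustrates why Cartesian computations are the simpler of the two.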
VerticaPy vDataFrame
from verticapy import vDataFrame
from verticapy.geo import *
from verticapy.datasets import load_world, load_cities
cities = vDataFrame("map.cities")
world = vDataFrame("map.world")
# Creating Index
create_index(world, "id", "geometry", "world_polygons", True)
# Computing the intersections
intersect(cities, "world_polygons", "id", x="lat", y="lon")

VerticaPy SQL Magic
%load_ext verticapy.sql
%%sql
SELECT STV_Create_Index(id, geometry
USING PARAMETERS
index = 'world_polygons',
overwrite = true,
max_mem_mb=256) OVER() FROM pols;
SELECT STV_Intersect(lat, lon
USING PARAMETERS index = 'world_polygons');
Spatial Joins
[Map: points 1 to 6 plotted over the polygons. INTERSECT output:]
point_id | polygon_id
---------+-----------
1        | 3
2        | 3
3        | 3
4        | 3
5        | 2
6        | 1
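A spatial join pairs each point with the polygon it falls in. The sketch below shows the idea in plain Python with a ray-casting point-in-polygon test and a brute-force loop. It is an illustration only: Vertica relies on a spatial index instead of a nested loop, and the function names, squares, and points are all hypothetical.

```python
def point_in_polygon(x, y, polygon):
    """Ray casting: count how many polygon edges a horizontal ray from (x, y) crosses."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # The edge can only be crossed if its endpoints lie on opposite sides of the ray.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def spatial_join(points, polygons):
    """Return (point_id, polygon_id) pairs, using the first polygon containing each point."""
    pairs = []
    for point_id, (x, y) in points.items():
        for polygon_id, polygon in polygons.items():
            if point_in_polygon(x, y, polygon):
                pairs.append((point_id, polygon_id))
                break
    return pairs


# Hypothetical data: two unit-style squares and three points (the last one is outside both).
polygons = {1: [(0, 0), (2, 0), (2, 2), (0, 2)], 2: [(3, 0), (5, 0), (5, 2), (3, 2)]}
points = {1: (1, 1), 2: (4, 1), 3: (9, 9)}
matches = spatial_join(points, polygons)
```

Points that fall in no polygon simply produce no row, which mirrors an inner join on containment.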
Time Series
Challenges with Irregular Time Series
[Chart: readings at irregular times between 9:00 and 17:19, with gaps such as ∆t = 8 min (16:32 to 16:40) and ∆t = 6 h 51 min (9:41 to 16:32)]
Inconsistent Models
Most time series models are auto-regressive.
With an irregular series, a lag no longer covers
a fixed interval, so variables are wrongly defined.
Wrong Aggregations
Aggregations become difficult to compute
because the resulting variables may be
inconsistent.
Huge Gaps
Irregular time series may contain huge
gaps between records, which can change
what a variable means.
Difficult Joins
Joins become difficult because they need
an exact match. In Vertica, time series joins
address this problem.
Challenges with Irregular Time Series
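Source 1 (irregular timestamps)
time  | value
09:00 | 5
09:20 | 6
09:26 | 3.5
09:41 | 0
16:32 | 3
16:40 | 6
17:00 | 5
17:19 | 1

Source 2 (regular timestamps)
time  | value
09:00 | 10
09:30 | 11
10:00 | 12
10:30 | 12.5
11:00 | 13
11:30 | 13.5
12:00 | 14
12:30 | 14.5

The different data sources cannot be joined on exact timestamps.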
from verticapy import vDataFrame
sm_weather = vDataFrame("sm.weather")
sm_consumption = vDataFrame("sm.consumption")
# Time series join: rows are matched on "dateUTC" by interpolation, not by exact equality
sm_weather.join(sm_consumption,
how = "left",
on_interpolate = {"dateUTC": "dateUTC"},
expr2 = ["temperature", "humidity"])
Solution – Time Series Join
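JOINING TO THE CLOSEST TIMESTAMP
[Figure: the irregular readings (09:00 5, 09:20 6, 09:26 3.5, 09:41 0, 16:32 3) are each matched to a timestamp of the regular series (09:00, 09:30, 10:00, 10:30, 11:00, 16:30; values 10, 11, 12, 12.5, 13)]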
SELECT * FROM
"sm"."weather" AS x
LEFT JOIN
"sm"."consumption" AS y
ON x."dateUTC" INTERPOLATE PREVIOUS VALUE y."dateUTC"
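The INTERPOLATE PREVIOUS VALUE join can be pictured as follows: for each row on the left, take the right-hand row with the latest timestamp that is not after it. The sketch below reproduces that logic in plain Python on the sample values shown on the slides. The function names are hypothetical, and this is a small in-memory illustration rather than how Vertica executes the join.

```python
from bisect import bisect_right


def to_minutes(hhmm):
    """Convert an 'HH:MM' string to minutes since midnight."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def interpolate_previous(left, right):
    """Attach to each left row the right value with the latest timestamp <= the left timestamp."""
    right_sorted = sorted(right, key=lambda row: to_minutes(row[0]))
    right_times = [to_minutes(t) for t, _ in right_sorted]
    joined = []
    for t, value in left:
        i = bisect_right(right_times, to_minutes(t)) - 1
        match = right_sorted[i][1] if i >= 0 else None  # None when nothing precedes t
        joined.append((t, value, match))
    return joined


irregular = [("09:00", 5), ("09:20", 6), ("09:26", 3.5), ("09:41", 0), ("16:32", 3)]
regular = [("09:00", 10), ("09:30", 11), ("10:00", 12), ("10:30", 12.5), ("11:00", 13)]
joined = interpolate_previous(irregular, regular)
```

With these inputs every irregular reading finds a partner even though only 09:00 matches exactly, which is the point of the time series join.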
VerticaPy SQL Magic
VerticaPy vDataFrame
from verticapy import vDataFrame
# Creating the vDataFrame
sm_consumption = vDataFrame("sm.consumption")
# Time Series Slicing & Interpolation
sm_consumption.asfreq(ts = "dateUTC",
rule = "30 minutes",
method = {"value": "linear"},
by = ["meterID"])
Solution – Time Series Slicing & Interpolation
SELECT
slice_time AS "dateUTC",
"meterID",
TS_FIRST_VALUE("value", 'linear') AS "value"
FROM
"sm"."consumption"
TIMESERIES
slice_time AS '30 minutes'
OVER (PARTITION BY "meterID" ORDER BY "dateUTC")
VerticaPy SQL Magic VerticaPy vDataFrame
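Before (irregular timestamps)
time  | value
09:00 | 5
09:20 | 6
09:26 | 3.5
09:41 | 0
16:32 | 3

After (30-minute slices, linear interpolation)
time  | value
09:00 | 5
09:30 | 2.56666667
10:00 | 0.13868613
10:30 | 0.35766424
11:00 | 0.576642336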
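The interpolated values can be checked by hand: each 30-minute slice takes the value on the straight line between the two readings that surround it. The sketch below is a plain-Python illustration of that arithmetic using the readings from the slide, with times expressed as minutes since midnight. The function name and signature are hypothetical, and this is not the TIMESERIES implementation.

```python
from bisect import bisect_left


def resample_linear(samples, start, stop, step):
    """Linearly interpolate sorted (minute, value) samples onto a regular grid."""
    times = [t for t, _ in samples]
    out = []
    for t in range(start, stop + 1, step):
        i = bisect_left(times, t)
        if i < len(times) and times[i] == t:
            out.append((t, samples[i][1]))  # exact hit, no interpolation needed
        elif 0 < i < len(times):
            (t0, v0), (t1, v1) = samples[i - 1], samples[i]
            out.append((t, v0 + (v1 - v0) * (t - t0) / (t1 - t0)))
        else:
            out.append((t, None))  # outside the observed range
    return out


# 09:00, 09:20, 09:26, 09:41 and 16:32 as minutes since midnight.
readings = [(540, 5), (560, 6), (566, 3.5), (581, 0), (992, 3)]
grid = resample_linear(readings, start=540, stop=660, step=30)
```

For example, 09:30 lies 4 minutes into the 15-minute interval between 09:26 (3.5) and 09:41 (0), giving 3.5 - 3.5 × 4/15 ≈ 2.5667.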
Model Deployment
Deploy your Pipeline in Vertica
Vertica built-in SQL
SELECT
"age",
"tenure",
PREDICT_LOGISTIC_REG("age", "tenure"
USING PARAMETERS model_name = 'telco.lr') AS "churn_score"
FROM
(SELECT
APPLY_NORMALIZE("age", "tenure"
USING PARAMETERS model_name = 'telco.normalizer')
FROM "telco"."customers") x;

TensorFlow Integration
SELECT
PREDICT_TENSORFLOW("customer_id", "age", "tenure"
USING PARAMETERS
model_name = 'telco.lr',
num_passthru_cols = 1)
OVER(PARTITION BEST)
FROM "telco"."customers";

PMML Integration
SELECT
"age",
"tenure",
PREDICT_PMML("age", "tenure"
USING PARAMETERS
model_name = 'telco.lr') AS "churn_score"
FROM "telco"."customers";

Standard SQL
SELECT
"age",
"tenure",
1 / (1 + EXP(-(0.5 * "age" - 0.3 * "tenure" - 0.9))) AS "churn_score"
FROM
(SELECT
("age" - 13) / 80 AS "age",
("tenure" - 5) / 48 AS "tenure"
FROM "telco"."customers") x;

H2O Integration
SELECT
"age",
"tenure",
H2OModelScore("age", "tenure"
USING PARAMETERS
model_name = 'telco.lr') AS "churn_score"
FROM "telco"."customers";
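The Standard SQL variant spells the whole pipeline out as arithmetic: scale both inputs, form the linear combination, then apply the logistic function. The sketch below mirrors those steps in Python with the constants from that query. The function name and the sample customer (age 53, tenure 29) are hypothetical and chosen only to make the numbers easy to follow.

```python
import math


def churn_score(age, tenure):
    """Scale the inputs, combine them linearly, and apply the logistic function."""
    age_n = (age - 13) / 80          # same scaling as the inner SELECT
    tenure_n = (tenure - 5) / 48
    z = 0.5 * age_n - 0.3 * tenure_n - 0.9
    return 1 / (1 + math.exp(-z))


# Hypothetical customer: both scaled inputs equal 0.5, so z = 0.25 - 0.15 - 0.9 = -0.8.
score = churn_score(age=53, tenure=29)
```

Because the expression uses only basic arithmetic and EXP, a model deployed this way can run in any SQL engine without model-specific functions.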
SQL
SQL
SQL Script Scheduler
User Defined Functions
Bring your own Lambda function
Python Front-end & SQL Back-end
VerticaPy
In DataBase Data Science
Vertica Cluster
Did you know? UDFs are also available in Java, C++ & R.
Python Function UDF Generation
Python UDFs Generation
Process the data using blocks
Scalable in-DB Computations
Stored in all the cluster’s nodes
Flexible Python Implementation
Block 1
Block 2
Block 3
Block 4
𝝀
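The idea of processing data in blocks can be sketched without a database: the input is cut into fixed-size blocks and the same Python function is applied to each one. This is only an illustration of the pattern, under the assumption that the function maps a block of values to a block of results. It is not the UDF generator itself, and every name in it is hypothetical.

```python
from itertools import islice


def blocks(rows, size):
    """Yield successive lists of at most `size` rows."""
    iterator = iter(rows)
    while True:
        block = list(islice(iterator, size))
        if not block:
            return
        yield block


def apply_by_block(func, rows, size):
    """Apply `func` to each block and concatenate the per-block results."""
    results = []
    for block in blocks(rows, size):
        results.extend(func(block))
    return results


def celsius_to_fahrenheit(block):
    """Hypothetical user function: converts one block of temperatures."""
    return [c * 9 / 5 + 32 for c in block]


converted = apply_by_block(celsius_to_fahrenheit, [0, 10, 20, 30, 100], size=2)
```

Since each block is handled independently, the blocks could in principle be spread across workers, which is what makes the approach scalable.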
NEW VIDEOS
EVERY WEDNESDAY
SQL
SUBSCRIBE TO OUR CHANNEL
THE PYTHON API FOR VERTICA DATA SCIENCE AT SCALE
MACHINE LEARNING • BIG DATA • TIME SERIES • DYNAMIC CHARTS
www.github.com/vertica/VerticaPy
www.vertica.com/python/
www
Github
www.linkedin.com/company/verticapy/
Thank you.
www.vertica.com
Machine Learning with Python & Vertica using VerticaPy

More Related Content

Similar to Machine Learning with Python & Vertica using VerticaPy

Lessons Learned from Modernizing USCIS Data Analytics Platform
Lessons Learned from Modernizing USCIS Data Analytics PlatformLessons Learned from Modernizing USCIS Data Analytics Platform
Lessons Learned from Modernizing USCIS Data Analytics PlatformDatabricks
 
Bring Your Own Recipes Hands-On Session
Bring Your Own Recipes Hands-On Session Bring Your Own Recipes Hands-On Session
Bring Your Own Recipes Hands-On Session Sri Ambati
 
Metadata and Provenance for ML Pipelines with Hopsworks
Metadata and Provenance for ML Pipelines with Hopsworks Metadata and Provenance for ML Pipelines with Hopsworks
Metadata and Provenance for ML Pipelines with Hopsworks Jim Dowling
 
Machine Learning Models in Production
Machine Learning Models in ProductionMachine Learning Models in Production
Machine Learning Models in ProductionDataWorks Summit
 
Data science-toolchain
Data science-toolchainData science-toolchain
Data science-toolchainJie-Han Chen
 
MLOps - Build pipelines with Tensor Flow Extended & Kubeflow
MLOps - Build pipelines with Tensor Flow Extended & KubeflowMLOps - Build pipelines with Tensor Flow Extended & Kubeflow
MLOps - Build pipelines with Tensor Flow Extended & KubeflowJan Kirenz
 
Azure Databricks for Data Scientists
Azure Databricks for Data ScientistsAzure Databricks for Data Scientists
Azure Databricks for Data ScientistsRichard Garris
 
GDG Cloud Southlake #16: Priyanka Vergadia: Scalable Data Analytics in Google...
GDG Cloud Southlake #16: Priyanka Vergadia: Scalable Data Analytics in Google...GDG Cloud Southlake #16: Priyanka Vergadia: Scalable Data Analytics in Google...
GDG Cloud Southlake #16: Priyanka Vergadia: Scalable Data Analytics in Google...James Anderson
 
Overview Of Parallel Development - Ericnel
Overview Of Parallel Development -  EricnelOverview Of Parallel Development -  Ericnel
Overview Of Parallel Development - Ericnelukdpe
 
Why apache Flink is the 4G of Big Data Analytics Frameworks
Why apache Flink is the 4G of Big Data Analytics FrameworksWhy apache Flink is the 4G of Big Data Analytics Frameworks
Why apache Flink is the 4G of Big Data Analytics FrameworksSlim Baltagi
 
Best Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache SparkBest Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache SparkDatabricks
 
Spark Based Distributed Deep Learning Framework For Big Data Applications
Spark Based Distributed Deep Learning Framework For Big Data Applications Spark Based Distributed Deep Learning Framework For Big Data Applications
Spark Based Distributed Deep Learning Framework For Big Data Applications Humoyun Ahmedov
 
IMCSummit 2015 - Day 2 Developer Track - Implementing a Highly Scalable In-Me...
IMCSummit 2015 - Day 2 Developer Track - Implementing a Highly Scalable In-Me...IMCSummit 2015 - Day 2 Developer Track - Implementing a Highly Scalable In-Me...
IMCSummit 2015 - Day 2 Developer Track - Implementing a Highly Scalable In-Me...In-Memory Computing Summit
 
20160317 - PAZUR - PowerBI & R
20160317  - PAZUR - PowerBI & R20160317  - PAZUR - PowerBI & R
20160317 - PAZUR - PowerBI & RŁukasz Grala
 
Graph Data: a New Data Management Frontier
Graph Data: a New Data Management FrontierGraph Data: a New Data Management Frontier
Graph Data: a New Data Management FrontierDemai Ni
 
Swift Parallel Scripting for High-Performance Workflow
Swift Parallel Scripting for High-Performance WorkflowSwift Parallel Scripting for High-Performance Workflow
Swift Parallel Scripting for High-Performance WorkflowDaniel S. Katz
 
Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...
Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...
Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...Shirshanka Das
 

Similar to Machine Learning with Python & Vertica using VerticaPy (20)

Lessons Learned from Modernizing USCIS Data Analytics Platform
Lessons Learned from Modernizing USCIS Data Analytics PlatformLessons Learned from Modernizing USCIS Data Analytics Platform
Lessons Learned from Modernizing USCIS Data Analytics Platform
 
BigData
BigDataBigData
BigData
 
Bring Your Own Recipes Hands-On Session
Bring Your Own Recipes Hands-On Session Bring Your Own Recipes Hands-On Session
Bring Your Own Recipes Hands-On Session
 
Metadata and Provenance for ML Pipelines with Hopsworks
Metadata and Provenance for ML Pipelines with Hopsworks Metadata and Provenance for ML Pipelines with Hopsworks
Metadata and Provenance for ML Pipelines with Hopsworks
 
Machine Learning Models in Production
Machine Learning Models in ProductionMachine Learning Models in Production
Machine Learning Models in Production
 
Data science-toolchain
Data science-toolchainData science-toolchain
Data science-toolchain
 
MLOps - Build pipelines with Tensor Flow Extended & Kubeflow
MLOps - Build pipelines with Tensor Flow Extended & KubeflowMLOps - Build pipelines with Tensor Flow Extended & Kubeflow
MLOps - Build pipelines with Tensor Flow Extended & Kubeflow
 
Azure Databricks for Data Scientists
Azure Databricks for Data ScientistsAzure Databricks for Data Scientists
Azure Databricks for Data Scientists
 
GDG Cloud Southlake #16: Priyanka Vergadia: Scalable Data Analytics in Google...
GDG Cloud Southlake #16: Priyanka Vergadia: Scalable Data Analytics in Google...GDG Cloud Southlake #16: Priyanka Vergadia: Scalable Data Analytics in Google...
GDG Cloud Southlake #16: Priyanka Vergadia: Scalable Data Analytics in Google...
 
Overview Of Parallel Development - Ericnel
Overview Of Parallel Development -  EricnelOverview Of Parallel Development -  Ericnel
Overview Of Parallel Development - Ericnel
 
IOT, Streaming Analytics and Machine Learning
IOT, Streaming Analytics and Machine Learning IOT, Streaming Analytics and Machine Learning
IOT, Streaming Analytics and Machine Learning
 
Why apache Flink is the 4G of Big Data Analytics Frameworks
Why apache Flink is the 4G of Big Data Analytics FrameworksWhy apache Flink is the 4G of Big Data Analytics Frameworks
Why apache Flink is the 4G of Big Data Analytics Frameworks
 
Best Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache SparkBest Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache Spark
 
Spark Based Distributed Deep Learning Framework For Big Data Applications
Spark Based Distributed Deep Learning Framework For Big Data Applications Spark Based Distributed Deep Learning Framework For Big Data Applications
Spark Based Distributed Deep Learning Framework For Big Data Applications
 
IMCSummit 2015 - Day 2 Developer Track - Implementing a Highly Scalable In-Me...
IMCSummit 2015 - Day 2 Developer Track - Implementing a Highly Scalable In-Me...IMCSummit 2015 - Day 2 Developer Track - Implementing a Highly Scalable In-Me...
IMCSummit 2015 - Day 2 Developer Track - Implementing a Highly Scalable In-Me...
 
20160317 - PAZUR - PowerBI & R
20160317  - PAZUR - PowerBI & R20160317  - PAZUR - PowerBI & R
20160317 - PAZUR - PowerBI & R
 
Graph Data: a New Data Management Frontier
Graph Data: a New Data Management FrontierGraph Data: a New Data Management Frontier
Graph Data: a New Data Management Frontier
 
AI meets Big Data
AI meets Big DataAI meets Big Data
AI meets Big Data
 
Swift Parallel Scripting for High-Performance Workflow
Swift Parallel Scripting for High-Performance WorkflowSwift Parallel Scripting for High-Performance Workflow
Swift Parallel Scripting for High-Performance Workflow
 
Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...
Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...
Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...
 

Recently uploaded

MENTAL STATUS EXAMINATION format.docx
MENTAL     STATUS EXAMINATION format.docxMENTAL     STATUS EXAMINATION format.docx
MENTAL STATUS EXAMINATION format.docxPoojaSen20
 
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991RKavithamani
 
PSYCHIATRIC History collection FORMAT.pptx
PSYCHIATRIC   History collection FORMAT.pptxPSYCHIATRIC   History collection FORMAT.pptx
PSYCHIATRIC History collection FORMAT.pptxPoojaSen20
 
Concept of Vouching. B.Com(Hons) /B.Compdf
Concept of Vouching. B.Com(Hons) /B.CompdfConcept of Vouching. B.Com(Hons) /B.Compdf
Concept of Vouching. B.Com(Hons) /B.CompdfUmakantAnnand
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfsanyamsingh5019
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxiammrhaywood
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxGaneshChakor2
 
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17Celine George
 
Separation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesSeparation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesFatimaKhan178732
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityGeoBlogs
 
Hybridoma Technology ( Production , Purification , and Application )
Hybridoma Technology  ( Production , Purification , and Application  ) Hybridoma Technology  ( Production , Purification , and Application  )
Hybridoma Technology ( Production , Purification , and Application ) Sakshi Ghasle
 
_Math 4-Q4 Week 5.pptx Steps in Collecting Data
_Math 4-Q4 Week 5.pptx Steps in Collecting Data_Math 4-Q4 Week 5.pptx Steps in Collecting Data
_Math 4-Q4 Week 5.pptx Steps in Collecting DataJhengPantaleon
 
Solving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxSolving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxOH TEIK BIN
 
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptxContemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptxRoyAbrique
 
Crayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon ACrayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon AUnboundStockton
 
Science 7 - LAND and SEA BREEZE and its Characteristics
Science 7 - LAND and SEA BREEZE and its CharacteristicsScience 7 - LAND and SEA BREEZE and its Characteristics
Science 7 - LAND and SEA BREEZE and its CharacteristicsKarinaGenton
 
Micromeritics - Fundamental and Derived Properties of Powders
Micromeritics - Fundamental and Derived Properties of PowdersMicromeritics - Fundamental and Derived Properties of Powders
Micromeritics - Fundamental and Derived Properties of PowdersChitralekhaTherkar
 
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdfssuser54595a
 
APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAssociation for Project Management
 

Recently uploaded (20)

MENTAL STATUS EXAMINATION format.docx
MENTAL     STATUS EXAMINATION format.docxMENTAL     STATUS EXAMINATION format.docx
MENTAL STATUS EXAMINATION format.docx
 
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
 
PSYCHIATRIC History collection FORMAT.pptx
PSYCHIATRIC   History collection FORMAT.pptxPSYCHIATRIC   History collection FORMAT.pptx
PSYCHIATRIC History collection FORMAT.pptx
 
Concept of Vouching. B.Com(Hons) /B.Compdf
Concept of Vouching. B.Com(Hons) /B.CompdfConcept of Vouching. B.Com(Hons) /B.Compdf
Concept of Vouching. B.Com(Hons) /B.Compdf
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdf
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx

Machine Learning with Python & Vertica using VerticaPy

  • 1. Machine Learning with Python & Vertica using VerticaPy. Matteo Monaldi, Data Scientist. Bring Analytics to the next level.
  • 3. Data Science Lifecycle. Stages: Business Understanding, Data Mining & Cleaning, Data Exploration, Data Preparation, Machine Learning, Model Evaluation, Data Visualization, Model Deployment & Maintenance, End of Project. Decision points (each with True/False branches): Data Quality?, Enough Data?, Good Enough? (twice). Roles: Business Analyst, Data Engineer, Data Scientist. The diagram distinguishes a Data Visualization Process from a Data Science Process.
  • 5. Vertica Supports the Entire Data Science Process. Phases: Business Understanding; Data Analysis/Exploration; Data Preparation; Model Training; Model Evaluation; Deployment/Management. Capabilities listed on the slide: In-Database Scoring, Speed at Scale, Security; Statistical Summary, Sessionization, Pattern Matching, Date/Time Algebra, Window/Partition, Date Type Handling, Sequences, and more; Outlier Detection, Normalization, Imbalanced Data Processing, Sampling, Test/Validation Split, Time Series, Missing Value Imputation, and more; K-Means, Support Vector Machines, Logistic, Linear and Ridge Regression, Naïve Bayes, Cross Validation, XGBoost, Random Forests, and more; Model-level Stats, ROC Tables, Error Rate, Lift Table, Confusion Matrix, R-Squared, MSE; Filtering, Feature Selection, Correlation Matrices, Principal Component Analysis; Table-Like Management, Versioning, Authorization, and more; PMML Import/Export, TensorFlow Import.
  • 6. Advantages of Vertica In-Database Machine Learning. Diagram: Node 1, Node 2, ... Node N, each holding schemas, tables, and models, connected by the network. Advantages: eliminating the overhead of data transfer; data security and provenance; model storage and management; serving concurrent users; highly scalable ML functionality; avoiding the maintenance cost of a separate system.
  • 7. Sampling vs. full dataset. Large data results in better generalization. Downsampling: lack of generalization and over-fitting. Full dataset: generalization, data-driven. Also listed: Anomaly Detection, Moving Windows, Sessionization, Interpolations, Missing Value, Normalization, Supervised Learning.
  • 8. Vertica Model Management. Operations: view models, summarize models, get model attributes, alter models, drop models, manage model security.
      List models: SELECT * FROM models;
        schema_name | owner_name | model_name       | model_type          | model_size | create_date
        public      | dbadmin    | myKmeansModel    | kmeans              | 845        | 2020-07-14
        public      | dbadmin    | logisticRegModel | logistic regression | 780        | 2021-05-07
        public      | dbadmin    | linearRegModel   | linear regression   | 399        | 2021-02-04
      Summarize models: SELECT Summarize_Model('LogisticRegModel');
        predictor | coefficient | p_value
        Intercept |  2.548      | 0.0000
        x1        | -0.0345     | 0.0000
        x2        |  0.0048     | 0.0126
        x2        | -0.9769     | 0.0678
      Change model name, owner and schema: ALTER MODEL mymodel RENAME TO mykmeansmodel; ALTER MODEL mykmeansmodel OWNER TO user1; ALTER MODEL mykmeansmodel SET SCHEMA public;
      Delete models: DROP MODEL myLinearModel;
      Model security: GRANT USAGE ON MODEL myLinearModel TO user2;
      Get model attributes: SELECT GET_MODEL_ATTRIBUTES(USING PARAMETERS model_name = 'LogisticRegModel');
        ======= details =======
        predictor | coefficient | std_err | z_value  | p_value
        Intercept |  2.54895    | 0.39019 |  6.53263 | 0.00000
        age       | -0.03453    | 0.00563 | -6.13772 | 0.00000
        fare      |  0.00408    | 0.00174 |  2.34950 | 0.01880
        pclass    | -0.97692    | 0.11551 | -8.45772 | 0.00000
        ======= regularization =======
        type | lambda
        none | 1.00000
        ======= call_string =======
        logistic_reg('public.lr_titanic', 'titanic', '"survived"', '"age", "fare", "pclass"' USING PARAMETERS optimizer='newton', epsilon=1e-06, max_iterations=100, regularization='none', lambda=1, alpha=0.5)
        ======= Additional Info =======
        Name               | Value
        iteration_count    | 4
        rejected_row_count | 238
        accepted_row_count | 996
  • 9. Multiple Ways of Solving Data Science Challenges. Vertica offers distinctive value in the machine learning space: in-database machine learning, in-database data science in Python, user-defined functions in four programming languages (C++, R, Python, and Java), and integration with TensorFlow and PMML for model import and export. Entry points: SQL front end; VerticaPy (Python API); user-defined functions; TensorFlow import; PMML import and export.
  • 11. Bring Analytics to the next level with VerticaPy: Python front-end and SQL back-end ( https://www.vertica.com/python/ , https://github.com/vertica/VerticaPy ). VerticaPy provides in-database data science: Jupyter notebooks reach the Vertica database through a database connection. Themes: high security and open source. Key points: Jupyter/Python is a popular tool of choice for data scientists and analysts; much of the heavy computation is done by Vertica; model storage and management happen in Vertica; data stays in Vertica (security, integrity, scalability); users can contribute; no added software cost; constantly updated roadmap. Components shown: vDataFrame, vModel, aggregates, model building, dynamic data exploration, data preparation, model evaluation.
  • 12. VerticaPy In-Database Data Science. Example: summary statistics.
      Client code (Python):
        from verticapy import vDataFrame
        # Object creation
        vdf = vDataFrame("public.iris", dsn = "VerticaDSN")
        # Describe
        vdf.describe()
      Distributed execution in Vertica (SQL):
        SELECT SUMMARIZE_NUMCOL("SepalLengthCm", "PetalWidthCm", "PetalLengthCm", "SepalWidthCm") OVER () FROM "public"."iris"
      The output is returned to the client.
  • 13. VerticaPy allows easy data preparation and code deployment.
      In [1]:
        from verticapy.datasets import load_titanic
        titanic = load_titanic()
        # Doing some data preparation
        titanic.fillna()
        titanic["family_size"] = "parch + sibsp + 1"
        titanic.normalize()
        # Current vDataFrame relation
        display(titanic.current_relation())
      Resulting relation:
        SELECT ("fare" - 33.9637936739659) / (52.6247198802501) AS "fare", "sex", ("body" - 164.14406779661) / (29.7495704692386) AS "body", "pclass", ("age" - 30.1524573721163) / (12.9740056939445) AS "age", "name", "cabin", "parch", "survived", "boat", "ticket", "embarked", "home.dest", "sibsp", ("family_size" - 1.88249594813614) / (1.58407574155133) AS "family_size"
        FROM (SELECT COALESCE("fare", 33.9637936739659) AS "fare", "sex", COALESCE("body", 164.14406779661) AS "body", ("pclass" - 2.28444084278768) / (0.842485636190292) AS "pclass", COALESCE("age", 30.1524573721163) AS "age", "name", COALESCE("cabin", 'C23 C25 C27') AS "cabin", ("parch" - 0.378444084278768) / (0.868604707790392) AS "parch", ("survived" - 0.364667747163695) / (0.481532018641288) AS "survived", COALESCE("boat", '13') AS "boat", "ticket", COALESCE("embarked", 'S') AS "embarked", COALESCE("home.dest", 'New York, NY') AS "home.dest", ("sibsp" - 0.504051863857374) / (1.04111727241629) AS "sibsp", "parch" + "sibsp" + 1 AS "family_size"
              FROM "public"."titanic") VERTICAPY_SUBTABLE
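The relation on the slide above reads as column expressions layered over the base table: each preparation step rewrites an expression instead of moving data. The following is a minimal sketch of that idea in plain Python. The `LazyRelation` class and its methods are hypothetical illustrations, not the VerticaPy API, and the sketch composes expressions in place rather than nesting subqueries as the slide's SQL does.

```python
class LazyRelation:
    """Hypothetical stand-in for a lazily evaluated table (not the VerticaPy API)."""

    def __init__(self, table, columns):
        self.table = table
        # Map each output column name to the SQL expression that produces it.
        self.columns = {name: f'"{name}"' for name in columns}

    def fillna(self, name, value):
        # Replace NULLs with a constant, as COALESCE does in the slide's SQL.
        self.columns[name] = f"COALESCE({self.columns[name]}, {value})"
        return self

    def normalize(self, name, mean, std):
        # Z-score: subtract the mean, then divide by the standard deviation.
        self.columns[name] = f"({self.columns[name]} - {mean}) / ({std})"
        return self

    def current_relation(self):
        # Nothing has been executed so far; this only renders the SQL text.
        select_list = ", ".join(
            f'{expr} AS "{name}"' for name, expr in self.columns.items()
        )
        return f"SELECT {select_list} FROM {self.table}"


# Illustrative constants, rounded from the values on the slide.
rel = LazyRelation('"public"."titanic"', ["fare", "sex"])
rel.fillna("fare", 33.96).normalize("fare", 33.96, 52.62)
sql = rel.current_relation()
print(sql)
```

Because every step only edits a string, the rendered query can be handed to the database or a scheduler unchanged, which is the deployment benefit the slide points to.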
  • 14. VerticaPy is a complete statistical package. Correlations: Cramér's V, Point-Biserial, Kendall, Spearman, Pearson. Heteroscedasticity: Breusch-Pagan, Goldfeld-Quandt, White's Lagrange, Engle. Trend & stationarity: Augmented Dickey-Fuller, Mann-Kendall. Normality: Normaltest. Joins: regular joins, time series joins, spatial joins. Tree-based models: XGBoost, Random Forest. Linear models: Linear Regression, Logistic Regression, LinearSVC. Clustering: KMeans, Bisecting KMeans.
  • 15. Everything you need to visualize your data: comparison charts, distribution charts, relationship charts, trend charts, geospatial charts, animated charts, and proportion charts.
  • 16. Export your graphics in various formats. Chart menu options: Print Chart, Download PNG image, Download JPEG image, Download PDF document, Download SVG vector image. Build HTML page: exporting to HTML produces code that can be embedded independently in many GUIs, as long as they support JavaScript. Highcharts graphics are objects with numerous methods: class Highchart(builtins.object) provides buildcontainer(), buildhtml(), buildhtmlheader(), and save_file(filename = 'Chart').
      Example HTML output:
        '<!DOCTYPE html>\n<html lang="en">\n <head>\n <meta charset="utf-8" />\n <link href="https://www.highcharts.com/highslide/highslide.css" rel="stylesheet" />\n <script type="text/javascript" src="https://ajax.googleapis.com/ajax/libs/jquery/1.9.1/jquery.min.js"></script>\n <script type="text/javascript" src="https://code.highcharts.com/6/highcharts.js"></script>\n <script type="text/javascript" src="https://code.highcharts.com/6/highcharts-more.js"></script>\n <script type="text/javascript" src="https://code.highcharts.com/6/modules/heatmap.js"></script>\n <script type="text/javascript" src="https://code.highcharts.com/6/modules/exporting.js"></script>\n </head>\n <body style="margin:0;padding:0">\n <div id="container" style="width:600px;height:400px;">Loading....</div>\n\n\n <script>\n $(function(){\n\n\n\n\n\n Highcharts.setOptions({"global": {}, "lang": {}});\n var option = {"chart": {"renderTo": "container", "width": 600, "height": 400, "inverted": true}, "colors": ["#263133", "#FE5016", "#0073E7", "#19A26B", "#FCDB1F", "#000000", "#2A6A74", "#861889", "#00B4E0", "#90EE90", "#FF7F50", "#B03A89"], "credits": {"enabled": false}, "drilldown": {}, "exporting": {}, "labels": {}, "legend": {"enabled": false}, "loading": {}, "navigation": {}, "pane": {}, "plotOptions": {}, "series": {}, "subtitle": {}, "title": {"text": ""}, "tooltip": {"headerFormat": "", "pointFormat": "{point.y}"}, "xAxis": {"reversed": false, "title": {"enabled": true, "text": "Contract"}, "maxPadding": 0.05, "showLastLabel": true, "categories": ["Month-to-month", "One year", "Two year"]}, "yAxis": {"title": {"text": "Churn_Rate (%)"}}};\n\n\n \n\n var chart = new Highcharts.Chart(option);\n\n var data = [{"data": [42.7096774193548, 11.2695179904956, 2.83185840707965], "type": "bar", "name": "Churn_Rate (%)", "colorByPoint": true}];\n var dataLen = data.length;\n for (var ix = 0; ix < dataLen; ix++) {\n chart.addSeries(data[ix]);\n }\n \n\n\n\n\n\n\n\n \n });\n </script>\n\n </body>\n</html>'
  • 17. VerticaPy Delphi: Automated Machine Learning. Models compared: Logistic Regression, XGBoost, Random Forest, SVM, Naive Bayes. Chart labels: Efficient, Modest, Efficient & Performant, Performant; score, time, std [precision]. Auto Data Preparation: uses one-hot encoding, label encoding, missing-value imputation, and other data preparation techniques to preprocess the data. Auto Grid Search CV: tests many parameter combinations across different parameter grids and finds an optimal one. Auto Variable Selection: once the algorithm has been selected, uses the stepwise algorithm to find a set of features.
  • 19. Supported Spatial Objects. GEOMETRY (x, y): a spatial object with coordinates expressed as (x, y) pairs, defined in the Cartesian plane. All calculations use Cartesian coordinates. Because it is a projection onto the Cartesian plane, computations are simpler than with the GEOGRAPHY type. Use this type when you can! GEOGRAPHY (longitude, latitude): a spatial object defined on the surface of a perfect sphere, or a spatial object in the WGS84 coordinate system. Coordinates are expressed as longitude/latitude angular values, measured in degrees. All calculations are in metres. The maximum size of a GEOMETRY or GEOGRAPHY data type is 10 MB. You cannot modify the size or data type of a GEOMETRY or GEOGRAPHY column after creation.
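The practical difference between the two types shows up in a distance calculation. The sketch below is plain Python rather than Vertica code: it computes a planar distance on raw (x, y) pairs and a great-circle distance on (longitude, latitude) pairs. The haversine formula, the mean Earth radius, and the city coordinates are assumptions chosen for illustration; Vertica's own GEOGRAPHY computations may use a different model.

```python
import math

EARTH_RADIUS_M = 6_371_000  # assumed mean Earth radius in metres (spherical model)


def cartesian_distance(p, q):
    """Planar distance between two (x, y) points, in the units of the inputs."""
    return math.hypot(q[0] - p[0], q[1] - p[1])


def great_circle_distance(p, q):
    """Haversine distance in metres between two (longitude, latitude) points in degrees."""
    lon1, lat1, lon2, lat2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (
        math.sin((lat2 - lat1) / 2) ** 2
        + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


# Approximate (longitude, latitude) coordinates, for illustration only.
paris = (2.3522, 48.8566)
london = (-0.1276, 51.5072)

planar = cartesian_distance(paris, london)        # a number in degrees, not metres
spherical = great_circle_distance(paris, london)  # metres along the sphere
print(f"planar: {planar:.3f} degrees, spherical: {spherical / 1000:.0f} km")
```

The planar result is cheap to compute but only meaningful in a projected coordinate system; the spherical result is in metres, at the cost of trigonometric work per pair of points.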
  • 20. Spatial Joins.
      In [1]:
        %load_ext verticapy.sql
      In [2] (VerticaPy SQL Magic):
        %%sql
        SELECT STV_Create_Index(id, geometry USING PARAMETERS index = 'world_polygons', overwrite = true, max_mem_mb = 256) OVER() FROM pols;
        SELECT STV_Intersect(lat, lon USING PARAMETERS index = 'world_polygons');
      In [3] (VerticaPy vDataFrame):
        from verticapy import vDataFrame
        from verticapy.geo import *
        from verticapy.datasets import load_world, load_cities
        cities = vDataFrame("map.cities")
        world = vDataFrame("map.world")
        # Creating the index
        create_index(world, "id", "geometry", "world_polygons", True)
        # Computing the intersections
        intersect(cities, "world_polygons", "id", x = "lat", y = "lon")
      Intersection result: point_id 1, 2, 3, 4, 5, 6; polygon_id 3, 3, 3, 3, 2, 1.
  • 22. Challenges with Irregular Time Series. Example series (time, value): 09:00 5; 09:20 6; 09:26 3.5; 09:41 0; 16:32 3; 16:40 6; 17:00 5; 17:19 1. Highlighted gaps: Δt = 8 min and Δt = 6 h 51 min. Inconsistent models: time series models are usually auto-regressive, so an irregular series leads to wrongly defined variables. Wrong aggregations: computing aggregations becomes difficult because the resulting variables may be inconsistent. Huge gaps: irregular time series can contain huge gaps between records, which can change the definition of a variable. Difficult joins: joins become difficult because they need an exact match. In Vertica, time series joins address this problem.
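The irregularity is easy to quantify before any modelling. A minimal sketch in plain Python, using the timestamps from the slide and assuming they all fall on the same day, measures the gap between consecutive records:

```python
from datetime import datetime, timedelta

# Timestamps from the slide, assumed to belong to a single day.
stamps = ["09:00", "09:20", "09:26", "09:41", "16:32", "16:40", "17:00", "17:19"]
times = [datetime.strptime(s, "%H:%M") for s in stamps]

# Gap between each record and the one that follows it.
gaps = [later - earlier for earlier, later in zip(times, times[1:])]

largest = max(gaps)
smallest = min(gaps)
print("largest gap:", largest, "| smallest gap:", smallest)
```

An auto-regressive feature such as "previous value" therefore spans anything from a few minutes to almost seven hours in this series, which is the inconsistency the slide warns about.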
  • 23. Challenges with Irregular Time Series. Irregular series (time, value): 09:00 5; 09:20 6; 09:26 3.5; 09:41 0; 16:32 3; 16:40 6; 17:00 5; 17:19 1. Regular series (time, value): 09:00 10; 09:30 11; 10:00 12; 10:30 12.5; 11:00 13; 11:30 13.5; 12:00 14; 12:30 14.5. Because most timestamps do not match, the two data sources cannot be joined with an exact-match join.
  • 24. Solution: Time Series Join (joining to the closest timestamp).
      VerticaPy vDataFrame:
        from verticapy import vDataFrame
        sm_weather = vDataFrame("sm.weather")
        sm_consumption = vDataFrame("sm.consumption")
        sm_weather.join(sm_consumption, how = "left", on_interpolate = {"dateUTC": "dateUTC"}, expr2 = ["temperature", "humidity"])
      VerticaPy SQL Magic:
        SELECT * FROM "sm"."weather" AS x LEFT JOIN "sm"."consumption" AS y ON x."dateUTC" INTERPOLATE PREVIOUS VALUE y."dateUTC"
      First series (time, value): 09:00 5; 09:20 6; 09:26 3.5; 09:41 0; 16:32 3. Second series (time, value): 09:00 10; 09:30 11; 10:00 12; 10:30 12.5; 11:00 13; a further 16:30 timestamp is shown without a value.
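To make the matching rule concrete, here is a small sketch in plain Python of a left join that pairs each row with the most recent row of the other series at or before its timestamp. It follows the PREVIOUS VALUE wording of the SQL on the slide and is a simplification, not a reimplementation of Vertica's event series join; the function names and the choice of which series sits on the left are assumptions for illustration.

```python
from bisect import bisect_right


def to_minutes(hhmm):
    """Convert an 'HH:MM' string to minutes since midnight."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def join_previous(left, right):
    """Left-join each (time, value) row to the latest `right` row with time <= left time.

    Both inputs are lists of (minutes_since_midnight, value) sorted by time.
    """
    right_times = [t for t, _ in right]
    joined = []
    for ts, value in left:
        i = bisect_right(right_times, ts)       # number of right rows at or before ts
        match = right[i - 1][1] if i else None  # None stands in for NULL when nothing precedes
        joined.append((ts, value, match))
    return joined


# Sample values taken from the slide's two series.
irregular = [(to_minutes(t), v) for t, v in
             [("09:00", 5), ("09:20", 6), ("09:26", 3.5), ("09:41", 0), ("16:32", 3)]]
regular = [(to_minutes(t), v) for t, v in
           [("09:00", 10), ("09:30", 11), ("10:00", 12), ("10:30", 12.5), ("11:00", 13)]]

joined = join_previous(irregular, regular)
for row in joined:
    print(row)
```

No row is dropped for lack of an exact timestamp match, which is the property an equality join cannot provide on irregular data.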
  • 25. Solution: Time Series Slicing & Interpolation.
      VerticaPy vDataFrame:
        from verticapy import vDataFrame
        # Creating the vDataFrame
        sm_consumption = vDataFrame("sm.consumption")
        # Time series slicing & interpolation
        sm_consumption.asfreq(ts = "dateUTC", rule = "30 minutes", method = {"value": "linear"}, by = ["meterID"])
      VerticaPy SQL Magic:
        SELECT slice_time AS "dateUTC", "meterID", TS_FIRST_VALUE("value", 'linear') AS "value" FROM "sm"."consumption" TIMESERIES slice_time AS '30 minutes' OVER (PARTITION BY "meterID" ORDER BY "dateUTC")
      Input (time, value): 09:00 5; 09:20 6; 09:26 3.5; 09:41 0; 16:32 3. Output (time, value): 09:00 5; 09:30 2.56666667; 10:00 0.13868613; 10:30 0.35766424; 11:00 0.576642336.
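The output values on the slide can be checked by hand with ordinary linear interpolation between the two records that surround each 30-minute slice. The sketch below does this in plain Python; the function name and the minutes-since-midnight encoding are illustrative choices, and the code only covers grid points that lie between the first and last record.

```python
def interpolate_linear(points, step, start, end):
    """Linearly interpolate sorted (minute, value) points onto a grid of `step` minutes."""
    grid = []
    j = 0
    for t in range(start, end + 1, step):
        # Advance to the segment [points[j], points[j + 1]] that contains t.
        while j + 2 < len(points) and points[j + 1][0] <= t:
            j += 1
        (t0, v0), (t1, v1) = points[j], points[j + 1]
        grid.append((t, v0 + (v1 - v0) * (t - t0) / (t1 - t0)))
    return grid


# Records from the slide, as (minutes since midnight, value).
records = [(540, 5), (560, 6), (566, 3.5), (581, 0), (992, 3)]

# 30-minute slices from 09:00 (540) to 11:00 (660).
resampled = interpolate_linear(records, step=30, start=540, end=660)
for minute, value in resampled:
    print(f"{minute // 60:02d}:{minute % 60:02d} {value:.8f}")
```

For example, the 09:30 slice falls between 09:26 (3.5) and 09:41 (0), so its value is 3.5 - 3.5 * 4 / 15, which agrees with the 2.56666667 shown on the slide.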
  • 27. Deploy your Pipeline in Vertica. Each option below is plain SQL, so it can be run from a SQL script or a scheduler.
      Vertica built-in SQL:
        SELECT "age", "tenure", PREDICT_LOGISTIC_REG("age", "tenure" USING PARAMETERS model_name = 'telco.lr') AS "churn_score" FROM (SELECT APPLY_NORMALIZE("age", "tenure" USING PARAMETERS model_name = 'telco.normalizer') FROM "telco"."customers") x
      TensorFlow integration:
        SELECT PREDICT_TENSORFLOW("customer_id", "age", "tenure" USING PARAMETERS model_name = 'telco.lr', num_passthru_cols = 1) OVER(PARTITION BEST) FROM "telco"."customers";
      PMML integration:
        SELECT "age", "tenure", PREDICT_PMML("age", "tenure" USING PARAMETERS model_name = 'telco.lr') AS "churn_score" FROM "telco"."customers";
      Standard SQL:
        SELECT "age", "tenure", 1 / (1 + EXP(-(0.5 * "age" - 0.3 * "tenure" - 0.9))) AS "churn_score" FROM (SELECT ("age" - 13) / 80 AS "age", ("tenure" - 5) / 48 AS "tenure" FROM "telco"."customers") x
      H2O integration:
        SELECT "age", "tenure", H2OModelScore("age", "tenure" USING PARAMETERS model_name = 'telco.lr') AS "churn_score" FROM "telco"."customers";
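The Standard SQL variant is the easiest one to verify outside the database, because it is just a scaling step followed by the logistic function. A minimal Python sketch of the same arithmetic, using the constants from the slide (the function name and the sample customer are illustrative assumptions):

```python
import math


def churn_score(age, tenure):
    """Score one customer with the constants from the slide's Standard SQL example."""
    # Inner query: rescale the raw features.
    age_n = (age - 13) / 80
    tenure_n = (tenure - 5) / 48
    # Outer query: logistic function of the linear combination.
    z = 0.5 * age_n - 0.3 * tenure_n - 0.9
    return 1 / (1 + math.exp(-z))


# Hypothetical customer: both rescaled features equal 0.5, so z is about -0.8.
score = churn_score(age=53, tenure=29)
print(round(score, 4))
```

Comparing a few such hand-computed scores with the SQL output is a cheap sanity check before scheduling the query.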
  • 29. Bring your own lambda function (𝝀). Python UDF generation: a Python function is converted into a UDF on the Vertica cluster. The UDF processes the data in blocks (Block 1 to Block 4 in the diagram), giving scalable in-database computations; it is stored on all of the cluster's nodes; and the implementation remains flexible Python. Did you know? UDFs are also available in Java, C++, and R.
  • 30. The Python API for Vertica: data science at scale. Machine learning • big data • time series • dynamic charts. New videos every Wednesday; subscribe to our channel. GitHub: www.github.com/vertica/VerticaPy Website: www.vertica.com/python/ LinkedIn: www.linkedin.com/company/verticapy/