3. Data Science Lifecycle
[Figure: flowchart of the data science lifecycle. Business Understanding feeds Data Mining & Cleaning, gated by "Enough Data?" and "Data Quality?" checks; the flow continues through Data Exploration, Data Preparation, Machine Learning, and Model Evaluation, each with a "Good Enough?" check, then Data Visualization and Model Deployment & Maintenance before End of Project. Failed (False) checks loop back to earlier stages; True branches move forward. Stages are annotated by role (Business Analyst, Data Engineer, Data Scientist) and split between the Data Visualization Process and the Data Science Process.]
5. Vertica Supports the Entire Data Science Process
Vertica provides in-database functions for every lifecycle stage (Business Understanding, Data Analysis/Exploration, Data Preparation, Model Training, Model Evaluation, Deployment/Management), with in-database scoring, speed at scale, and security throughout:

- Data Analysis/Exploration: Statistical Summary, Sessionization, Pattern Matching, Date/Time Algebra, Window/Partition, Data Type Handling, Sequences, and more…
- Data Preparation: Outlier Detection, Normalization, Imbalanced Data Processing, Sampling, Test/Validation Split, Time Series, Missing Value Imputation, Filtering, Feature Selection, Correlation Matrices, and more…
- Model Training: K-Means, Support Vector Machines, Logistic/Linear/Ridge Regression, Naïve Bayes, Random Forests, XGBoost, Principal Component Analysis, Cross Validation, and more…
- Model Evaluation: Model-level Stats, ROC Tables, Error Rate, Lift Table, Confusion Matrix, R-Squared, MSE
- Deployment/Management: In-Database Scoring, Table-Like Management, Versioning, Authorization, PMML Import/Export, TensorFlow Import, and more…
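The evaluation metrics listed above (confusion matrix, error rate, MSE) are computed in-database by Vertica; as a rough local illustration of what they measure, not Vertica code, here is a minimal pure-Python sketch:

```python
# Illustration only: Vertica evaluates models in-database;
# this sketch just shows the math behind three of the metrics.

def confusion_matrix(y_true, y_pred):
    """2x2 counts for a binary classifier: [[TN, FP], [FN, TP]]."""
    m = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        m[t][p] += 1
    return m

def error_rate(y_true, y_pred):
    """Fraction of misclassified rows."""
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    """Mean squared error for a regressor."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]
print(confusion_matrix(y_true, y_pred))  # [[1, 1], [1, 2]]
print(error_rate(y_true, y_pred))        # 0.4
```

In Vertica the same quantities come back as result sets of SQL evaluation functions, so they compose with the rest of a query.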
6. Advantages of Vertica In-Database Machine Learning

[Figure: a Vertica cluster of N nodes connected over the network, each node holding schemas, tables, and models.]

- Eliminates the overhead of data transfer
- Data security and provenance
- Model storage and management
- Serves concurrent users
- Highly scalable ML functionality
- Avoids the maintenance cost of a separate system
7. Sampling vs. Full Dataset

Large data results in better generalization:
- Downsampling: lack of generalization and over-fitting.
- Full dataset: generalization and data-driven results.

This applies across techniques such as Anomaly Detection, Moving Windows, Sessionization, Interpolation, Missing Value Imputation, Normalization, and Supervised Learning.
8. Vertica Model Management

List models:
SELECT * FROM models;

Change model name, owner and schema:
ALTER MODEL mymodel RENAME TO mykmeansmodel;
ALTER MODEL mykmeansmodel OWNER TO user1;
ALTER MODEL mykmeansmodel SET SCHEMA public;

Delete models:
DROP MODEL myLinearModel;

Summarize models:
SELECT SUMMARIZE_MODEL('LogisticRegModel');
Example output of SELECT * FROM models:

 schema_name | owner_name | model_name       | model_type          | model_size | create_date
-------------+------------+------------------+---------------------+------------+-------------
 public      | dbadmin    | myKmeansModel    | kmeans              |        845 | 2020-07-14
 public      | dbadmin    | logisticRegModel | logistic regression |        780 | 2021-05-07
 public      | dbadmin    | linearRegModel   | linear regression   |        399 | 2021-02-04

Example model attributes:

 predictor | coefficient | p_value
-----------+-------------+---------
 Intercept |      2.548  |  0.0000
 x1        |     -0.0345 |  0.0000
 x2        |      0.0048 |  0.0126
Model management operations: view models, alter models, summarize models, drop models, get model attributes, and manage model security.

Get model attributes:
SELECT GET_MODEL_ATTRIBUTES(USING PARAMETERS
                            model_name = 'LogisticRegModel');

=======
details
=======
 predictor | coefficient | std_err | z_value  | p_value
-----------+-------------+---------+----------+---------
 Intercept |     2.54895 | 0.39019 |  6.53263 | 0.00000
 age       |    -0.03453 | 0.00563 | -6.13772 | 0.00000
 fare      |     0.00408 | 0.00174 |  2.34950 | 0.01880
 pclass    |    -0.97692 | 0.11551 | -8.45772 | 0.00000

==============
regularization
==============
 type | lambda
------+---------
 none | 1.00000

===========
call_string
===========
logistic_reg('public.lr_titanic', 'titanic', '"survived"', '"age", "fare", "pclass"'
USING PARAMETERS optimizer='newton', epsilon=1e-06, max_iterations=100,
regularization='none', lambda=1, alpha=0.5)

===============
Additional Info
===============
        Name       | Value
-------------------+-------
 iteration_count   |     4
 rejected_row_count|   238
 accepted_row_count|   996

Model security:
GRANT USAGE ON MODEL myLinearModel TO user2;
9. Multiple Ways of Solving Data Science Challenges
Vertica has a unique value in the machine learning space: in-database machine learning, in-database data science in Python, user-defined functions in four programming languages (C++, R, Python & Java), and integration with TensorFlow & PMML for model import and export.
- SQL front end
- VerticaPy – Python API
- User-defined functions
- TensorFlow import
- PMML import & export
11. Bring Analytics to the next level with VerticaPy
Python Front-end & SQL Back-end
https://www.vertica.com/python/
https://github.com/vertica/VerticaPy

- Jupyter/Python: the popular tool of choice for data scientists and analysts.
- In-database data science: much of the heavy computation is done by Vertica, and models are stored and managed in Vertica.
- High security: data stays in Vertica (security, integrity, scalability).
- Open source: users can contribute, there is no added software cost, and the roadmap is constantly updated.

The two core objects are the vDataFrame (dynamic data exploration, data preparation, aggregates) and the vModel (model building and evaluation); both connect Jupyter notebooks to the Vertica database.
12. VerticaPy: In-Database Data Science

Example – Summary Statistics. The client issues a one-line describe() call; Vertica runs the equivalent SQL with distributed execution across the cluster.

from verticapy import vDataFrame
# Object creation
vdf = vDataFrame("public.iris", dsn = "VerticaDSN")
# Describe
vdf.describe()

Equivalent SQL executed by Vertica:

SELECT
    SUMMARIZE_NUMCOL("SepalLengthCm",
                     "PetalWidthCm",
                     "PetalLengthCm",
                     "SepalWidthCm")
    OVER ()
FROM "public"."iris"
13. VerticaPy allows easy Data Preparation & Code Deployment
from verticapy.datasets import load_titanic
titanic = load_titanic()
# Doing some Data Preparation
titanic.fillna()
titanic["family_size"] = "parch + sibsp + 1"
titanic.normalize()
# Current vDataFrame relation
display(titanic.current_relation())
Output – the generated SQL relation:

SELECT
("fare" - 33.9637936739659) / (52.6247198802501) AS "fare",
"sex",
("body" - 164.14406779661) / (29.7495704692386) AS "body",
"pclass",
("age" - 30.1524573721163) / (12.9740056939445) AS "age",
"name",
"cabin",
"parch",
"survived",
"boat",
"ticket",
"embarked",
"home.dest",
"sibsp",
("family_size" - 1.88249594813614) / (1.58407574155133) AS "family_size"
FROM
(SELECT
COALESCE("fare", 33.9637936739659) AS "fare",
"sex",
COALESCE("body", 164.14406779661) AS "body",
("pclass" - 2.28444084278768) / (0.842485636190292) AS "pclass",
COALESCE("age", 30.1524573721163) AS "age",
"name",
COALESCE("cabin", 'C23 C25 C27') AS "cabin",
("parch" - 0.378444084278768) / (0.868604707790392) AS "parch",
("survived" - 0.364667747163695) / (0.481532018641288) AS "survived",
COALESCE("boat", '13') AS "boat",
"ticket",
COALESCE("embarked", 'S') AS "embarked",
COALESCE("home.dest", 'New York, NY') AS "home.dest",
("sibsp" - 0.504051863857374) / (1.04111727241629) AS "sibsp",
"parch" + "sibsp" + 1 AS "family_size"
FROM "public"."titanic") VERTICAPY_SUBTABLE
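The generated SQL above first imputes missing values with COALESCE(col, mean) and then normalizes with (col - mean) / std, all pushed down to the database. A minimal local sketch of that two-step transform (illustration only; VerticaPy compiles it to SQL, and the choice of sample standard deviation here is an assumption):

```python
# Sketch of what fillna() followed by normalize() compute for one
# numeric column: mean imputation, then z-score normalization.
from statistics import mean, stdev

def fillna_then_zscore(values):
    """Impute None with the column mean, then z-score normalize."""
    present = [v for v in values if v is not None]
    mu = mean(present)
    filled = [mu if v is None else v for v in values]
    sigma = stdev(filled)  # sample standard deviation (assumption)
    return [(v - mu) / sigma for v in filled]

col = [10.0, None, 30.0]        # mean of the present values = 20.0
print(fillna_then_zscore(col))  # [-1.0, 0.0, 1.0]
```

Doing this in SQL keeps the full dataset in the database: no rows are moved to the client to compute the statistics.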
14. VerticaPy is a complete statistical software

- Correlations: Cramér's V, Point-Biserial, Kendall, Spearman, Pearson
- Heteroscedasticity: Breusch-Pagan, Goldfeld-Quandt, White's Lagrange, Engle
- Trend & Stationarity: Augmented Dickey-Fuller, Mann-Kendall
- Normality: Normaltest
- Joins: Regular Joins, Time Series Joins, Spatial Joins
- Tree-Based Models: XGBoost, Random Forest
- Linear Models: Linear Regression, Logistic Regression, LinearSVC
- Clustering: KMeans, Bisecting KMeans
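Among the correlation measures listed above, Pearson is the simplest; as a local illustration of what the in-database version computes (VerticaPy pushes this to SQL), a minimal implementation:

```python
# Illustration of the Pearson correlation coefficient:
# covariance of x and y divided by the product of their norms.
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))   # 1.0
print(pearson([1, 2, 3, 4], [8, 6, 4, 2]))   # -1.0
```

Rank-based measures such as Spearman apply the same formula to the ranks of the values instead of the values themselves.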
15. Everything you need to visualize your data
Chart families: Comparison, Distribution, Relationship, Trend, Geospatial, Animated, and Proportion charts.
16. Export your Graphics in various formats
Print Chart
Download PNG image
Download JPEG image
Download PDF document
Download SVG vector image
[Output: the chart serialized as a standalone HTML page. The document loads jQuery and the Highcharts modules (highcharts-more, heatmap, exporting), sets the chart options and color palette, and renders the data (here, a churn-rate bar chart by contract type) into a 600x400 container.]
Build HTML page
Exporting to HTML yields code that can be embedded independently in any GUI that supports JavaScript.
# Highcharts graphics are objects with numerous methods
class Highchart(builtins.object)
    buildcontainer()
    buildhtml()
    buildhtmlheader()
    save_file(filename = 'Chart')
17. VerticaPy Delphi: Automated Machine Learning
[Figure: candidate algorithms (Logistic Regression, XGBoost, Random Forest, SVM, Naive Bayes) plotted by score vs. training time, with the standard deviation of precision; quadrants range from "modest" to "efficient & performant".]
Auto Data Preparation
Uses one-hot encoding, label encoding, missing-value imputation, and other data preparation techniques to preprocess the data.
Auto Grid Search CV
Tests many parameter combinations from different grids and finds an optimal one.
Auto Variable Selection
After selecting the algorithm, it uses the stepwise algorithm to find a good set of features.
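The grid-search step above can be sketched as an exhaustive search over a parameter grid, scoring every combination and keeping the best. This is an illustration of the idea, not Delphi's implementation; the parameter names and the toy scoring function are hypothetical:

```python
# Exhaustive grid search: evaluate every combination, keep the best.
from itertools import product

def grid_search(score_fn, grid):
    """Try each parameter combination; return the best one and its score."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for combo in product(*(grid[n] for n in names)):
        params = dict(zip(names, combo))
        s = score_fn(params)
        if s > best_score:
            best_params, best_score = params, s
    return best_params, best_score

# Toy score peaking at max_depth=5, n_estimators=100 (hypothetical names);
# in practice this would be a cross-validation score of a fitted model.
score = lambda p: -abs(p["max_depth"] - 5) - abs(p["n_estimators"] - 100) / 100
best, s = grid_search(score, {"max_depth": [3, 5, 7],
                              "n_estimators": [50, 100, 200]})
print(best)   # {'max_depth': 5, 'n_estimators': 100}
```

In Delphi each candidate combination is scored with cross-validation inside the database, so the search cost scales with the cluster.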
19. Supported Spatial Objects
GEOMETRY (x, y)
Spatial object with coordinates expressed as (x, y) pairs, defined in the Cartesian plane. All calculations use Cartesian coordinates. Because it is a projection onto the Cartesian plane, computations are simpler than with the GEOGRAPHY type. Use this type when you can!
GEOGRAPHY (longitude, latitude)
Spatial object defined on the surface of a perfect sphere, that is, a spatial object in the WGS84 coordinate system. Coordinates are expressed as longitude/latitude angular values, measured in degrees. All calculations are in metres.
The maximum size of a GEOMETRY or GEOGRAPHY data type is 10 MB. You cannot modify the size or data type of a GEOMETRY or GEOGRAPHY column after creation.
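The practical difference between the two types shows up in distance computations: GEOMETRY measures a straight line in the plane, while GEOGRAPHY measures along the sphere in metres. A rough local sketch (assuming a spherical Earth of radius 6,371 km; Vertica's own geodesic model may differ):

```python
# GEOMETRY-style vs. GEOGRAPHY-style distance, side by side.
from math import radians, sin, cos, asin, sqrt, hypot

def planar_distance(x1, y1, x2, y2):
    """GEOMETRY-style distance: straight line in the Cartesian plane."""
    return hypot(x2 - x1, y2 - y1)

def haversine_m(lon1, lat1, lon2, lat2, radius_m=6_371_000):
    """GEOGRAPHY-style distance: great-circle metres on a sphere."""
    lon1, lat1, lon2, lat2 = map(radians, (lon1, lat1, lon2, lat2))
    a = sin((lat2 - lat1) / 2) ** 2 \
        + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * radius_m * asin(sqrt(a))

print(planar_distance(0, 0, 3, 4))     # 5.0
print(round(haversine_m(0, 0, 1, 0)))  # one degree of longitude at the equator
```

The planar version is cheaper, which is why the slide recommends GEOMETRY whenever a planar projection is acceptable.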
20. Spatial Joins

VerticaPy vDataFrame:

In [1]:
from verticapy.geo import *
from verticapy.datasets import load_world, load_cities
cities = vDataFrame("map.cities")
world = vDataFrame("map.world")
# Creating the index
create_index(world, "id", "geometry", "world_polygons", True)
# Computing the intersections
intersect(cities, "world_polygons", "id", x="lat", y="lon")

VerticaPy SQL Magic:

In [2]:
%load_ext verticapy.sql

In [3]:
%%sql
SELECT STV_Create_Index(id, geometry
                        USING PARAMETERS
                        index = 'world_polygons',
                        overwrite = true,
                        max_mem_mb = 256) OVER() FROM pols;
SELECT STV_Intersect(lat, lon
                     USING PARAMETERS index = 'world_polygons');

Result – each point is paired with the polygon that contains it:

 point_id | polygon_id
----------+------------
        1 |          3
        2 |          3
        3 |          3
        4 |          3
        5 |          2
        6 |          1
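STV_Create_Index and STV_Intersect pair each point with its containing polygon. The core containment test behind such a join can be sketched with ray casting (an illustration only; Vertica's spatial index does far more than this naive loop):

```python
# Ray-casting point-in-polygon test: a point is inside if a ray going
# right from it crosses the polygon boundary an odd number of times.

def point_in_polygon(x, y, polygon):
    """polygon is a list of (x, y) vertices in order."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through y?
        if (y1 > y) != (y2 > y):
            # x-coordinate where the edge crosses that line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

square = [(0, 0), (4, 0), (4, 4), (0, 4)]
print(point_in_polygon(2, 2, square))   # True
print(point_in_polygon(5, 2, square))   # False
```

An index avoids running this test against every polygon: candidate polygons are pruned first, which is what makes the in-database join scale.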
22. Challenges with Irregular Time Series

[Figure: an irregular series with readings at 09:00 (5), 09:20 (6), 09:26 (3.5), 09:41 (0), 16:32 (3), 16:40 (6), 17:00 (5), 17:19 (1); the gaps between records range from Δt = 8 min to Δt = 6 h 51 min.]

Inconsistent Models
Time series models are most of the time auto-regressive. An irregular series leads to wrongly defined lag variables.

Wrong Aggregations
Computing aggregations may become difficult because the variables are inconsistent across time windows.

Huge Gaps
An irregular time series may contain huge gaps between records, which can change the meaning of the variables.

Difficult Joins
Joins may become difficult because they need an exact timestamp match. In Vertica, time series joins counter this problem.
23. Challenges with Irregular Time Series

An irregular series (left) cannot be joined directly to a regular 30-minute series (right): the timestamps almost never match exactly.

 time  | value        time  | value
-------+-------      -------+-------
 09:00 | 5            09:00 | 10
 09:20 | 6            09:30 | 11
 09:26 | 3.5          10:00 | 12
 09:41 | 0            10:30 | 12.5
 16:32 | 3            11:00 | 13
 16:40 | 6            11:30 | 13.5
 17:00 | 5            12:00 | 14
 17:19 | 1            12:30 | 14.5
24. Solution – Time Series Join

Joining to the closest timestamp: each record of the left-hand series is matched with the right-hand record at the closest previous timestamp.

VerticaPy vDataFrame:

from verticapy import vDataFrame
sm_weather = vDataFrame("sm.weather")
sm_consumption = vDataFrame("sm.consumption")
sm_weather.join(sm_consumption,
                how = "left",
                on_interpolate = {"dateUTC": "dateUTC"},
                expr2 = ["temperature", "humidity"])

VerticaPy SQL Magic:

SELECT * FROM
    "sm"."weather" AS x
LEFT JOIN
    "sm"."consumption" AS y
ON x."dateUTC" INTERPOLATE PREVIOUS VALUE y."dateUTC"

[Figure: arrows join each irregular reading (09:00, 09:20, 09:26, 09:41, 16:32) to the regular series value at the closest previous timestamp (09:00, 09:30, 10:00, 10:30, 11:00, 16:30).]
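INTERPOLATE PREVIOUS VALUE matches each left-hand timestamp with the latest right-hand timestamp that is not after it. A local sketch of that join semantics (illustration only, not Vertica's executor; timestamps are zero-padded strings so they sort correctly):

```python
# Closest-previous-timestamp join, the semantics of
# INTERPOLATE PREVIOUS VALUE, using binary search.
from bisect import bisect_right

def interpolate_previous_join(left_ts, right):
    """For each left timestamp, pick the right-hand value with the
    closest earlier-or-equal timestamp (None if there is none)."""
    right_ts = [t for t, _ in right]          # assumed sorted ascending
    out = []
    for t in left_ts:
        i = bisect_right(right_ts, t) - 1
        out.append((t, right[i][1] if i >= 0 else None))
    return out

weather = [("09:00", 10), ("09:30", 11), ("10:00", 12)]
consumption_ts = ["09:00", "09:20", "09:41", "10:05"]
print(interpolate_previous_join(consumption_ts, weather))
# [('09:00', 10), ('09:20', 10), ('09:41', 11), ('10:05', 12)]
```

Unlike an equi-join, every left-hand row finds a partner as long as some earlier right-hand record exists, which is exactly what the irregular-series example needs.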
25. Solution – Time Series Slicing & Interpolation

VerticaPy vDataFrame:

from verticapy import vDataFrame
# Creating the vDataFrame
sm_consumption = vDataFrame("sm.consumption")
# Time series slicing & interpolation
sm_consumption.asfreq(ts = "dateUTC",
                      rule = "30 minutes",
                      method = {"value": "linear"},
                      by = ["meterID"])

VerticaPy SQL Magic:

SELECT
    slice_time AS "dateUTC",
    "meterID",
    TS_FIRST_VALUE("value", 'linear') AS "value"
FROM "sm"."consumption"
TIMESERIES slice_time AS '30 minutes'
OVER (PARTITION BY "meterID" ORDER BY "dateUTC")

Input (irregular) and output (30-minute slices, linear interpolation):

 time  | value        time  | value
-------+-------      -------+------------
 09:00 | 5            09:00 | 5
 09:20 | 6            09:30 | 2.56666667
 09:26 | 3.5          10:00 | 0.13868613
 09:41 | 0            10:30 | 0.35766424
 16:32 | 3            11:00 | 0.57664234
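TS_FIRST_VALUE(..., 'linear') linearly interpolates the series at each slice boundary. A local sketch that reproduces the output rows above (times encoded as minutes since midnight; illustration only, Vertica does this in the TIMESERIES clause):

```python
# Linear interpolation of an irregular series at fixed slice times.

def interpolate_at(ts, values, t):
    """Linearly interpolate (ts, values) at time t; t must lie
    between the first and last sample."""
    for i in range(len(ts) - 1):
        if ts[i] <= t <= ts[i + 1]:
            frac = (t - ts[i]) / (ts[i + 1] - ts[i])
            return values[i] + frac * (values[i + 1] - values[i])
    raise ValueError("t outside the sampled range")

# Slide data: 09:00->5, 09:20->6, 09:26->3.5, 09:41->0, 16:32->3
ts = [540, 560, 566, 581, 992]        # minutes since midnight
vals = [5, 6, 3.5, 0, 3]
for t in (540, 570, 600, 630, 660):   # 09:00, 09:30, ..., 11:00
    print(t, round(interpolate_at(ts, vals, t), 6))
# 540 5.0 / 570 2.566667 / 600 0.138686 / 630 0.357664 / 660 0.576642
```

The values match the slide's output table: the long 09:41-to-16:32 gap is why the 10:00 to 11:00 slices sit so close to zero.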
29. Bring your own Lambda function
Python Front-end & SQL Back-end

Did you know? UDFs are also available in Java, C++ & R.

VerticaPy turns a plain Python function into a Vertica UDF:
- Python UDF generation from a Python function
- Processes the data in blocks
- Scalable in-database computations
- Stored on all the cluster's nodes
- Flexible Python implementation
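"Processes the data in blocks" means the generated UDF sees a chunk of rows per call rather than one row at a time. A sketch of that control flow (the real vertica_sdk API differs; the names here are hypothetical and only illustrate the pattern):

```python
# Block-wise processing pattern: the engine feeds rows to the UDF
# in blocks, and results are collected block by block.

def process_block(rows, fn):
    """Apply fn to every row in one block (what one UDF call sees)."""
    return [fn(r) for r in rows]

def run_udf(all_rows, fn, block_size=2):
    """Simulate the engine: feed the data one block at a time."""
    out = []
    for start in range(0, len(all_rows), block_size):
        out.extend(process_block(all_rows[start:start + block_size], fn))
    return out

print(run_udf([1, 2, 3, 4, 5], lambda x: x * 10))  # [10, 20, 30, 40, 50]
```

Because each block is independent, Vertica can run the same function on every node's local data in parallel, which is what makes the computation scale.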
30. VerticaPy: The Python API for Vertica Data Science at Scale
Machine Learning • Big Data • Time Series • Dynamic Charts
New videos every Wednesday; subscribe to our channel.
www.github.com/vertica/VerticaPy
www.vertica.com/python/
www.linkedin.com/company/verticapy/