@martin_loetzsch
Dr. Martin Loetzsch
Data Council Meetup Kickoff in Berlin
May 2019
Data infrastructure
for the other 90% of companies
!2
Which technology?
@martin_loetzsch
!3
You are not Google
@martin_loetzsch
You are also not Amazon, LinkedIn, Uber etc.
https://www.slideshare.net/zhenxiao/real-time-analytics-at-uber-strata-data-2019
https://blog.bradfieldcs.com/you-are-not-google-84912cf44afb
!4
Hipster technologies
@martin_loetzsch
Say no sometimes
All the data of the company in one place


Data is
the single source of truth
easy to access
documented
embedded into the organisation
Integration of different domains
Main challenges
Consistency & correctness
Changeability
Complexity
Transparency
!5
Data warehouse = integrated data
@martin_loetzsch
Nowadays required for running a business
[Diagram: source systems (application databases, events, csv files, apis, crm, marketing, operation events, …) are integrated into the DWH (orders, users, products, price histories, emails, clicks, …), which serves reporting, search, pricing, …]
!6
Data Engineering
@martin_loetzsch
!7
Picking a database
@martin_loetzsch
Warehouse adoption from 2016 to today
Based on three years of segment.io customer adoption
(https://twitter.com/segment/status/1125891660800413697)

BigQuery
When you have terabytes of stuff
For raw event storage & processing


Snowflake
When you are not able to run a database yourself
When you can’t write efficient queries
Redshift
Can’t recommend for ETL
When OLAP performance becomes a problem
ClickHouse
For very fast OLAP
!8
If in doubt, use PostgreSQL
@martin_loetzsch
Boring, battle-tested work horse
!9
ETL/ Workflows
@martin_loetzsch
Data pipelines as code
SQL files, python & shell scripts
Structure & content of the data warehouse are the result of running code

Easy to debug & inspect
Develop locally, test on staging system, then deploy to production
Functional data engineering
Reproducibility, idempotency, (immutability).
Running a pipeline on the same inputs will always produce the same results. No side effects.
https://medium.com/@maximebeauchemin/functional-data-engineering-a-modern-paradigm-for-batch-data-processing-2327ec32c42a
!10
Make changing and testing things easy
@martin_loetzsch
Apply standard software engineering best practices
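To make the "no side effects" point concrete, here is a minimal sketch of an idempotent transformation step (a sketch only: psycopg2 as the driver and the schema, table and column names are assumptions made up for illustration). Re-running it on the same inputs always yields the same table, because the output is rebuilt from scratch instead of appended to.

import psycopg2  # assumption: the warehouse is PostgreSQL, as recommended above

def rebuild_customer_dimension(dsn: str):
    """Idempotent step: drops and recreates its output table on every run, touches nothing else"""
    with psycopg2.connect(dsn) as connection, connection.cursor() as cursor:
        cursor.execute('DROP TABLE IF EXISTS dim_next.customer;')
        cursor.execute("""
            CREATE TABLE dim_next.customer AS
            SELECT customer_id, count(*) AS number_of_orders
            FROM os_data.orders
            GROUP BY customer_id;""")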
from execute_sql import execute_sql_file, execute_sql_statment

# create utility functions in PostgreSQL
execute_sql_statment('DROP SCHEMA IF EXISTS util CASCADE; CREATE SCHEMA util;')
execute_sql_statment('CREATE EXTENSION IF NOT EXISTS pg_trgm;')
execute_sql_file('utils/indexes_and_constraints.sql', echo_queries=False)
execute_sql_file('utils/schema_switching.sql', echo_queries=False)
execute_sql_file('utils/consistency_checks.sql', echo_queries=False)

# create tmp and dim_next schema
execute_sql_statment('DROP SCHEMA IF EXISTS c_temp CASCADE; CREATE SCHEMA c_temp;')
execute_sql_statment('DROP SCHEMA IF EXISTS dim_next CASCADE; CREATE SCHEMA dim_next;')

# preprocess / combine entities
execute_sql_file('contacts/preprocess_contacts_dates.sql')
execute_sql_file('contacts/preprocess_contacts.sql')
execute_sql_file('organisations/preprocess_organizations.sql')
execute_sql_file('products/preprocess_products.sql')
execute_sql_file('deals/preprocess_dealflow.sql')
execute_sql_file('deals/preprocess_deal.sql')
execute_sql_file('marketing/preprocess_website_visitors.sql')
execute_sql_file('marketing/create_contacts_performance_attribution.sql')

# create reporting schema, establish foreign keys, transfer aggregates
execute_sql_file('contacts/transform_contacts.sql')
execute_sql_file('organisations/transform_organizations.sql')
execute_sql_file('organisations/create_org_data_set.sql')
execute_sql_file('deals/transform_deal.sql')
execute_sql_file('deals/flatten_deal_fact.sql')
execute_sql_file('deals/create_deal_data_set.sql')
execute_sql_file('marketing/transform_marketing_performance.sql')
execute_sql_file('targets/preprocess_target.sql')
execute_sql_file('targets/transform_target.sql')
execute_sql_file('constrain_tables.sql')

# consistency checks
execute_sql_file('consistency_checks.sql')

# replace the current version of the reporting schema with the next
execute_sql_statment("SELECT util.replace_schema('dim', 'dim_next')")
!11
Simple scripts
@martin_loetzsch
SQL, python & bash
!12
Schedule via Jenkins, Rundeck, etc.
@martin_loetzsch
Such a setup is usually enough for a lot of companies
Works well
For simple incremental transformations on large partitioned data sets
For moving data

Does not work so well
When you have complex business logic / a lot of pipeline branching
When you can’t partition your data well
Weird workarounds needed
For finding out when something went wrong
For running something again (need to manually delete task instances)
For deploying things while pipelines are running
For dynamic DAGs (decide at run time what to run)



Usually: BashOperator for everything
High devops complexity
!13
When you have actual big data: Airflow
@martin_loetzsch
Caveat: I know only very few teams that are very productive with it
Jinja-templated queries, DAGs created from dependencies
with source as (
    select * from {{var('ticket_tags_table')}}
),

renamed as (
    select
        --ids
        {{dbt_utils.surrogate_key('_sdc_level_0_id', '_sdc_source_key_id')}} as tag_id,
        _sdc_source_key_id as ticket_id,

        --fields
        nullif(lower(value), '') as ticket_tag_value
    from source
)

select * from renamed

Additional metadata in YAML files

version: 2

models:
  - name: zendesk_organizations
    columns:
      - name: organization_id
        tests:
          - unique
          - not_null
  - name: zendesk_tickets
    columns:
      - name: ticket_id
        tests:
          - unique
          - not_null
  - name: zendesk_users
    columns:
      - name: user_id
        tests:
          - unique
          - not_null
!14
Decent: DBT (data build tool)
@martin_loetzsch
Prerequisite: Data already in DB, everything can be done in SQL
!15
Mara: Things that worked for us
@martin_loetzsch
https://github.com/mara
!16
Mara
@martin_loetzsch
Runnable app
Integrates PyPI project download stats with GitHub repo events
!17
Example: Python project stats data warehouse
@martin_loetzsch
https://github.com/mara/mara-example-project
Example pipeline
pipeline = Pipeline(
    id="pypi",
    description="Builds a PyPI downloads cube using the public ..")

# ..

pipeline.add(
    Task(id="transform_python_version", description='..',
         commands=[
             ExecuteSQL(sql_file_name="transform_python_version.sql")
         ]),
    upstreams=['read_download_counts'])

pipeline.add(
    ParallelExecuteSQL(
        id="transform_download_counts", description="..",
        sql_statement="SELECT pypi_tmp.insert_download_counts(@chunk@::SMALLINT);",
        parameter_function=etl_tools.utils.chunk_parameter_function,
        parameter_placeholders=["@chunk@"],
        commands_before=[
            ExecuteSQL(sql_file_name="transform_download_counts.sql")
        ]),
    upstreams=["preprocess_project_version", "transform_installer",
               "transform_python_version"])
!18
ETL pipelines as code
@martin_loetzsch
Pipeline = list of tasks with dependencies between them. Task = list of commands
Target of computation

CREATE TABLE m_dim_next.region (
  region_id    SMALLINT PRIMARY KEY,
  region_name  TEXT NOT NULL UNIQUE,
  country_id   SMALLINT NOT NULL,
  country_name TEXT NOT NULL,
  _region_name TEXT NOT NULL
);

Do computation and store result in table

WITH raw_region
    AS (SELECT DISTINCT country, region
        FROM m_data.ga_session
        ORDER BY country, region)

INSERT INTO m_dim_next.region
SELECT
  row_number() OVER (ORDER BY country, region) AS region_id,
  CASE WHEN (SELECT count(DISTINCT country)
             FROM raw_region r2
             WHERE r2.region = r1.region) > 1
       THEN region || ' / ' || country
       ELSE region END                         AS region_name,
  dense_rank() OVER (ORDER BY country)         AS country_id,
  country                                      AS country_name,
  region                                       AS _region_name
FROM raw_region r1;

INSERT INTO m_dim_next.region
VALUES (-1, 'Unknown', -1, 'Unknown', 'Unknown');

Speedup subsequent transformations

SELECT util.add_index(
  'm_dim_next', 'region',
  column_names := ARRAY ['_region_name', 'country_name', 'region_id']);

SELECT util.add_index(
  'm_dim_next', 'region',
  column_names := ARRAY ['country_id', 'region_id']);

ANALYZE m_dim_next.region;
!19
PostgreSQL as a data processing engine
@martin_loetzsch
Leave data in DB, Tables as (intermediate) results of processing steps
Execute query

ExecuteSQL(sql_file_name="preprocess-ad.sql")

cat app/data_integration/pipelines/facebook/preprocess-ad.sql \
  | PGTZ=Europe/Berlin PGOPTIONS=--client-min-messages=warning \
    psql --username=mloetzsch --host=localhost --echo-all \
         --no-psqlrc --set ON_ERROR_STOP=on kfz_dwh_etl

Read file

ReadFile(file_name="country_iso_code.csv",
         compression=Compression.NONE,
         target_table="os_data.country_iso_code",
         mapper_script_file_name="read-country-iso-codes.py",
         delimiter_char=";")

cat "dwh-data/country_iso_code.csv" \
  | .venv/bin/python3.6 "app/data_integration/pipelines/load_data/read-country-iso-codes.py" \
  | PGTZ=Europe/Berlin PGOPTIONS=--client-min-messages=warning \
    psql --username=mloetzsch --host=localhost --echo-all \
         --no-psqlrc --set ON_ERROR_STOP=on kfz_dwh_etl \
         --command="COPY os_data.country_iso_code FROM STDIN WITH CSV DELIMITER AS ';'"

Copy from other databases

Copy(sql_file_name="pdm/load-product.sql", source_db_alias="pdm",
     target_table="os_data.product",
     replace={"@@db@@": "K24Pdm", "@@dbschema@@": "ps",
              "@@client@@": "kfzteile24 GmbH"})

cat app/data_integration/pipelines/load_data/pdm/load-product.sql \
  | sed "s/@@db@@/K24Pdm/g;s/@@dbschema@@/ps/g;s/@@client@@/kfzteile24 GmbH/g" \
  | sed 's/$/$/g;s/$/$/g' | (cat && echo ';') \
  | (cat && echo ';
go') \
  | sqsh -U ***** -P ******* -S ******* -D K24Pdm -m csv \
  | PGTZ=Europe/Berlin PGOPTIONS=--client-min-messages=warning \
    psql --username=mloetzsch --host=localhost --echo-all \
         --no-psqlrc --set ON_ERROR_STOP=on kfz_dwh_etl \
         --command="COPY os_data.product FROM STDIN WITH CSV HEADER"
!20
Shell commands as interface to data & DBs
@martin_loetzsch
Nothing is faster than a unix pipe
Read a set of files


pipeline.add(
    ParallelReadFile(
        id="read_download",
        description="Loads PyPI downloads from pre_downloaded csv files",
        file_pattern="*/*/*/pypi/downloads-v1.csv.gz",
        read_mode=ReadMode.ONLY_NEW,
        compression=Compression.GZIP,
        target_table="pypi_data.download",
        delimiter_char="\t", skip_header=True, csv_format=True,
        file_dependencies=read_download_file_dependencies,
        date_regex="^(?P<year>\d{4})/(?P<month>\d{2})/(?P<day>\d{2})/",
        partition_target_table_by_day_id=True,
        timezone="UTC",
        commands_before=[
            ExecuteSQL(
                sql_file_name="create_download_data_table.sql",
                file_dependencies=read_download_file_dependencies)
        ]))

Split large joins into chunks

pipeline.add(
    ParallelExecuteSQL(
        id="transform_download",
        description="Maps downloads to their dimensions",
        sql_statement="SELECT pypi_tmp.insert_download(@chunk@::SMALLINT);",
        parameter_function=etl_tools.utils.chunk_parameter_function,
        parameter_placeholders=["@chunk@"],
        commands_before=[
            ExecuteSQL(sql_file_name="transform_download.sql")
        ]),
    upstreams=["preprocess_project_version", "transform_installer"])
!21
Incremental & parallel processing
@martin_loetzsch
You can’t join all clicks with all customers at once
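A hypothetical sketch of the chunking idea behind the ParallelExecuteSQL examples above (this is not Mara's actual etl_tools.utils.chunk_parameter_function; the chunk count and the modulo-on-id scheme are assumptions for illustration):

number_of_chunks = 10  # assumption: roughly the parallelism the DWH server can handle

def chunk_parameter_function():
    """Parameter values that ParallelExecuteSQL substitutes for @chunk@ in the SQL statement"""
    return list(range(number_of_chunks))

# The SQL function called per chunk then restricts itself to its share of the rows,
# e.g. ... WHERE download_id % 10 = chunk ..., so the chunks are disjoint, together
# cover the whole table, and can be processed in parallel.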
!22
(Data) security
@martin_loetzsch
Always: At least two out of
VPN
IP restriction
SSH tunnel
Don’t rely on passwords, they tend to be shared and are not
changed when someone leaves the company
SSH keys
Single sign-on
!23
Please don’t put your data on the internet
@martin_loetzsch
Also: GDPR
Authenticates each incoming request against an auth provider
We run it in front of all web interfaces
Including Tableau, Jenkins, Metabase
Many auth providers: Google, Azure AD, etc.
Much better than passwords!
!24
SSO with oauth2_proxy
@martin_loetzsch
https://github.com/pusher/oauth2_proxy
Implemented using the auth_request directive in nginx
server {
  listen 443 default ssl;
  include ssl.conf;

  location /oauth2/ {
    proxy_pass http://127.0.0.1:4180;
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Scheme $scheme;
    proxy_set_header X-Forwarded-Proto $scheme;
    proxy_set_header X-Auth-Request-Redirect $request_uri;
  }

  location = /oauth2/auth {
    proxy_pass http://127.0.0.1:4180;
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Scheme $scheme;
    proxy_set_header X-Forwarded-Proto $scheme;
    # nginx auth_request includes headers but not body
    proxy_set_header Content-Length "";
    proxy_pass_request_body off;
  }

  location / {
    auth_request /oauth2/auth;
    error_page 401 = /oauth2/sign_in;

    auth_request_set $email $upstream_http_x_auth_request_email;
    proxy_set_header X-Forwarded-Email $email;

    auth_request_set $auth_cookie $upstream_http_set_cookie;
    add_header Set-Cookie $auth_cookie;

    proxy_set_header Host $host;
    proxy_set_header X-Scheme $scheme;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Forwarded-Proto $scheme;

    proxy_send_timeout 600;
    proxy_read_timeout 600;
    proxy_buffering off;
    send_timeout 600;

    # pass to downstreams
    proxy_pass http://127.0.0.1:81;
  }

  access_log /bi/logs/nginx/access-443-default.log;
  error_log /bi/logs/nginx/error-443-default.log;
}
!25
SSO with oauth2_proxy II
@martin_loetzsch
https://github.com/pusher/oauth2_proxy
!26
Event collection
@martin_loetzsch
Ad Blockers
Blind Spots
Not all user interactions happen online (returns, call center requests)
Some data cannot be leaked to a pixel: segments, prices, backend logic
Solution: server-side tracking
Collect events on the server rather than on the client
Use cases & advantages
Ground truth: correct metrics recorded in marketing tools
Price: provide a cheaper alternative to GA Premium / Segment etc.
GDPR compliance: own data, avoid black boxes
Product analytics: Understand bottlenecks in user journey
Unified user journey: combine events from multiple touchpoints
Site speed: Front-ends are not slowed down by analytics pixels
SEO: Measure site indexing by search engines
!27
Pixel based tracking is dead
@martin_loetzsch
Server side tracking: surprisingly easy
!28
Works: Kinesis + AWS Lambda + S3 + BigQuery
@martin_loetzsch
Technologies don’t matter: would also work with Google cloud & Azure
[Architecture: user → Project A website → AWS Kinesis Firehose → AWS Lambda → Amazon S3 + Google BigQuery]
Everything the backend knows about the user, the context, the content
{
"visitor_id": "10d1fa9c9fd39cff44c88bd551b1ab4dfe92b3da",
"session_id": "9tv1phlqkl5kchajmb9k2j2434",
"timestamp": "2018-12-16T16:03:04+00:00",
"ip": "92.195.48.163",
"url": "https://www.project-a.com/en/career/jobs/data-engineer-data-scientist-m-f-d-4072020002?gh_jid=4082751002&gh_src=9fcd30182&&utm_medium=social&u
"host": "www.project-a.com",
"path": [
"en",
"career",
"jobs",
"data-engineer-data-scientist-m-f-d-4072020002"
],
"query": {
"gh_jid": "4082751002",
"gh_src": "9fcd30182",
"utm_medium": "social",
"utm_source": "linkedin",
"utm_campaign": "buffer"
},
"referrer": null,
"language": "en",
"ua": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36"
}
!29
Backend collects user interaction data
@martin_loetzsch
And sends it asynchronously to a queue
Kirby tracking plugin for Project A web site
<?php

require 'vendor/autoload.php';

// this cookie is set when not present
$cookieName = 'visitor';

// retrieve visitor id from cookie
$visitorId = array_key_exists($cookieName, $_COOKIE) ?
    $_COOKIE[$cookieName] : null;

if (!$visitorId) {
    // visitor cookie not set. Use session id as visitor ID
    $visitorId = sha1(uniqid(s::id(), true));
    setcookie($cookieName, $visitorId,
        time() + (2 * 365 * 24 * 60 * 60), '/',
        'project-a.com', false, true);
}

$request = kirby()->request();

// the payload to log
$event = [
    'visitor_id' => $visitorId,
    'session_id' => s::id(),
    'timestamp' => date('c'),
    'ip' => $request->__debuginfo()['ip'],
    'url' => $request->url(),
    'host' => $_SERVER['SERVER_NAME'],
    'path' => $request->path()->toArray(),
    'query' => $request->query()->toArray(),
    'referrer' => visitor::referer(),
    'language' => visitor::acceptedLanguageCode(),
    'ua' => visitor::userAgent()
];

$firehoseClient = new \Aws\Firehose\FirehoseClient([
    // secrets
]);

// publish message to firehose delivery stream
$promise = $firehoseClient->putRecordAsync([
    'DeliveryStreamName' => 'kinesis-firehose-stream-123',
    'Record' => ['Data' => json_encode($event)]
]);

register_shutdown_function(function () use ($promise) {
    $promise->wait();
});
!30
Custom implementation for each web site
@martin_loetzsch
Good news: there is usually already code that populates the data layer
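For backends that are not PHP the same pattern is a few lines of Python; a minimal sketch using boto3's Firehose client (region, stream name and event fields are placeholders, not the actual Project A setup):

import json
from datetime import datetime, timezone

import boto3  # assumes AWS credentials are configured in the environment

firehose = boto3.client('firehose', region_name='eu-central-1')  # region is an assumption

def track_event(visitor_id: str, session_id: str, url: str, **extra):
    """Send a tracking event to a Kinesis Firehose delivery stream (fire and forget)"""
    event = {
        'visitor_id': visitor_id,
        'session_id': session_id,
        'timestamp': datetime.now(timezone.utc).isoformat(),
        'url': url,
        **extra,
    }
    firehose.put_record(
        DeliveryStreamName='kinesis-firehose-stream-123',  # placeholder stream name
        Record={'Data': (json.dumps(event) + '\n').encode('utf-8')})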
Result: enhanced with GEO information & device detection
{
"visitor_id": "10d1fa9c9fd39cff44c88bd551b1ab4dfe92b3da",
"session_id": "9tv1phlqkl5kchajmb9k2j2434",
"timestamp": "2018-12-16T16:03:04+00:00",
"url": "https://www.project-a.com/en/career/jobs/data-engineer-
data-scientist-m-f-d-4072020002?
gh_jid=4082751002&gh_src=9fcd30182&&utm_medium=social&utm_source=li
nkedin&utm_campaign=buffer",
"host": "www.project-a.com",
"path": [
"en",
"career",
"jobs",
"data-engineer-data-scientist-m-f-d-4072020002"
],
"query": [
{
"param": "gh_jid",
"value": "4082751002"
},
{
"param": "gh_src",
"value": "9fcd30182"
},
{
"param": "utm_medium",
"value": "social"
},
{
"param": "utm_source",
"value": "linkedin"
},
{
"param": "utm_campaign",
"value": "buffer"
}
],
"referrer": null,
"language": "en",
"browser_family": "Chrome",
"browser_version": "70",
"os_family": "Mac OS X",
"os_version": "10",
"device_brand": null,
"device_model": null,
"country_iso_code": "DE",
"country_name": "Germany",
"subdivisions_iso_code": "SN",
"subdivisions_name": "Saxony",
"city_name": "Dresden"
}
!31
Lambda function transforms and stores events
@martin_loetzsch
GDPR: remove IP address, user agent
Lambda function part I
import base64
import functools
import json

import geoip2.database
from google.cloud import bigquery
from ua_parser import user_agent_parser


@functools.lru_cache(maxsize=None)
def get_geo_db():
    return geoip2.database.Reader('./GeoLite2-City_20181002/GeoLite2-City.mmdb')


def extract_geo_data(ip):
    """Does a geo lookup for an IP address"""
    response = get_geo_db().city(ip)
    return {
        'country_iso_code': response.country.iso_code,
        'country_name': response.country.name,
        'subdivisions_iso_code': response.subdivisions.most_specific.iso_code,
        'subdivisions_name': response.subdivisions.most_specific.name,
        'city_name': response.city.name
    }


def parse_user_agent(user_agent_string):
    """Extracts browser, OS and device information from a user agent"""
    result = user_agent_parser.Parse(user_agent_string)
    return {
        'browser_family': result['user_agent']['family'],
        'browser_version': result['user_agent']['major'],
        'os_family': result['os']['family'],
        'os_version': result['os']['major'],
        'device_brand': result['device']['brand'],
        'device_model': result['device']['model']
    }
!32
Geo Lookup & device detection
@martin_loetzsch
Solved problem, very good open source libraries exist
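As a usage sketch, applying the two helpers above to the sample event from the earlier slide (the commented output values match the enriched event shown before, assuming the bundled GeoLite2 database resolves that IP the same way):

event = {
    'ip': '92.195.48.163',
    'ua': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_1) AppleWebKit/537.36 '
          '(KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36',
}
print(extract_geo_data(event['ip']))   # e.g. {'country_iso_code': 'DE', ..., 'city_name': 'Dresden'}
print(parse_user_agent(event['ua']))   # e.g. {'browser_family': 'Chrome', 'browser_version': '70', ...}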
Lambda function part II
def lambda_handler(event, context):
    lambda_output_records = []
    rows_for_biguery = []

    bq_client = bigquery.Client.from_service_account_json(
        'big-query-credendials.json'
    )

    for record in event['records']:
        message = json.loads(base64.b64decode(record['data']))

        # extract browser, device, os
        if message['ua']:
            message.update(parse_user_agent(message['ua']))
        del message['ua']

        # geo lookup for ip address
        message.update(extract_geo_data(message['ip']))
        del message['ip']

        # update get parameters
        if message['query']:
            message['query'] = [
                {'param': param, 'value': value}
                for param, value in message['query'].items()
            ]

        rows_for_biguery.append(message)

        lambda_output_records.append({
            'recordId': record['recordId'],
            'result': 'Ok',
            'data': base64.b64encode(
                json.dumps(message).encode('utf-8')).decode('utf-8')
        })

    errors = bq_client.insert_rows(
        bq_client.get_table(
            bq_client.dataset('server_side_tracking')
                     .table('project_a_website_events')
        ),
        rows_for_biguery)

    if errors != []:
        raise Exception(json.dumps(errors))

    return {
        "statusCode": 200,
        "records": lambda_output_records
    }
!33
Send events to BigQuery & S3
@martin_loetzsch
Also possible: send to Google Analytics, Mixpanel, Segment, Heap etc.
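For example, a hedged sketch of forwarding such an event from the Lambda function to Google Analytics via the Universal Analytics Measurement Protocol (the tracking id and the field mapping are placeholders, not the setup shown in this deck):

import requests  # would need to be bundled into the Lambda deployment package

def forward_to_google_analytics(message: dict):
    """Replay a server-side event as a pageview hit against the Measurement Protocol endpoint"""
    requests.post('https://www.google-analytics.com/collect', data={
        'v': 1,                          # protocol version
        'tid': 'UA-XXXXXXXX-1',          # placeholder tracking id
        'cid': message['visitor_id'],    # client id: reuse the visitor cookie
        't': 'pageview',
        'dh': message['host'],
        'dp': '/' + '/'.join(message['path']),
    }, timeout=2)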
!34
Thank you!
@martin_loetzsch
Questions?
!35
Bonus track: consistency & correctness
@martin_loetzsch
It’s easy to make mistakes during ETL


DROP SCHEMA IF EXISTS s CASCADE; CREATE SCHEMA s;



CREATE TABLE s.city (
city_id SMALLINT,
city_name TEXT,
country_name TEXT
);

INSERT INTO s.city VALUES
(1, 'Berlin', 'Germany'),
(2, 'Budapest', 'Hungary');



CREATE TABLE s.customer (
customer_id BIGINT,
city_fk SMALLINT
);

INSERT INTO s.customer VALUES
(1, 1),
(1, 2),
(2, 3);

Customers per country?


SELECT
country_name,
count(*) AS number_of_customers
FROM s.customer JOIN s.city 

ON customer.city_fk = s.city.city_id
GROUP BY country_name;



Back up all assumptions about data by constraints


ALTER TABLE s.city ADD PRIMARY KEY (city_id);
ALTER TABLE s.city ADD UNIQUE (city_name);
ALTER TABLE s.city ADD UNIQUE (city_name, country_name);


ALTER TABLE s.customer ADD PRIMARY KEY (customer_id);
[23505] ERROR: could not create unique index "customer_pkey"
Detail: Key (customer_id)=(1) is duplicated.

ALTER TABLE s.customer ADD FOREIGN KEY (city_fk)
REFERENCES s.city (city_id);
[23503] ERROR: insert or update on table "customer" violates
foreign key constraint "customer_city_fk_fkey"
Detail: Key (city_fk)=(3) is not present in table "city"
!36
Referential consistency
@martin_loetzsch
Only very little overhead, will save your ass
[Schema diagram: customer (customer_id, first_order_fk, favourite_product_fk, lifetime_revenue), order (order_id, processed_order_id, customer_fk, product_fk, revenue), product (product_id, revenue_last_6_months)]
Never repeat “business logic”


SELECT sum(total_price) AS revenue
FROM os_data.order
WHERE status IN ('pending', 'accepted', 'completed', 'proposal_for_change');


SELECT CASE WHEN (status <> 'started'
                  AND payment_status = 'authorised'
                  AND order_type <> 'backend')
            THEN o.order_id END AS processed_order_fk
FROM os_data.order;


SELECT (last_status = 'pending') :: INTEGER AS is_unprocessed
FROM os_data.order;

Refactor pipeline
Create separate task that computes everything we know about an order
Usually difficult in real life
Load → preprocess → transform → flatten-fact
!37
Computational consistency
@martin_loetzsch
Requires discipline
load-product load-order load-customer
preprocess-product preprocess-order preprocess-customer
transform-product transform-order transform-customer
flatten-product-fact flatten-order-fact flatten-customer-fact
CREATE FUNCTION m_tmp.normalize_utm_source(TEXT)
  RETURNS TEXT AS $$
SELECT CASE
         WHEN $1 LIKE '%.%' THEN lower($1)
         WHEN $1 = '(direct)' THEN 'Direct'
         WHEN $1 LIKE 'Untracked%' OR $1 LIKE '(%)' THEN $1
         ELSE initcap($1)
       END;
$$ LANGUAGE SQL IMMUTABLE;


CREATE FUNCTION util.norm_phone_number(phone_number TEXT)
  RETURNS TEXT AS $$
BEGIN
  phone_number := TRIM(phone_number);
  phone_number := regexp_replace(phone_number, '\(0\)', '');
  phone_number := regexp_replace(phone_number, '[^[:digit:]]', '', 'g');
  phone_number := regexp_replace(phone_number, '^(\+49|0049|49)', '0');
  phone_number := regexp_replace(phone_number, '^(00)', '');
  phone_number := COALESCE(phone_number, '');
  RETURN phone_number;
END;
$$ LANGUAGE PLPGSQL IMMUTABLE;


CREATE FUNCTION m_tmp.compute_ad_id(id BIGINT, api m_tmp.API)
  RETURNS BIGINT AS $$
-- creates a collision free ad id from an id in a source system
SELECT ((CASE api
           WHEN 'adwords' THEN 1
           WHEN 'bing' THEN 2
           WHEN 'criteo' THEN 3
           WHEN 'facebook' THEN 4
           WHEN 'backend' THEN 5
         END) * 10 ^ 18) :: BIGINT + id
$$ LANGUAGE SQL IMMUTABLE;


CREATE FUNCTION pv.date_to_supplier_period_start(INTEGER)
  RETURNS INTEGER AS $$
-- this maps all dates to either an integer which is included
-- in lieferantenrabatt.period_start or
-- null (meaning we don't have a lieferantenrabatt for it)
SELECT CASE
         WHEN $1 >= 20170501 THEN 20170501
         WHEN $1 >= 20151231 THEN 20151231
         ELSE 20151231
       END;
$$ LANGUAGE SQL IMMUTABLE;
!38
When not possible: use functions
@martin_loetzsch
Almost no performance overhead
Check for “lost” rows


SELECT util.assert_equal(
'The order items fact table should contain all order items',
'SELECT count(*) FROM os_dim.order_item',
'SELECT count(*) FROM os_dim.order_items_fact');







Check consistency across cubes / domains


SELECT util.assert_almost_equal(
'The number of first orders should be the same in '

|| 'orders and marketing touchpoints cube',
'SELECT count(net_order_id)
FROM os_dim.order
WHERE _net_order_rank = 1;',

'SELECT (SELECT sum(number_of_first_net_orders)
FROM m_dim.acquisition_performance)
/ (SELECT count(*)
FROM m_dim.performance_attribution_model)',
1.0
);

Check completeness of source data


SELECT util.assert_not_found(
'Each adwords campaign must have the attribute "Channel"',
'SELECT DISTINCT campaign_name, account_name
FROM aw_tmp.ad
JOIN aw_dim.ad_performance ON ad_fk = ad_id
WHERE attributes->>''Channel'' IS NULL
AND impressions > 0
AND _date > now() - INTERVAL ''30 days''');



Check correctness of redistribution transformations


SELECT util.assert_almost_equal_relative(
'The cost of non-converting touchpoints must match the'
|| 'redistributed customer acquisition and reactivation cost',
'SELECT sum(cost)
FROM m_tmp.cost_of_non_converting_touchpoints;',
'SELECT
(SELECT sum(cost_per_touchpoint * number_of_touchpoints)
FROM m_tmp.redistributed_customer_acquisition_cost)
+ (SELECT sum(cost_per_touchpoint * number_of_touchpoints)
FROM m_tmp.redistributed_customer_reactivation_cost);',
0.00001);
!39
Data consistency checks
@martin_loetzsch
Makes changing things easy
Execute queries and compare results

CREATE FUNCTION util.assert(description TEXT, query TEXT)
RETURNS BOOLEAN AS $$
DECLARE
succeeded BOOLEAN;
BEGIN
EXECUTE query INTO succeeded;
IF NOT succeeded THEN RAISE EXCEPTION 'assertion failed:
# % #
%', description, query;
END IF;
RETURN succeeded;
END
$$ LANGUAGE 'plpgsql';







CREATE FUNCTION util.assert_almost_equal_relative(
description TEXT, query1 TEXT,
query2 TEXT, percentage DECIMAL)
RETURNS BOOLEAN AS $$
DECLARE
result1 NUMERIC;
result2 NUMERIC;
succeeded BOOLEAN;
BEGIN
EXECUTE query1 INTO result1;
EXECUTE query2 INTO result2;
EXECUTE 'SELECT abs(' || result2 || ' - ' || result1 || ') / '
|| result1 || ' < ' || percentage INTO succeeded;
IF NOT succeeded THEN RAISE WARNING '%
assertion failed: abs(% - %) / % < %
%: (%)
%: (%)', description, result2, result1, result1, percentage,
result1, query1, result2, query2;
END IF;
RETURN succeeded;
END
$$ LANGUAGE 'plpgsql';
!40
Consistency check functions
@martin_loetzsch
Also: assert_not_found, assert_equal_table, assert_smaller_than_or_equal
Yes, unit tests
SELECT util.assert_value_equal('test_german_number_with_country_prefix', util.norm_phone_number('00491234'), '01234');
SELECT util.assert_value_equal('test_german_number_not_to_be_confused_with_country_prefix', util.norm_phone_number('0491234'), '0491234');
SELECT util.assert_value_equal('test_non_german_number_with_plus', util.norm_phone_number('+44 1234'), '441234');
SELECT util.assert_value_equal('test_german_number_with_prefix_and_additional_zero', util.norm_phone_number('+49 (0)1234'), '01234');
SELECT util.assert_value_equal('test__trim', util.norm_phone_number(' 0491234 '), '0491234');
SELECT util.assert_value_equal('test_number_with_leading_wildcard_symbol', util.norm_phone_number('*+436504834933'), '436504834933');
SELECT util.assert_value_equal('test_NULL', util.norm_phone_number(NULL), '');
SELECT util.assert_value_equal('test_empty', util.norm_phone_number(''), '');
SELECT util.assert_value_equal('test_wildcard_only', util.norm_phone_number('*'), '');
SELECT util.assert_value_equal('test_foreign_number_with_two_leading_zeroes', util.norm_phone_number('*00436769553701'), '436769553701');
SELECT util.assert_value_equal('test_domestic_number_with_trailing_letters', util.norm_phone_number('017678402HORST'), '017678402');
SELECT util.assert_value_equal('test_domestic_number_with_leading_letters', util.norm_phone_number('HORST017678402'), '017678402');
SELECT util.assert_value_equal('test_domestic_number_with_letters_in_between', util.norm_phone_number('0H1O7R6S7T8402'), '017678402');
SELECT util.assert_value_equal('test_german_number_with_country_prefix_and_leading_letters',
                               util.norm_phone_number('HORST00491234'), '01234');
SELECT util.assert_value_equal('test_german_number_not_to_be_confused_with_country_prefix_and_leading_letters',
                               util.norm_phone_number('HORST0491234'), '0491234');
SELECT util.assert_value_equal('test_non_german_number_with_plus_and_leading_letters', util.norm_phone_number('HORST+44 1234'), '441234');
SELECT util.assert_value_equal('test_german_number_with_prefix_and_additional_zero_and_leading_letters',
                               util.norm_phone_number('HORST+49 (0)1234'), '01234');
!41
Unit tests
@martin_loetzsch
People enter horrible telephone numbers into websites
Contribution margin 3a
SELECT order_item_id,
((((((COALESCE(item_net_price, 0)::REAL
+ COALESCE(net_shipping_revenue, 0)::REAL)
- ((COALESCE(item_net_purchase_price, 0)::REAL
+ COALESCE(alcohol_tax, 0)::REAL)
+ COALESCE(import_tax, 0)::REAL))
- (COALESCE(net_fulfillment_costs, 0)::REAL
+ COALESCE(net_payment_costs, 0)::REAL))
- COALESCE(net_return_costs, 0)::REAL)
- ((COALESCE(item_net_price, 0)::REAL
+ COALESCE(net_shipping_revenue, 0)::REAL)
- ((((COALESCE(item_net_price, 0)::REAL
+ COALESCE(item_tax_amount, 0)::REAL)
+ COALESCE(gross_shipping_revenue, 0)::REAL)
- COALESCE(voucher_gross_amount, 0)::REAL)
* (1 - ((COALESCE(item_tax_amount, 0)::REAL
+ (COALESCE(gross_shipping_revenue, 0)::REAL
- COALESCE(net_shipping_revenue, 0)::REAL))
/ NULLIF(((COALESCE(item_net_price, 0)::REAL
+ COALESCE(item_tax_amount, 0)::REAL)
+ COALESCE(gross_shipping_revenue,
0)::REAL), 0))))))
- COALESCE(goodie_cost_per_item, 0)::REAL) :: DOUBLE PRECISION
AS "Contribution margin 3a"
FROM dim.sales_fact;
Use schemas between reporting and database
Mondrian
LookerML
your own
Or: Pre-compute metrics in database
!42
Semantic consistency
@martin_loetzsch
Changing the meaning of metrics across all dashboards needs to be easy

More Related Content

What's hot

MySQL partitions tutorial
MySQL partitions tutorialMySQL partitions tutorial
MySQL partitions tutorialGiuseppe Maxia
 
Streaming sql and druid
Streaming sql and druid Streaming sql and druid
Streaming sql and druid arupmalakar
 
Streaming 101 Revisited: A Fresh Hot Take With Tyler Akidau and Dan Sotolongo...
Streaming 101 Revisited: A Fresh Hot Take With Tyler Akidau and Dan Sotolongo...Streaming 101 Revisited: A Fresh Hot Take With Tyler Akidau and Dan Sotolongo...
Streaming 101 Revisited: A Fresh Hot Take With Tyler Akidau and Dan Sotolongo...HostedbyConfluent
 
Data Preparation Fundamentals
Data Preparation FundamentalsData Preparation Fundamentals
Data Preparation FundamentalsDATAVERSITY
 
Snowflake Data Governance
Snowflake Data GovernanceSnowflake Data Governance
Snowflake Data Governancessuser538b022
 
Snowflake + Power BI: Cloud Analytics for Everyone
Snowflake + Power BI: Cloud Analytics for EveryoneSnowflake + Power BI: Cloud Analytics for Everyone
Snowflake + Power BI: Cloud Analytics for EveryoneAngel Abundez
 
Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...
Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...
Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...DataWorks Summit
 
BigQuery best practices and recommendations to reduce costs with BI Engine, S...
BigQuery best practices and recommendations to reduce costs with BI Engine, S...BigQuery best practices and recommendations to reduce costs with BI Engine, S...
BigQuery best practices and recommendations to reduce costs with BI Engine, S...Márton Kodok
 
Predicting Flights with Azure Databricks
Predicting Flights with Azure DatabricksPredicting Flights with Azure Databricks
Predicting Flights with Azure DatabricksSarah Dutkiewicz
 
Introduction to Impala
Introduction to ImpalaIntroduction to Impala
Introduction to Impalamarkgrover
 
Programming in Spark using PySpark
Programming in Spark using PySpark      Programming in Spark using PySpark
Programming in Spark using PySpark Mostafa
 
Snowflake: Your Data. No Limits (Session sponsored by Snowflake) - AWS Summit...
Snowflake: Your Data. No Limits (Session sponsored by Snowflake) - AWS Summit...Snowflake: Your Data. No Limits (Session sponsored by Snowflake) - AWS Summit...
Snowflake: Your Data. No Limits (Session sponsored by Snowflake) - AWS Summit...Amazon Web Services
 
How to Use a Semantic Layer to Deliver Actionable Insights at Scale
How to Use a Semantic Layer to Deliver Actionable Insights at ScaleHow to Use a Semantic Layer to Deliver Actionable Insights at Scale
How to Use a Semantic Layer to Deliver Actionable Insights at ScaleDATAVERSITY
 
Data Architecture Strategies: Data Architecture for Digital Transformation
Data Architecture Strategies: Data Architecture for Digital TransformationData Architecture Strategies: Data Architecture for Digital Transformation
Data Architecture Strategies: Data Architecture for Digital TransformationDATAVERSITY
 
Should I move my database to the cloud?
Should I move my database to the cloud?Should I move my database to the cloud?
Should I move my database to the cloud?James Serra
 
35 power bi presentations
35 power bi presentations35 power bi presentations
35 power bi presentationsSean Brady
 
MongoDB at Scale
MongoDB at ScaleMongoDB at Scale
MongoDB at ScaleMongoDB
 
Snowflake Company Presentation
Snowflake Company PresentationSnowflake Company Presentation
Snowflake Company PresentationAndrewJiang18
 
CDC patterns in Apache Kafka®
CDC patterns in Apache Kafka®CDC patterns in Apache Kafka®
CDC patterns in Apache Kafka®confluent
 

What's hot (20)

MySQL partitions tutorial
MySQL partitions tutorialMySQL partitions tutorial
MySQL partitions tutorial
 
Streaming sql and druid
Streaming sql and druid Streaming sql and druid
Streaming sql and druid
 
Streaming 101 Revisited: A Fresh Hot Take With Tyler Akidau and Dan Sotolongo...
Streaming 101 Revisited: A Fresh Hot Take With Tyler Akidau and Dan Sotolongo...Streaming 101 Revisited: A Fresh Hot Take With Tyler Akidau and Dan Sotolongo...
Streaming 101 Revisited: A Fresh Hot Take With Tyler Akidau and Dan Sotolongo...
 
Data Preparation Fundamentals
Data Preparation FundamentalsData Preparation Fundamentals
Data Preparation Fundamentals
 
Snowflake Data Governance
Snowflake Data GovernanceSnowflake Data Governance
Snowflake Data Governance
 
Snowflake + Power BI: Cloud Analytics for Everyone
Snowflake + Power BI: Cloud Analytics for EveryoneSnowflake + Power BI: Cloud Analytics for Everyone
Snowflake + Power BI: Cloud Analytics for Everyone
 
Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...
Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...
Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...
 
BigQuery best practices and recommendations to reduce costs with BI Engine, S...
BigQuery best practices and recommendations to reduce costs with BI Engine, S...BigQuery best practices and recommendations to reduce costs with BI Engine, S...
BigQuery best practices and recommendations to reduce costs with BI Engine, S...
 
Predicting Flights with Azure Databricks
Predicting Flights with Azure DatabricksPredicting Flights with Azure Databricks
Predicting Flights with Azure Databricks
 
Introduction to Impala
Introduction to ImpalaIntroduction to Impala
Introduction to Impala
 
Programming in Spark using PySpark
Programming in Spark using PySpark      Programming in Spark using PySpark
Programming in Spark using PySpark
 
Snowflake: Your Data. No Limits (Session sponsored by Snowflake) - AWS Summit...
Snowflake: Your Data. No Limits (Session sponsored by Snowflake) - AWS Summit...Snowflake: Your Data. No Limits (Session sponsored by Snowflake) - AWS Summit...
Snowflake: Your Data. No Limits (Session sponsored by Snowflake) - AWS Summit...
 
How to Use a Semantic Layer to Deliver Actionable Insights at Scale
How to Use a Semantic Layer to Deliver Actionable Insights at ScaleHow to Use a Semantic Layer to Deliver Actionable Insights at Scale
How to Use a Semantic Layer to Deliver Actionable Insights at Scale
 
Data Architecture Strategies: Data Architecture for Digital Transformation
Data Architecture Strategies: Data Architecture for Digital TransformationData Architecture Strategies: Data Architecture for Digital Transformation
Data Architecture Strategies: Data Architecture for Digital Transformation
 
Should I move my database to the cloud?
Should I move my database to the cloud?Should I move my database to the cloud?
Should I move my database to the cloud?
 
35 power bi presentations
35 power bi presentations35 power bi presentations
35 power bi presentations
 
MongoDB at Scale
MongoDB at ScaleMongoDB at Scale
MongoDB at Scale
 
Snowflake Company Presentation
Snowflake Company PresentationSnowflake Company Presentation
Snowflake Company Presentation
 
CDC patterns in Apache Kafka®
CDC patterns in Apache Kafka®CDC patterns in Apache Kafka®
CDC patterns in Apache Kafka®
 
Snowflake Overview
Snowflake OverviewSnowflake Overview
Snowflake Overview
 

Similar to Data infrastructure for the other 90% of companies

Data Warehousing with Python
Data Warehousing with PythonData Warehousing with Python
Data Warehousing with PythonMartin Loetzsch
 
Lightweight ETL pipelines with mara (PyData Berlin September Meetup)
Lightweight ETL pipelines with mara (PyData Berlin September Meetup)Lightweight ETL pipelines with mara (PyData Berlin September Meetup)
Lightweight ETL pipelines with mara (PyData Berlin September Meetup)Martin Loetzsch
 
DN 2017 | Reducing pain in data engineering | Martin Loetzsch | Project A
DN 2017 | Reducing pain in data engineering | Martin Loetzsch | Project ADN 2017 | Reducing pain in data engineering | Martin Loetzsch | Project A
DN 2017 | Reducing pain in data engineering | Martin Loetzsch | Project ADataconomy Media
 
CARTO en 5 Pasos: del Dato a la Toma de Decisiones [CARTO]
CARTO en 5 Pasos: del Dato a la Toma de Decisiones [CARTO]CARTO en 5 Pasos: del Dato a la Toma de Decisiones [CARTO]
CARTO en 5 Pasos: del Dato a la Toma de Decisiones [CARTO]CARTO
 
The Sum of our Parts: the Complete CARTO Journey [CARTO]
The Sum of our Parts: the Complete CARTO Journey [CARTO]The Sum of our Parts: the Complete CARTO Journey [CARTO]
The Sum of our Parts: the Complete CARTO Journey [CARTO]CARTO
 
Evolutionary db development
Evolutionary db development Evolutionary db development
Evolutionary db development Open Party
 
Best Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache SparkBest Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache SparkDatabricks
 
MSBI and Data WareHouse techniques by Quontra
MSBI and Data WareHouse techniques by Quontra MSBI and Data WareHouse techniques by Quontra
MSBI and Data WareHouse techniques by Quontra QUONTRASOLUTIONS
 
14 functional design
14 functional design14 functional design
14 functional designrandhirlpu
 
Overview of business intelligence
Overview of business intelligenceOverview of business intelligence
Overview of business intelligenceAhsan Kabir
 
IRMUK-SOA_for_MDM_DQ_Integration_DV_20min
IRMUK-SOA_for_MDM_DQ_Integration_DV_20minIRMUK-SOA_for_MDM_DQ_Integration_DV_20min
IRMUK-SOA_for_MDM_DQ_Integration_DV_20minDigendra Vir Singh (DV)
 
Basics of Microsoft Business Intelligence and Data Integration Techniques
Basics of Microsoft Business Intelligence and Data Integration TechniquesBasics of Microsoft Business Intelligence and Data Integration Techniques
Basics of Microsoft Business Intelligence and Data Integration TechniquesValmik Potbhare
 
Love Your Database Railsconf 2017
Love Your Database Railsconf 2017Love Your Database Railsconf 2017
Love Your Database Railsconf 2017gisborne
 
VSSML18. Data Transformations
VSSML18. Data TransformationsVSSML18. Data Transformations
VSSML18. Data TransformationsBigML, Inc
 
Analyst View of Data Virtualization: Conversations with Boulder Business Inte...
Analyst View of Data Virtualization: Conversations with Boulder Business Inte...Analyst View of Data Virtualization: Conversations with Boulder Business Inte...
Analyst View of Data Virtualization: Conversations with Boulder Business Inte...Denodo
 
Databaseconcepts
DatabaseconceptsDatabaseconcepts
Databaseconceptsdilipkkr
 
What Are the Key Steps in Scraping Product Data from Amazon India.pptx
What Are the Key Steps in Scraping Product Data from Amazon India.pptxWhat Are the Key Steps in Scraping Product Data from Amazon India.pptx
What Are the Key Steps in Scraping Product Data from Amazon India.pptxProductdata Scrape
 

Similar to Data infrastructure for the other 90% of companies (20)

Data Warehousing with Python
Data Warehousing with PythonData Warehousing with Python
Data Warehousing with Python
 
Lightweight ETL pipelines with mara (PyData Berlin September Meetup)
Lightweight ETL pipelines with mara (PyData Berlin September Meetup)Lightweight ETL pipelines with mara (PyData Berlin September Meetup)
Lightweight ETL pipelines with mara (PyData Berlin September Meetup)
 
DN 2017 | Reducing pain in data engineering | Martin Loetzsch | Project A
DN 2017 | Reducing pain in data engineering | Martin Loetzsch | Project ADN 2017 | Reducing pain in data engineering | Martin Loetzsch | Project A
DN 2017 | Reducing pain in data engineering | Martin Loetzsch | Project A
 
CARTO en 5 Pasos: del Dato a la Toma de Decisiones [CARTO]
CARTO en 5 Pasos: del Dato a la Toma de Decisiones [CARTO]CARTO en 5 Pasos: del Dato a la Toma de Decisiones [CARTO]
CARTO en 5 Pasos: del Dato a la Toma de Decisiones [CARTO]
 
The Sum of our Parts: the Complete CARTO Journey [CARTO]
The Sum of our Parts: the Complete CARTO Journey [CARTO]The Sum of our Parts: the Complete CARTO Journey [CARTO]
The Sum of our Parts: the Complete CARTO Journey [CARTO]
 
Evolutionary db development
Evolutionary db development Evolutionary db development
Evolutionary db development
 
Best Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache SparkBest Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache Spark
 
MSBI and Data WareHouse techniques by Quontra
MSBI and Data WareHouse techniques by Quontra MSBI and Data WareHouse techniques by Quontra
MSBI and Data WareHouse techniques by Quontra
 
14 functional design
14 functional design14 functional design
14 functional design
 
Overview of business intelligence
Overview of business intelligenceOverview of business intelligence
Overview of business intelligence
 
IRMUK-SOA_for_MDM_DQ_Integration_DV_20min
IRMUK-SOA_for_MDM_DQ_Integration_DV_20minIRMUK-SOA_for_MDM_DQ_Integration_DV_20min
IRMUK-SOA_for_MDM_DQ_Integration_DV_20min
 
Basics of Microsoft Business Intelligence and Data Integration Techniques
Basics of Microsoft Business Intelligence and Data Integration TechniquesBasics of Microsoft Business Intelligence and Data Integration Techniques
Basics of Microsoft Business Intelligence and Data Integration Techniques
 
MaheshCV_Yepme
MaheshCV_YepmeMaheshCV_Yepme
MaheshCV_Yepme
 
Data Warehousing
Data WarehousingData Warehousing
Data Warehousing
 
Love Your Database Railsconf 2017
Love Your Database Railsconf 2017Love Your Database Railsconf 2017
Love Your Database Railsconf 2017
 
VSSML18. Data Transformations
VSSML18. Data TransformationsVSSML18. Data Transformations
VSSML18. Data Transformations
 
ITReady DW Day2
ITReady DW Day2ITReady DW Day2
ITReady DW Day2
 
Analyst View of Data Virtualization: Conversations with Boulder Business Inte...
Analyst View of Data Virtualization: Conversations with Boulder Business Inte...Analyst View of Data Virtualization: Conversations with Boulder Business Inte...
Analyst View of Data Virtualization: Conversations with Boulder Business Inte...
 
Databaseconcepts
DatabaseconceptsDatabaseconcepts
Databaseconcepts
 
What Are the Key Steps in Scraping Product Data from Amazon India.pptx
What Are the Key Steps in Scraping Product Data from Amazon India.pptxWhat Are the Key Steps in Scraping Product Data from Amazon India.pptx
What Are the Key Steps in Scraping Product Data from Amazon India.pptx
 

Recently uploaded

(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...Christo Ananth
 
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVHARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVRajaP95
 
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSMANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSSIVASHANKAR N
 
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130Suhani Kapoor
 
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Dr.Costas Sachpazis
 
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...RajaP95
 
UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and workingUNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and workingrknatarajan
 
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur EscortsCall Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
Porous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingPorous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingrakeshbaidya232001
 
Introduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxIntroduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxupamatechverse
 
UNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its PerformanceUNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its Performancesivaprakash250
 
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...Soham Mondal
 
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordCCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordAsst.prof M.Gokilavani
 
SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )Tsuyoshi Horigome
 

Recently uploaded (20)

Roadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and RoutesRoadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and Routes
 
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
 
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
 
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVHARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
 
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSMANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
 
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
 
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
 
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
 
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
 
UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and workingUNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
 
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur EscortsCall Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
 
Porous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingPorous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writing
 
Introduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxIntroduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptx
 
UNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its PerformanceUNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its Performance
 
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
 
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordCCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
 
SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )
 

Data infrastructure for the other 90% of companies

  • 1. @martin_loetzsch Dr. Martin Loetzsch Data Council Meetup Kickoff in Berlin May 2019 Data infrastructure for the other 90% of companies
  • 3. !3 Your are not Google @martin_loetzsch You are also not Amazon, LinkedIn, Uber etc. https://www.slideshare.net/zhenxiao/real-time-analytics-at-uber- strata-data-2019 https://blog.bradfieldcs.com/you-are-not-google-84912cf44afb
  • 5. All the data of the company in one place 
 Data is the single source of truth easy to access documented embedded into the organisation Integration of different domains
 
 
 
 
 
 
 
 
 
 
 Main challenges Consistency & correctness Changeability Complexity Transparency !5 Data warehouse = integrated data @martin_loetzsch Nowadays required for running a business application databases events csv files apis reporting crm marketing … search pricing DWH orders users products price 
 histories emails clicks … … operation
 events
  • 8. Warehouse adoption from 2016 to today 
 
 
 
 
 
 
 
 
 
 
 
 Based on three years of segment.io customer adoption (https://twitter.com/segment/status/1125891660800413697)
 BigQuery When you have terabytes of stuff For raw event storage & processing 
 Snowflake When you are not able to run a database yourself When you can’t write efficient queries Redshift Can’t recommend for ETL When OLAP performance becomes a problem ClickHouse For very fast OLAP !8 If in doubt, use PostgreSQL @martin_loetzsch Boring, battle-tested work horse
  • 10. Data pipelines as code SQL files, python & shell scripts Structure & content of data warehouse are result of running code 
 Easy to debug & inspect Develop locally, test on staging system, then deploy to production Functional data engineering Reproducability, idempotency, (immutability). Running a pipeline on the same inputs will always produce the same results. No side effects. https://medium.com/@maximebeauchemin/functional-data-engineering- a-modern-paradigm-for-batch-data-processing-2327ec32c42a !10 Make changing and testing things easy @martin_loetzsch Apply standard software engineering best practices
  • 11. from execute_sql import execute_sql_file, execute_sql_statment

 # create utility functions in PostgreSQL
 execute_sql_statment('DROP SCHEMA IF EXISTS util CASCADE; CREATE SCHEMA util;')
 execute_sql_statment('CREATE EXTENSION IF NOT EXISTS pg_trgm;')
 execute_sql_file('utils/indexes_and_constraints.sql', echo_queries=False)
 execute_sql_file('utils/schema_switching.sql', echo_queries=False)
 execute_sql_file('utils/consistency_checks.sql', echo_queries=False)

 # create tmp and dim_next schema
 execute_sql_statment('DROP SCHEMA IF EXISTS c_temp CASCADE; CREATE SCHEMA c_temp;')
 execute_sql_statment('DROP SCHEMA IF EXISTS dim_next CASCADE; CREATE SCHEMA dim_next;')

 # preprocess / combine entities
 execute_sql_file('contacts/preprocess_contacts_dates.sql')
 execute_sql_file('contacts/preprocess_contacts.sql')
 execute_sql_file('organisations/preprocess_organizations.sql')
 execute_sql_file('products/preprocess_products.sql')
 execute_sql_file('deals/preprocess_dealflow.sql')
 execute_sql_file('deals/preprocess_deal.sql')
 execute_sql_file('marketing/preprocess_website_visitors.sql')
 execute_sql_file('marketing/create_contacts_performance_attribution.sql')

 # create reporting schema, establish foreign keys, transfer aggregates
 execute_sql_file('contacts/transform_contacts.sql')
 execute_sql_file('organisations/transform_organizations.sql')
 execute_sql_file('organisations/create_org_data_set.sql')
 execute_sql_file('deals/transform_deal.sql')
 execute_sql_file('deals/flatten_deal_fact.sql')
 execute_sql_file('deals/create_deal_data_set.sql')
 execute_sql_file('marketing/transform_marketing_performance.sql')
 execute_sql_file('targets/preprocess_target.sql')
 execute_sql_file('targets/transform_target.sql')
 execute_sql_file('constrain_tables.sql')

 # Consistency checks
 execute_sql_file('consistency_checks.sql')

 # replace the current version of the reporting schema with the next
 execute_sql_statment("SELECT util.replace_schema('dim', 'dim_next')")

 !11 Simple scripts
 @martin_loetzsch
 SQL, python & bash
  • 12. !12 Schedule via Jenkins, Rundeck, etc.
 @martin_loetzsch
 Such a setup is usually enough for a lot of companies
  • 13. Works well
 For simple incremental transformations on large partitioned data sets
 For moving data

 Does not work so well
 When you have complex business logic / a lot of pipeline branching
 When you can't partition your data well

 Weird workarounds needed
 For finding out when something went wrong
 For running something again (need to manually delete task instances)
 For deploying things while pipelines are running
 For dynamic DAGs (decide at run-time what to run)

 Usually: BashOperator for everything
 High devops complexity

 !13 When you have actual big data: Airflow
 @martin_loetzsch
 Caveat: I know only very few teams that are very productive in using it
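 For readers who have not run Airflow before, here is a minimal sketch of the "BashOperator for everything" pattern mentioned above, assuming Airflow 2.x; the task ids, schedule and shell commands are made up for illustration:

 # A minimal sketch of the "BashOperator for everything" pattern, assuming Airflow 2.x.
 # Task ids, schedule and shell commands are hypothetical.
 from datetime import datetime

 from airflow import DAG
 from airflow.operators.bash import BashOperator

 with DAG(dag_id="dwh_etl",
          start_date=datetime(2019, 5, 1),
          schedule_interval="0 3 * * *",  # run every night at 3am
          catchup=False) as dag:

     load_orders = BashOperator(
         task_id="load_orders",
         bash_command="python app/pipelines/load_data/load_orders.py")

     transform_orders = BashOperator(
         task_id="transform_orders",
         bash_command="psql dwh --set ON_ERROR_STOP=on -f app/pipelines/orders/transform_order.sql")

     # declare the dependency between the two tasks
     load_orders >> transform_orders

 Each BashOperator only shells out to a script, so the actual ETL logic stays in plain SQL and Python files rather than in the scheduler.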
  • 14. Jinja-templated queries, DAGs created from dependencies

 with source as (
     select * from {{ var('ticket_tags_table') }}
 ),

 renamed as (
     select
         -- ids
         {{ dbt_utils.surrogate_key('_sdc_level_0_id', '_sdc_source_key_id') }} as tag_id,
         _sdc_source_key_id as ticket_id,

         -- fields
         nullif(lower(value), '') as ticket_tag_value
     from source
 )

 select * from renamed

 Additional meta data in YAML files

 version: 2
 models:
   - name: zendesk_organizations
     columns:
       - name: organization_id
         tests:
           - unique
           - not_null
   - name: zendesk_tickets
     columns:
       - name: ticket_id
         tests:
           - unique
           - not_null
   - name: zendesk_users
     columns:
       - name: user_id
         tests:
           - unique
           - not_null

 !14 Decent: DBT (data build tool)
 @martin_loetzsch
 Prerequisite: Data already in DB, everything can be done in SQL
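 When dbt is driven from an existing scheduler, the models and YAML tests above are typically executed through the dbt CLI; a minimal hedged sketch of such a pipeline step (project directory and target name are hypothetical):

 # Minimal sketch of running dbt models and their YAML-defined tests as a pipeline step.
 # The project directory and target name are hypothetical.
 import subprocess

 def run_dbt(command: str, project_dir: str = "dwh/dbt", target: str = "prod") -> None:
     """Runs a dbt command and fails the pipeline step on a non-zero exit code."""
     subprocess.run(["dbt", command, "--project-dir", project_dir, "--target", target],
                    check=True)

 run_dbt("run")   # builds the models (e.g. the renamed ticket tags above)
 run_dbt("test")  # executes the unique / not_null tests from the YAML files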
  • 15. !15 Mara: Things that worked for us @martin_loetzsch https://github.com/mara
  • 17. Runnable app Integrates PyPI project download stats with 
 Github repo events !17 Example: Python project stats data warehouse @martin_loetzsch https://github.com/mara/mara-example-project
  • 18. Example pipeline

 pipeline = Pipeline(
     id="pypi",
     description="Builds a PyPI downloads cube using the public ..")

 # ..

 pipeline.add(
     Task(id="transform_python_version", description='..',
          commands=[
              ExecuteSQL(sql_file_name="transform_python_version.sql")
          ]),
     upstreams=['read_download_counts'])

 pipeline.add(
     ParallelExecuteSQL(
         id="transform_download_counts", description="..",
         sql_statement="SELECT pypi_tmp.insert_download_counts(@chunk@::SMALLINT);",
         parameter_function=etl_tools.utils.chunk_parameter_function,
         parameter_placeholders=["@chunk@"],
         commands_before=[
             ExecuteSQL(sql_file_name="transform_download_counts.sql")
         ]),
     upstreams=["preprocess_project_version", "transform_installer",
                "transform_python_version"])

 !18 ETL pipelines as code
 @martin_loetzsch
 Pipeline = list of tasks with dependencies between them. Task = list of commands
  • 19. Target of computation 
 CREATE TABLE m_dim_next.region (
 region_id SMALLINT PRIMARY KEY,
 region_name TEXT NOT NULL UNIQUE,
 country_id SMALLINT NOT NULL,
 country_name TEXT NOT NULL,
 _region_name TEXT NOT NULL
 );
 
 Do computation and store result in table 
 WITH raw_region AS (SELECT DISTINCT country,
 region
 FROM m_data.ga_session
 ORDER BY country, region)
 
 INSERT INTO m_dim_next.region
 SELECT row_number() OVER (ORDER BY country, region) AS region_id,
        CASE WHEN (SELECT count(DISTINCT country)
                   FROM raw_region r2
                   WHERE r2.region = r1.region) > 1
             THEN region || ' / ' || country
             ELSE region END                    AS region_name,
        dense_rank() OVER (ORDER BY country)    AS country_id,
        country                                 AS country_name,
        region                                  AS _region_name
 FROM raw_region r1;
 INSERT INTO m_dim_next.region VALUES (-1, 'Unknown', -1, 'Unknown', 'Unknown');
 Speedup subsequent transformations 
 SELECT util.add_index(
     'm_dim_next', 'region',
     column_names := ARRAY ['_region_name', 'country_name', 'region_id']);

 SELECT util.add_index(
     'm_dim_next', 'region',
     column_names := ARRAY ['country_id', 'region_id']);
 
 ANALYZE m_dim_next.region;

 !19 PostgreSQL as a data processing engine
 @martin_loetzsch
 Leave data in DB, tables as (intermediate) results of processing steps
  • 20. Execute query 
 ExecuteSQL(sql_file_name="preprocess-ad.sql")

 cat app/data_integration/pipelines/facebook/preprocess-ad.sql \
   | PGTZ=Europe/Berlin PGOPTIONS=--client-min-messages=warning \
     psql --username=mloetzsch --host=localhost --echo-all \
          --no-psqlrc --set ON_ERROR_STOP=on kfz_dwh_etl
 Read file 
 ReadFile(file_name="country_iso_code.csv",
          compression=Compression.NONE,
          target_table="os_data.country_iso_code",
          mapper_script_file_name="read-country-iso-codes.py",
          delimiter_char=";")

 cat "dwh-data/country_iso_code.csv" \
   | .venv/bin/python3.6 "app/data_integration/pipelines/load_data/read-country-iso-codes.py" \
   | PGTZ=Europe/Berlin PGOPTIONS=--client-min-messages=warning \
     psql --username=mloetzsch --host=localhost --echo-all --no-psqlrc \
          --set ON_ERROR_STOP=on kfz_dwh_etl \
          --command="COPY os_data.country_iso_code FROM STDIN WITH CSV DELIMITER AS ';'"
 Copy from other databases 
 Copy(sql_file_name="pdm/load-product.sql", source_db_alias="pdm",
      target_table="os_data.product",
      replace={"@@db@@": "K24Pdm", "@@dbschema@@": "ps",
               "@@client@@": "kfzteile24 GmbH"})

 cat app/data_integration/pipelines/load_data/pdm/load-product.sql \
   | sed "s/@@db@@/K24Pdm/g;s/@@dbschema@@/ps/g;s/@@client@@/kfzteile24 GmbH/g" \
   | sed 's/$/$/g;s/$/$/g' \
   | (cat && echo ';') | (cat && echo '; go') \
   | sqsh -U ***** -P ******* -S ******* -D K24Pdm -m csv \
   | PGTZ=Europe/Berlin PGOPTIONS=--client-min-messages=warning \
     psql --username=mloetzsch --host=localhost --echo-all --no-psqlrc \
          --set ON_ERROR_STOP=on kfz_dwh_etl \
          --command="COPY os_data.product FROM STDIN WITH CSV HEADER"

 !20 Shell commands as interface to data & DBs
 @martin_loetzsch
 Nothing is faster than a unix pipe
  • 21. Read a set of files 
 pipeline.add(
     ParallelReadFile(
         id="read_download",
         description="Loads PyPI downloads from pre_downloaded csv files",
         file_pattern="*/*/*/pypi/downloads-v1.csv.gz",
         read_mode=ReadMode.ONLY_NEW,
         compression=Compression.GZIP,
         target_table="pypi_data.download",
         delimiter_char="\t", skip_header=True, csv_format=True,
         file_dependencies=read_download_file_dependencies,
         date_regex="^(?P<year>\d{4})/(?P<month>\d{2})/(?P<day>\d{2})/",
         partition_target_table_by_day_id=True,
         timezone="UTC",
         commands_before=[
             ExecuteSQL(
                 sql_file_name="create_download_data_table.sql",
                 file_dependencies=read_download_file_dependencies)
         ]))

 Split large joins into chunks

 pipeline.add(
     ParallelExecuteSQL(
         id="transform_download",
         description="Maps downloads to their dimensions",
         sql_statement="SELECT pypi_tmp.insert_download(@chunk@::SMALLINT);",
         parameter_function=etl_tools.utils.chunk_parameter_function,
         parameter_placeholders=["@chunk@"],
         commands_before=[
             ExecuteSQL(sql_file_name="transform_download.sql")
         ]),
     upstreams=["preprocess_project_version", "transform_installer"])

 !21 Incremental & parallel processing
 @martin_loetzsch
 You can't join all clicks with all customers at once
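 The chunk_parameter_function above comes from the author's etl_tools package and its implementation is not shown on the slide; a plausible sketch of the underlying idea is below, where each parallel task receives one chunk id and the SQL side only processes rows whose key hashes to that chunk:

 # Hedged sketch of a chunking helper (not the actual etl_tools implementation):
 # each parallel task gets one chunk id, and the SQL side restricts itself to rows
 # whose key hashes to that chunk, e.g. WHERE abs(user_id) % 16 = @chunk@.
 def chunk_parameter_function(number_of_chunks: int = 16) -> list:
     """Returns one parameter tuple per chunk, substituted for @chunk@ in the SQL statement."""
     return [(chunk,) for chunk in range(number_of_chunks)]

 print(chunk_parameter_function(4))  # [(0,), (1,), (2,), (3,)]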
  • 23. Always: At least two out of
 VPN
 IP restriction
 SSH tunnel
 SSH keys
 Single sign-on

 Don't rely on passwords, they tend to be shared and are not changed when someone leaves the company

 !23 Please don't put your data on the internet
 @martin_loetzsch
 Also: GDPR
  • 24. Authenticates each incoming request against an auth provider

 We run it in front of all web interfaces
 Including Tableau, Jenkins, Metabase
 Many auth providers: Google, Azure AD, etc.
 Much better than passwords!

 !24 SSO with oauth2_proxy
 @martin_loetzsch
 https://github.com/pusher/oauth2_proxy
  • 25. Implemented using the auth_request directive in nginx

 server {
     listen 443 default ssl;
     include ssl.conf;

     location /oauth2/ {
         proxy_pass http://127.0.0.1:4180;
         proxy_set_header Host $host;
         proxy_set_header X-Real-IP $remote_addr;
         proxy_set_header X-Scheme $scheme;
         proxy_set_header X-Forwarded-Proto $scheme;
         proxy_set_header X-Auth-Request-Redirect $request_uri;
     }

     location = /oauth2/auth {
         proxy_pass http://127.0.0.1:4180;
         proxy_set_header Host $host;
         proxy_set_header X-Real-IP $remote_addr;
         proxy_set_header X-Scheme $scheme;
         proxy_set_header X-Forwarded-Proto $scheme;
         # nginx auth_request includes headers but not body
         proxy_set_header Content-Length "";
         proxy_pass_request_body off;
     }

     location / {
         auth_request /oauth2/auth;
         error_page 401 = /oauth2/sign_in;

         auth_request_set $email $upstream_http_x_auth_request_email;
         proxy_set_header X-Forwarded-Email $email;

         auth_request_set $auth_cookie $upstream_http_set_cookie;
         add_header Set-Cookie $auth_cookie;

         proxy_set_header Host $host;
         proxy_set_header X-Scheme $scheme;
         proxy_set_header X-Real-IP $remote_addr;
         proxy_set_header X-Forwarded-Proto $scheme;

         proxy_send_timeout 600;
         proxy_read_timeout 600;
         proxy_buffering off;
         send_timeout 600;

         # pass to downstreams
         proxy_pass http://127.0.0.1:81;
     }

     access_log /bi/logs/nginx/access-443-default.log;
     error_log /bi/logs/nginx/error-443-default.log;
 }

 !25 SSO with oauth2_proxy II
 @martin_loetzsch
 https://github.com/pusher/oauth2_proxy
  • 27. Ad Blockers

 Blind Spots
 Not all user interactions happen online (returns, call center requests)
 Some data can not be leaked to the pixel: segments, prices, backend logic

 Solution: server side tracking
 Collect events on the server rather than on the client

 Use cases & advantages
 Ground truth: correct metrics recorded in marketing tools
 Price: provide a cheaper alternative to GA Premium / Segment etc.
 GDPR compliance: own data, avoid black boxes
 Product analytics: understand bottlenecks in the user journey
 Unified user journey: combine events from multiple touchpoints
 Site speed: front-ends are not slowed down by analytics pixels
 SEO: measure site indexing by search engines

 !27 Pixel based tracking is dead
 @martin_loetzsch
 Server side tracking: surprisingly easy
  • 28. !28 Works: Kinesis + AWS Lambda + S3 + BigQuery
 @martin_loetzsch
 Technologies don't matter: would also work with Google Cloud & Azure

 User → Project A Website → AWS Kinesis Firehose → AWS Lambda → Amazon S3 / Google BigQuery
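 The following slides show a PHP implementation for the Kirby-based website; for a Python backend, the equivalent step of pushing a tracking event onto the Firehose delivery stream might look roughly like this (stream name and event fields are illustrative, boto3 credentials are assumed to come from the environment):

 # Hedged sketch: sending a tracking event to a Kinesis Firehose delivery stream
 # from a Python backend. Stream name and event fields are illustrative.
 import json
 from datetime import datetime, timezone

 import boto3

 firehose = boto3.client("firehose")  # credentials come from the environment / IAM role

 event = {
     "visitor_id": "10d1fa9c9fd39cff44c88bd551b1ab4dfe92b3da",
     "session_id": "9tv1phlqkl5kchajmb9k2j2434",
     "timestamp": datetime.now(timezone.utc).isoformat(),
     "url": "https://www.example.com/some/page",
     "referrer": None,
     "language": "en",
 }

 firehose.put_record(
     DeliveryStreamName="kinesis-firehose-stream-123",  # hypothetical stream name
     Record={"Data": (json.dumps(event) + "\n").encode("utf-8")})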
  • 29. Everything the backend knows about the user, the context, the content

 {
   "visitor_id": "10d1fa9c9fd39cff44c88bd551b1ab4dfe92b3da",
   "session_id": "9tv1phlqkl5kchajmb9k2j2434",
   "timestamp": "2018-12-16T16:03:04+00:00",
   "ip": "92.195.48.163",
   "url": "https://www.project-a.com/en/career/jobs/data-engineer-data-scientist-m-f-d-4072020002?gh_jid=4082751002&gh_src=9fcd30182&&utm_medium=social&u",
   "host": "www.project-a.com",
   "path": ["en", "career", "jobs", "data-engineer-data-scientist-m-f-d-4072020002"],
   "query": {
     "gh_jid": "4082751002",
     "gh_src": "9fcd30182",
     "utm_medium": "social",
     "utm_source": "linkedin",
     "utm_campaign": "buffer"
   },
   "referrer": null,
   "language": "en",
   "ua": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36"
 }

 !29 Backend collects user interaction data
 @martin_loetzsch
 And sends it asynchronously to a queue
  • 30. Kirby tracking plugin for Project A web site

 <?php
 require 'vendor/autoload.php';

 // this cookie is set when not present
 $cookieName = 'visitor';

 // retrieve visitor id from cookie
 $visitorId = array_key_exists($cookieName, $_COOKIE) ? $_COOKIE[$cookieName] : null;

 if (!$visitorId) {
     // visitor cookie not set. Use session id as visitor ID
     $visitorId = sha1(uniqid(s::id(), true));
     setcookie($cookieName, $visitorId, time() + (2 * 365 * 24 * 60 * 60),
               '/', 'project-a.com', false, true);
 }

 $request = kirby()->request();

 // the payload to log
 $event = [
     'visitor_id' => $visitorId,
     'session_id' => s::id(),
     'timestamp'  => date('c'),
     'ip'         => $request->__debuginfo()['ip'],
     'url'        => $request->url(),
     'host'       => $_SERVER['SERVER_NAME'],
     'path'       => $request->path()->toArray(),
     'query'      => $request->query()->toArray(),
     'referrer'   => visitor::referer(),
     'language'   => visitor::acceptedLanguageCode(),
     'ua'         => visitor::userAgent()
 ];

 $firehoseClient = new Aws\Firehose\FirehoseClient([ /* secrets */ ]);

 // publish message to firehose delivery stream
 $promise = $firehoseClient->putRecordAsync([
     'DeliveryStreamName' => 'kinesis-firehose-stream-123',
     'Record' => ['Data' => json_encode($event)]
 ]);

 register_shutdown_function(function () use ($promise) {
     $promise->wait();
 });

 !30 Custom implementation for each web site
 @martin_loetzsch
 Good news: there is usually already code that populates the data layer
  • 31. Result: enhanced with GEO information & device detection

 {
   "visitor_id": "10d1fa9c9fd39cff44c88bd551b1ab4dfe92b3da",
   "session_id": "9tv1phlqkl5kchajmb9k2j2434",
   "timestamp": "2018-12-16T16:03:04+00:00",
   "url": "https://www.project-a.com/en/career/jobs/data-engineer-data-scientist-m-f-d-4072020002?gh_jid=4082751002&gh_src=9fcd30182&&utm_medium=social&utm_source=linkedin&utm_campaign=buffer",
   "host": "www.project-a.com",
   "path": ["en", "career", "jobs", "data-engineer-data-scientist-m-f-d-4072020002"],
   "query": [
     {"param": "gh_jid", "value": "4082751002"},
     {"param": "gh_src", "value": "9fcd30182"},
     {"param": "utm_medium", "value": "social"},
     {"param": "utm_source", "value": "linkedin"},
     {"param": "utm_campaign", "value": "buffer"}
   ],
   "referrer": null,
   "language": "en",
   "browser_family": "Chrome",
   "browser_version": "70",
   "os_family": "Mac OS X",
   "os_version": "10",
   "device_brand": null,
   "device_model": null,
   "country_iso_code": "DE",
   "country_name": "Germany",
   "subdivisions_iso_code": "SN",
   "subdivisions_name": "Saxony",
   "city_name": "Dresden"
 }

 !31 Lambda function transforms and stores events
 @martin_loetzsch
 GDPR: remove IP address, user agent
  • 32. Lambda function part I

 import base64
 import functools
 import json

 import geoip2.database
 from google.cloud import bigquery
 from ua_parser import user_agent_parser

 @functools.lru_cache(maxsize=None)
 def get_geo_db():
     return geoip2.database.Reader('./GeoLite2-City_20181002/GeoLite2-City.mmdb')

 def extract_geo_data(ip):
     """Does a geo lookup for an IP address"""
     response = get_geo_db().city(ip)
     return {
         'country_iso_code': response.country.iso_code,
         'country_name': response.country.name,
         'subdivisions_iso_code': response.subdivisions.most_specific.iso_code,
         'subdivisions_name': response.subdivisions.most_specific.name,
         'city_name': response.city.name
     }

 def parse_user_agent(user_agent_string):
     """Extracts browser, OS and device information from a user agent"""
     result = user_agent_parser.Parse(user_agent_string)
     return {
         'browser_family': result['user_agent']['family'],
         'browser_version': result['user_agent']['major'],
         'os_family': result['os']['family'],
         'os_version': result['os']['major'],
         'device_brand': result['device']['brand'],
         'device_model': result['device']['model']
     }

 !32 Geo Lookup & device detection
 @martin_loetzsch
 Solved problem, very good open source libraries exist
  • 33. Lambda function part II

 def lambda_handler(event, context):
     lambda_output_records = []
     rows_for_biguery = []

     bq_client = bigquery.Client.from_service_account_json(
         'big-query-credendials.json')

     for record in event['records']:
         message = json.loads(base64.b64decode(record['data']))

         # extract browser, device, os
         if message['ua']:
             message.update(parse_user_agent(message['ua']))
         del message['ua']

         # geo lookup for ip address
         message.update(extract_geo_data(message['ip']))
         del message['ip']

         # update get parameters
         if message['query']:
             message['query'] = [
                 {'param': param, 'value': value}
                 for param, value in message['query'].items()
             ]

         rows_for_biguery.append(message)

         lambda_output_records.append({
             'recordId': record['recordId'],
             'result': 'Ok',
             'data': base64.b64encode(
                 json.dumps(message).encode('utf-8')).decode('utf-8')
         })

     errors = bq_client.insert_rows(
         bq_client.get_table(
             bq_client.dataset('server_side_tracking')
                 .table('project_a_website_events')),
         rows_for_biguery)

     if errors != []:
         raise Exception(json.dumps(errors))

     return {"statusCode": 200, "records": lambda_output_records}

 !33 Send events to BigQuery & S3
 @martin_loetzsch
 Also possible: send to Google Analytics, Mixpanel, Segment, Heap etc.
  • 35. !35 Bonus track: consistency & correctness @martin_loetzsch
  • 36. It’s easy to make mistakes during ETL 
 DROP SCHEMA IF EXISTS s CASCADE; CREATE SCHEMA s;
 
 CREATE TABLE s.city ( city_id SMALLINT, city_name TEXT, country_name TEXT );
 INSERT INTO s.city VALUES (1, 'Berlin', 'Germany'), (2, 'Budapest', 'Hungary');
 
 CREATE TABLE s.customer ( customer_id BIGINT, city_fk SMALLINT );
 INSERT INTO s.customer VALUES (1, 1), (1, 2), (2, 3);
 Customers per country? 
 SELECT country_name, count(*) AS number_of_customers
 FROM s.customer
 JOIN s.city ON customer.city_fk = s.city.city_id
 GROUP BY country_name;
 
 Back up all assumptions about data by constraints 
 ALTER TABLE s.city ADD PRIMARY KEY (city_id);
 ALTER TABLE s.city ADD UNIQUE (city_name);
 ALTER TABLE s.city ADD UNIQUE (city_name, country_name);

 ALTER TABLE s.customer ADD PRIMARY KEY (customer_id);
 [23505] ERROR: could not create unique index "customer_pkey"
 Detail: Key (customer_id)=(1) is duplicated.

 ALTER TABLE s.customer ADD FOREIGN KEY (city_fk) REFERENCES s.city (city_id);
 [23503] ERROR: insert or update on table "customer" violates foreign key constraint "customer_city_fk_fkey"
 Detail: Key (city_fk)=(3) is not present in table "city"

 !36 Referential consistency
 @martin_loetzsch
 Only very little overhead, will save your ass
  • 37. (schema diagram, 2017-10-18-dwh-schema-pav.svg: customer (customer_id, first_order_fk, favourite_product_fk, lifetime_revenue),
 product (product_id, revenue_last_6_months),
 order (order_id, processed_order_id, customer_fk, product_fk, revenue))

 Never repeat "business logic"

 SELECT sum(total_price) AS revenue
 FROM os_data.order
 WHERE status IN ('pending', 'accepted', 'completed', 'proposal_for_change');

 SELECT CASE WHEN (status <> 'started'
                   AND payment_status = 'authorised'
                   AND order_type <> 'backend')
             THEN o.order_id END AS processed_order_fk
 FROM os_data.order;

 SELECT (last_status = 'pending') :: INTEGER AS is_unprocessed
 FROM os_data.order;

 Refactor pipeline
 Create separate task that computes everything we know about an order
 Usually difficult in real life

 Load → preprocess → transform → flatten-fact
 (load-product, load-order, load-customer → preprocess-product, preprocess-order, preprocess-customer
 → transform-product, transform-order, transform-customer → flatten-product-fact, flatten-order-fact, flatten-customer-fact)

 !37 Computational consistency
 @martin_loetzsch
 Requires discipline
  • 38. CREATE FUNCTION m_tmp.normalize_utm_source(TEXT)
   RETURNS TEXT AS $$
 SELECT CASE
          WHEN $1 LIKE '%.%' THEN lower($1)
          WHEN $1 = '(direct)' THEN 'Direct'
          WHEN $1 LIKE 'Untracked%' OR $1 LIKE '(%)' THEN $1
          ELSE initcap($1) END;
 $$ LANGUAGE SQL IMMUTABLE;

 CREATE FUNCTION util.norm_phone_number(phone_number TEXT)
   RETURNS TEXT AS $$
 BEGIN
   phone_number := TRIM(phone_number);
   phone_number := regexp_replace(phone_number, '\(0\)', '');
   phone_number := regexp_replace(phone_number, '[^[:digit:]]', '', 'g');
   phone_number := regexp_replace(phone_number, '^(\+49|0049|49)', '0');
   phone_number := regexp_replace(phone_number, '^(00)', '');
   phone_number := COALESCE(phone_number, '');
   RETURN phone_number;
 END;
 $$ LANGUAGE PLPGSQL IMMUTABLE;

 CREATE FUNCTION m_tmp.compute_ad_id(id BIGINT, api m_tmp.API)
   RETURNS BIGINT AS $$
 -- creates a collision free ad id from an id in a source system
 SELECT ((CASE api
            WHEN 'adwords' THEN 1
            WHEN 'bing' THEN 2
            WHEN 'criteo' THEN 3
            WHEN 'facebook' THEN 4
            WHEN 'backend' THEN 5
          END) * 10 ^ 18) :: BIGINT + id
 $$ LANGUAGE SQL IMMUTABLE;

 CREATE FUNCTION pv.date_to_supplier_period_start(INTEGER)
   RETURNS INTEGER AS $$
 -- this maps all dates to either an integer which is included
 -- in lieferantenrabatt.period_start or
 -- null (meaning we don't have a lieferantenrabatt for it)
 SELECT CASE
          WHEN $1 >= 20170501 THEN 20170501
          WHEN $1 >= 20151231 THEN 20151231
          ELSE 20151231 END;
 $$ LANGUAGE SQL IMMUTABLE;

 !38 When not possible: use functions
 @martin_loetzsch
 Almost no performance overhead
  • 39. Check for "lost" rows

 SELECT util.assert_equal(
     'The order items fact table should contain all order items',
     'SELECT count(*) FROM os_dim.order_item',
     'SELECT count(*) FROM os_dim.order_items_fact');

 Check consistency across cubes / domains

 SELECT util.assert_almost_equal(
     'The number of first orders should be the same in orders and marketing touchpoints cube',
     'SELECT count(net_order_id) FROM os_dim.order WHERE _net_order_rank = 1;',
     'SELECT (SELECT sum(number_of_first_net_orders) FROM m_dim.acquisition_performance)
             / (SELECT count(*) FROM m_dim.performance_attribution_model)',
     1.0);

 Check completeness of source data

 SELECT util.assert_not_found(
     'Each adwords campaign must have the attribute "Channel"',
     'SELECT DISTINCT campaign_name, account_name
      FROM aw_tmp.ad
      JOIN aw_dim.ad_performance ON ad_fk = ad_id
      WHERE attributes->>''Channel'' IS NULL
        AND impressions > 0
        AND _date > now() - INTERVAL ''30 days''');

 Check correctness of redistribution transformations

 SELECT util.assert_almost_equal_relative(
     'The cost of non-converting touchpoints must match the redistributed customer acquisition and reactivation cost',
     'SELECT sum(cost) FROM m_tmp.cost_of_non_converting_touchpoints;',
     'SELECT (SELECT sum(cost_per_touchpoint * number_of_touchpoints)
              FROM m_tmp.redistributed_customer_acquisition_cost)
             + (SELECT sum(cost_per_touchpoint * number_of_touchpoints)
                FROM m_tmp.redistributed_customer_reactivation_cost);',
     0.00001);

 !39 Data consistency checks
 @martin_loetzsch
 Makes changing things easy
  • 40. Execute queries and compare results
 CREATE FUNCTION util.assert(description TEXT, query TEXT)
   RETURNS BOOLEAN AS $$
 DECLARE
   succeeded BOOLEAN;
 BEGIN
   EXECUTE query INTO succeeded;
   IF NOT succeeded THEN
     RAISE EXCEPTION 'assertion failed: # % # %', description, query;
   END IF;
   RETURN succeeded;
 END
 $$ LANGUAGE 'plpgsql';

 CREATE FUNCTION util.assert_almost_equal_relative(
   description TEXT, query1 TEXT, query2 TEXT, percentage DECIMAL)
   RETURNS BOOLEAN AS $$
 DECLARE
   result1   NUMERIC;
   result2   NUMERIC;
   succeeded BOOLEAN;
 BEGIN
   EXECUTE query1 INTO result1;
   EXECUTE query2 INTO result2;
   EXECUTE 'SELECT abs(' || result2 || ' - ' || result1 || ') / ' || result1
           || ' < ' || percentage INTO succeeded;
   IF NOT succeeded THEN
     RAISE WARNING '% assertion failed: abs(% - %) / % < % %: (%) %: (%)',
       description, result2, result1, result1, percentage, result1, query1, result2, query2;
   END IF;
   RETURN succeeded;
 END
 $$ LANGUAGE 'plpgsql';

 !40 Consistency check functions
 @martin_loetzsch
 Also: assert_not_found, assert_equal_table, assert_smaller_than_or_equal
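 In a pipeline like the ones shown earlier, such checks can simply run as another task so that a failing assertion aborts the run; a hedged sketch using psycopg2 (connection string and the particular check are illustrative):

 # Hedged sketch: running the consistency checks as a pipeline step. A failing
 # util.assert* function raises an exception inside PostgreSQL, which surfaces
 # here as an error and fails the ETL run. Connection string and check are illustrative.
 import psycopg2

 def run_consistency_checks(dsn: str = "dbname=dwh") -> None:
     with psycopg2.connect(dsn) as connection:
         with connection.cursor() as cursor:
             cursor.execute("""
                 SELECT util.assert_equal(
                     'The order items fact table should contain all order items',
                     'SELECT count(*) FROM os_dim.order_item',
                     'SELECT count(*) FROM os_dim.order_items_fact');""")

 if __name__ == "__main__":
     run_consistency_checks()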
  • 41. Yes, unit tests

 SELECT util.assert_value_equal('test_german_number_with_country_prefix', util.norm_phone_number('00491234'), '01234');
 SELECT util.assert_value_equal('test_german_number_not_to_be_confused_with_country_prefix', util.norm_phone_number('0491234'), '0491234');
 SELECT util.assert_value_equal('test_non_german_number_with_plus', util.norm_phone_number('+44 1234'), '441234');
 SELECT util.assert_value_equal('test_german_number_with_prefix_and_additional_zero', util.norm_phone_number('+49 (0)1234'), '01234');
 SELECT util.assert_value_equal('test__trim', util.norm_phone_number(' 0491234 '), '0491234');
 SELECT util.assert_value_equal('test_number_with_leading_wildcard_symbol', util.norm_phone_number('*+436504834933'), '436504834933');
 SELECT util.assert_value_equal('test_NULL', util.norm_phone_number(NULL), '');
 SELECT util.assert_value_equal('test_empty', util.norm_phone_number(''), '');
 SELECT util.assert_value_equal('test_wildcard_only', util.norm_phone_number('*'), '');
 SELECT util.assert_value_equal('test_foreign_number_with_two_leading_zeroes', util.norm_phone_number('*00436769553701'), '436769553701');
 SELECT util.assert_value_equal('test_domestic_number_with_trailing_letters', util.norm_phone_number('017678402HORST'), '017678402');
 SELECT util.assert_value_equal('test_domestic_number_with_leading_letters', util.norm_phone_number('HORST017678402'), '017678402');
 SELECT util.assert_value_equal('test_domestic_number_with_letters_in_between', util.norm_phone_number('0H1O7R6S7T8402'), '017678402');
 SELECT util.assert_value_equal('test_german_number_with_country_prefix_and_leading_letters', util.norm_phone_number('HORST00491234'), '01234');
 SELECT util.assert_value_equal('test_german_number_not_to_be_confused_with_country_prefix_and_leading_letters', util.norm_phone_number('HORST0491234'), '0491234');
 SELECT util.assert_value_equal('test_non_german_number_with_plus_and_leading_letters', util.norm_phone_number('HORST+44 1234'), '441234');
 SELECT util.assert_value_equal('test_german_number_with_prefix_and_additional_zero_and_leading_letters', util.norm_phone_number('HORST+49 (0)1234'), '01234');

 !41 Unit tests
 @martin_loetzsch
 People enter horrible telephone numbers into websites
  • 42. Contribution margin 3a

 SELECT
   order_item_id,
   ((((((COALESCE(item_net_price, 0)::REAL + COALESCE(net_shipping_revenue, 0)::REAL)
        - ((COALESCE(item_net_purchase_price, 0)::REAL + COALESCE(alcohol_tax, 0)::REAL) + COALESCE(import_tax, 0)::REAL))
       - (COALESCE(net_fulfillment_costs, 0)::REAL + COALESCE(net_payment_costs, 0)::REAL))
      - COALESCE(net_return_costs, 0)::REAL)
     - ((COALESCE(item_net_price, 0)::REAL + COALESCE(net_shipping_revenue, 0)::REAL)
        - ((((COALESCE(item_net_price, 0)::REAL + COALESCE(item_tax_amount, 0)::REAL) + COALESCE(gross_shipping_revenue, 0)::REAL)
            - COALESCE(voucher_gross_amount, 0)::REAL)
           * (1 - ((COALESCE(item_tax_amount, 0)::REAL + (COALESCE(gross_shipping_revenue, 0)::REAL - COALESCE(net_shipping_revenue, 0)::REAL))
                   / NULLIF(((COALESCE(item_net_price, 0)::REAL + COALESCE(item_tax_amount, 0)::REAL) + COALESCE(gross_shipping_revenue, 0)::REAL), 0))))))
    - COALESCE(goodie_cost_per_item, 0)::REAL) :: DOUBLE PRECISION AS "Contribution margin 3a"
 FROM dim.sales_fact;

 Use schemas between reporting and database
 Mondrian
 LookML
 your own

 Or: Pre-compute metrics in database

 !42 Semantic consistency
 @martin_loetzsch
 Changing the meaning of metrics across all dashboards needs to be easy