@martin_loetzsch
Dr. Martin Loetzsch
Data Council Meetup Kickoff in Berlin
May 2019
Data infrastructure
for the other 90% of companies
!2
Which technology?
@martin_loetzsch
!3
You are not Google
@martin_loetzsch
You are also not Amazon, LinkedIn, Uber etc.
https://www.slideshare.net/zhenxiao/real-time-analytics-at-uber-strata-data-2019
https://blog.bradfieldcs.com/you-are-not-google-84912cf44afb
!4
Hipster technologies
@martin_loetzsch
Say no sometimes
All the data of the company in one place


Data is
the single source of truth
easy to access
documented
embedded into the organisation
Integration of different domains
Main challenges
Consistency & correctness
Changeability
Complexity
Transparency
!5
Data warehouse = integrated data
@martin_loetzsch
Nowadays required for running a business
[Diagram: source systems (application databases, events, csv files, apis, crm, marketing, operation events, …) are integrated into the DWH (orders, users, products, price histories, emails, clicks, …), which serves reporting, search, pricing, …]
!6
Data Engineering
@martin_loetzsch
!7
Picking a database
@martin_loetzsch
Warehouse adoption from 2016 to today
Based on three years of segment.io customer adoption
(https://twitter.com/segment/status/1125891660800413697)

BigQuery
When you have terabytes of stuff
For raw event storage & processing


Snowflake
When you are not able to run a database yourself
When you can’t write efficient queries
Redshift
Can’t recommend for ETL
When OLAP performance becomes a problem
ClickHouse
For very fast OLAP
!8
If in doubt, use PostgreSQL
@martin_loetzsch
Boring, battle-tested work horse
!9
ETL/ Workflows
@martin_loetzsch
Data pipelines as code
SQL files, python & shell scripts
Structure & content of the data warehouse are the result of running code

Easy to debug & inspect
Develop locally, test on staging system, then deploy to production
Functional data engineering
Reproducibility, idempotency, (immutability).
Running a pipeline on the same inputs will always produce the same results. No side effects.
https://medium.com/@maximebeauchemin/functional-data-engineering-a-modern-paradigm-for-batch-data-processing-2327ec32c42a
!10
Make changing and testing things easy
@martin_loetzsch
Apply standard software engineering best practices
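To make the "no side effects" point concrete, here is a minimal sketch of an idempotent transformation step (a sketch only: psycopg2 as the driver and the schema, table and column names are assumptions made up for illustration). Re-running it on the same inputs always yields the same table, because the output is rebuilt from scratch instead of appended to.

import psycopg2  # assumption: the warehouse is PostgreSQL, as recommended above

def rebuild_customer_dimension(dsn: str):
    """Idempotent step: drops and recreates its output table on every run, touches nothing else"""
    with psycopg2.connect(dsn) as connection, connection.cursor() as cursor:
        cursor.execute('DROP TABLE IF EXISTS dim_next.customer;')
        cursor.execute("""
            CREATE TABLE dim_next.customer AS
            SELECT customer_id, count(*) AS number_of_orders
            FROM os_data.orders
            GROUP BY customer_id;""")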
from execute_sql import execute_sql_file, execute_sql_statment

# create utility functions in PostgreSQL
execute_sql_statment('DROP SCHEMA IF EXISTS util CASCADE; CREATE SCHEMA util;')
execute_sql_statment('CREATE EXTENSION IF NOT EXISTS pg_trgm;')
execute_sql_file('utils/indexes_and_constraints.sql', echo_queries=False)
execute_sql_file('utils/schema_switching.sql', echo_queries=False)
execute_sql_file('utils/consistency_checks.sql', echo_queries=False)

# create tmp and dim_next schema
execute_sql_statment('DROP SCHEMA IF EXISTS c_temp CASCADE; CREATE SCHEMA c_temp;')
execute_sql_statment('DROP SCHEMA IF EXISTS dim_next CASCADE; CREATE SCHEMA dim_next;')

# preprocess / combine entities
execute_sql_file('contacts/preprocess_contacts_dates.sql')
execute_sql_file('contacts/preprocess_contacts.sql')
execute_sql_file('organisations/preprocess_organizations.sql')
execute_sql_file('products/preprocess_products.sql')
execute_sql_file('deals/preprocess_dealflow.sql')
execute_sql_file('deals/preprocess_deal.sql')
execute_sql_file('marketing/preprocess_website_visitors.sql')
execute_sql_file('marketing/create_contacts_performance_attribution.sql')

# create reporting schema, establish foreign keys, transfer aggregates
execute_sql_file('contacts/transform_contacts.sql')
execute_sql_file('organisations/transform_organizations.sql')
execute_sql_file('organisations/create_org_data_set.sql')
execute_sql_file('deals/transform_deal.sql')
execute_sql_file('deals/flatten_deal_fact.sql')
execute_sql_file('deals/create_deal_data_set.sql')
execute_sql_file('marketing/transform_marketing_performance.sql')
execute_sql_file('targets/preprocess_target.sql')
execute_sql_file('targets/transform_target.sql')
execute_sql_file('constrain_tables.sql')

# consistency checks
execute_sql_file('consistency_checks.sql')

# replace the current version of the reporting schema with the next
execute_sql_statment("SELECT util.replace_schema('dim', 'dim_next')")
!11
Simple scripts
@martin_loetzsch
SQL, python & bash
!12
Schedule via Jenkins, Rundeck, etc.
@martin_loetzsch
Such a setup is usually enough for a lot of companies
Works well
For simple incremental transformations on large partitioned data sets
For moving data

Does not work so well
When you have complex business logic / a lot of pipeline branching
When you can’t partition your data well
Weird workarounds needed
For finding out when something went wrong
For running something again (need to manually delete task instances)
For deploying things while pipelines are running
For dynamic DAGs (decide at run time what to run)



Usually: BashOperator for everything
High devops complexity
!13
When you have actual big data: Airflow
@martin_loetzsch
Caveat: I know only very few teams that are very productive with it
Jinja-templated queries, DAGs created from dependencies
with source as (
    select * from {{var('ticket_tags_table')}}
),

renamed as (
    select
        --ids
        {{dbt_utils.surrogate_key('_sdc_level_0_id', '_sdc_source_key_id')}} as tag_id,
        _sdc_source_key_id as ticket_id,

        --fields
        nullif(lower(value), '') as ticket_tag_value
    from source
)

select * from renamed

Additional metadata in YAML files

version: 2

models:
  - name: zendesk_organizations
    columns:
      - name: organization_id
        tests:
          - unique
          - not_null
  - name: zendesk_tickets
    columns:
      - name: ticket_id
        tests:
          - unique
          - not_null
  - name: zendesk_users
    columns:
      - name: user_id
        tests:
          - unique
          - not_null
!14
Decent: DBT (data build tool)
@martin_loetzsch
Prerequisite: Data already in DB, everything can be done in SQL
!15
Mara: Things that worked for us
@martin_loetzsch
https://github.com/mara
!16
Mara
@martin_loetzsch
Runnable app
Integrates PyPI project download stats with GitHub repo events
!17
Example: Python project stats data warehouse
@martin_loetzsch
https://github.com/mara/mara-example-project
Example pipeline
pipeline = Pipeline(
    id="pypi",
    description="Builds a PyPI downloads cube using the public ..")

# ..

pipeline.add(
    Task(id="transform_python_version", description='..',
         commands=[
             ExecuteSQL(sql_file_name="transform_python_version.sql")
         ]),
    upstreams=['read_download_counts'])

pipeline.add(
    ParallelExecuteSQL(
        id="transform_download_counts", description="..",
        sql_statement="SELECT pypi_tmp.insert_download_counts(@chunk@::SMALLINT);",
        parameter_function=etl_tools.utils.chunk_parameter_function,
        parameter_placeholders=["@chunk@"],
        commands_before=[
            ExecuteSQL(sql_file_name="transform_download_counts.sql")
        ]),
    upstreams=["preprocess_project_version", "transform_installer",
               "transform_python_version"])
!18
ETL pipelines as code
@martin_loetzsch
Pipeline = list of tasks with dependencies between them. Task = list of commands
Target of computation

CREATE TABLE m_dim_next.region (
  region_id    SMALLINT PRIMARY KEY,
  region_name  TEXT NOT NULL UNIQUE,
  country_id   SMALLINT NOT NULL,
  country_name TEXT NOT NULL,
  _region_name TEXT NOT NULL
);

Do computation and store result in table

WITH raw_region
    AS (SELECT DISTINCT country, region
        FROM m_data.ga_session
        ORDER BY country, region)

INSERT INTO m_dim_next.region
SELECT
  row_number() OVER (ORDER BY country, region) AS region_id,
  CASE WHEN (SELECT count(DISTINCT country)
             FROM raw_region r2
             WHERE r2.region = r1.region) > 1
       THEN region || ' / ' || country
       ELSE region END                         AS region_name,
  dense_rank() OVER (ORDER BY country)         AS country_id,
  country                                      AS country_name,
  region                                       AS _region_name
FROM raw_region r1;

INSERT INTO m_dim_next.region
VALUES (-1, 'Unknown', -1, 'Unknown', 'Unknown');

Speedup subsequent transformations

SELECT util.add_index(
  'm_dim_next', 'region',
  column_names := ARRAY ['_region_name', 'country_name', 'region_id']);

SELECT util.add_index(
  'm_dim_next', 'region',
  column_names := ARRAY ['country_id', 'region_id']);

ANALYZE m_dim_next.region;
!19
PostgreSQL as a data processing engine
@martin_loetzsch
Leave data in DB, Tables as (intermediate) results of processing steps
Execute query

ExecuteSQL(sql_file_name="preprocess-ad.sql")

cat app/data_integration/pipelines/facebook/preprocess-ad.sql \
  | PGTZ=Europe/Berlin PGOPTIONS=--client-min-messages=warning \
    psql --username=mloetzsch --host=localhost --echo-all \
         --no-psqlrc --set ON_ERROR_STOP=on kfz_dwh_etl

Read file

ReadFile(file_name="country_iso_code.csv",
         compression=Compression.NONE,
         target_table="os_data.country_iso_code",
         mapper_script_file_name="read-country-iso-codes.py",
         delimiter_char=";")

cat "dwh-data/country_iso_code.csv" \
  | .venv/bin/python3.6 "app/data_integration/pipelines/load_data/read-country-iso-codes.py" \
  | PGTZ=Europe/Berlin PGOPTIONS=--client-min-messages=warning \
    psql --username=mloetzsch --host=localhost --echo-all \
         --no-psqlrc --set ON_ERROR_STOP=on kfz_dwh_etl \
         --command="COPY os_data.country_iso_code FROM STDIN WITH CSV DELIMITER AS ';'"

Copy from other databases

Copy(sql_file_name="pdm/load-product.sql", source_db_alias="pdm",
     target_table="os_data.product",
     replace={"@@db@@": "K24Pdm", "@@dbschema@@": "ps",
              "@@client@@": "kfzteile24 GmbH"})

cat app/data_integration/pipelines/load_data/pdm/load-product.sql \
  | sed "s/@@db@@/K24Pdm/g;s/@@dbschema@@/ps/g;s/@@client@@/kfzteile24 GmbH/g" \
  | sed 's/$/$/g;s/$/$/g' | (cat && echo ';') \
  | (cat && echo ';
go') \
  | sqsh -U ***** -P ******* -S ******* -D K24Pdm -m csv \
  | PGTZ=Europe/Berlin PGOPTIONS=--client-min-messages=warning \
    psql --username=mloetzsch --host=localhost --echo-all \
         --no-psqlrc --set ON_ERROR_STOP=on kfz_dwh_etl \
         --command="COPY os_data.product FROM STDIN WITH CSV HEADER"
!20
Shell commands as interface to data & DBs
@martin_loetzsch
Nothing is faster than a unix pipe
Read a set of files


pipeline.add(
    ParallelReadFile(
        id="read_download",
        description="Loads PyPI downloads from pre_downloaded csv files",
        file_pattern="*/*/*/pypi/downloads-v1.csv.gz",
        read_mode=ReadMode.ONLY_NEW,
        compression=Compression.GZIP,
        target_table="pypi_data.download",
        delimiter_char="\t", skip_header=True, csv_format=True,
        file_dependencies=read_download_file_dependencies,
        date_regex="^(?P<year>\d{4})/(?P<month>\d{2})/(?P<day>\d{2})/",
        partition_target_table_by_day_id=True,
        timezone="UTC",
        commands_before=[
            ExecuteSQL(
                sql_file_name="create_download_data_table.sql",
                file_dependencies=read_download_file_dependencies)
        ]))

Split large joins into chunks

pipeline.add(
    ParallelExecuteSQL(
        id="transform_download",
        description="Maps downloads to their dimensions",
        sql_statement="SELECT pypi_tmp.insert_download(@chunk@::SMALLINT);",
        parameter_function=etl_tools.utils.chunk_parameter_function,
        parameter_placeholders=["@chunk@"],
        commands_before=[
            ExecuteSQL(sql_file_name="transform_download.sql")
        ]),
    upstreams=["preprocess_project_version", "transform_installer"])
!21
Incremental & parallel processing
@martin_loetzsch
You can’t join all clicks with all customers at once
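A hypothetical sketch of the chunking idea behind the ParallelExecuteSQL examples above (this is not Mara's actual etl_tools.utils.chunk_parameter_function; the chunk count and the modulo-on-id scheme are assumptions for illustration):

number_of_chunks = 10  # assumption: roughly the parallelism the DWH server can handle

def chunk_parameter_function():
    """Parameter values that ParallelExecuteSQL substitutes for @chunk@ in the SQL statement"""
    return list(range(number_of_chunks))

# The SQL function called per chunk then restricts itself to its share of the rows,
# e.g. ... WHERE download_id % 10 = chunk ..., so the chunks are disjoint, together
# cover the whole table, and can be processed in parallel.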
!22
(Data) security
@martin_loetzsch
Always: At least two out of
VPN
IP restriction
SSH tunnel
Don’t rely on passwords, they tend to be shared and are not
changed when someone leaves the company
SSH keys
Single sign-on
!23
Please don’t put your data on the internet
@martin_loetzsch
Also: GDPR
Authenticates each incoming request against an auth provider
We run it in front of all web interfaces
Including Tableau, Jenkins, Metabase
Many auth providers: Google, Azure AD, etc.
Much better than passwords!
!24
SSO with oauth2_proxy
@martin_loetzsch
https://github.com/pusher/oauth2_proxy
Implemented using the auth_request directive in nginx
server {
  listen 443 default ssl;
  include ssl.conf;

  location /oauth2/ {
    proxy_pass http://127.0.0.1:4180;
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Scheme $scheme;
    proxy_set_header X-Forwarded-Proto $scheme;
    proxy_set_header X-Auth-Request-Redirect $request_uri;
  }

  location = /oauth2/auth {
    proxy_pass http://127.0.0.1:4180;
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Scheme $scheme;
    proxy_set_header X-Forwarded-Proto $scheme;
    # nginx auth_request includes headers but not body
    proxy_set_header Content-Length "";
    proxy_pass_request_body off;
  }

  location / {
    auth_request /oauth2/auth;
    error_page 401 = /oauth2/sign_in;

    auth_request_set $email $upstream_http_x_auth_request_email;
    proxy_set_header X-Forwarded-Email $email;

    auth_request_set $auth_cookie $upstream_http_set_cookie;
    add_header Set-Cookie $auth_cookie;

    proxy_set_header Host $host;
    proxy_set_header X-Scheme $scheme;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Forwarded-Proto $scheme;

    proxy_send_timeout 600;
    proxy_read_timeout 600;
    proxy_buffering off;
    send_timeout 600;

    # pass to downstreams
    proxy_pass http://127.0.0.1:81;
  }

  access_log /bi/logs/nginx/access-443-default.log;
  error_log /bi/logs/nginx/error-443-default.log;
}
!25
SSO with oauth2_proxy II
@martin_loetzsch
https://github.com/pusher/oauth2_proxy
!26
Event collection
@martin_loetzsch
Ad Blockers
Blind Spots
Not all user interactions happen online (returns, call center requests)
Some data cannot be leaked to a pixel: segments, prices, backend logic
Solution: server-side tracking
Collect events on the server rather than on the client
Use cases & advantages
Ground truth: correct metrics recorded in marketing tools
Price: provide a cheaper alternative to GA Premium / Segment etc.
GDPR compliance: own data, avoid black boxes
Product analytics: Understand bottlenecks in user journey
Unified user journey: combine events from multiple touchpoints
Site speed: Front-ends are not slowed down by analytics pixels
SEO: Measure site indexing by search engines
!27
Pixel based tracking is dead
@martin_loetzsch
Server side tracking: surprisingly easy
!28
Works: Kinesis + AWS Lambda + S3 + BigQuery
@martin_loetzsch
Technologies don’t matter: would also work with Google cloud & Azure
[Architecture: user → Project A website → AWS Kinesis Firehose → AWS Lambda → Amazon S3 + Google BigQuery]
Everything the backend knows about the user, the context, the content
{
"visitor_id": "10d1fa9c9fd39cff44c88bd551b1ab4dfe92b3da",
"session_id": "9tv1phlqkl5kchajmb9k2j2434",
"timestamp": "2018-12-16T16:03:04+00:00",
"ip": "92.195.48.163",
"url": "https://www.project-a.com/en/career/jobs/data-engineer-data-scientist-m-f-d-4072020002?gh_jid=4082751002&gh_src=9fcd30182&&utm_medium=social&u
"host": "www.project-a.com",
"path": [
"en",
"career",
"jobs",
"data-engineer-data-scientist-m-f-d-4072020002"
],
"query": {
"gh_jid": "4082751002",
"gh_src": "9fcd30182",
"utm_medium": "social",
"utm_source": "linkedin",
"utm_campaign": "buffer"
},
"referrer": null,
"language": "en",
"ua": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36"
}
!29
Backend collects user interaction data
@martin_loetzsch
And sends it asynchronously to a queue
Kirby tracking plugin for Project A web site
<?php

require 'vendor/autoload.php';

// this cookie is set when not present
$cookieName = 'visitor';

// retrieve visitor id from cookie
$visitorId = array_key_exists($cookieName, $_COOKIE) ?
    $_COOKIE[$cookieName] : null;

if (!$visitorId) {
    // visitor cookie not set. Use session id as visitor ID
    $visitorId = sha1(uniqid(s::id(), true));
    setcookie($cookieName, $visitorId,
        time() + (2 * 365 * 24 * 60 * 60), '/',
        'project-a.com', false, true);
}

$request = kirby()->request();

// the payload to log
$event = [
    'visitor_id' => $visitorId,
    'session_id' => s::id(),
    'timestamp' => date('c'),
    'ip' => $request->__debuginfo()['ip'],
    'url' => $request->url(),
    'host' => $_SERVER['SERVER_NAME'],
    'path' => $request->path()->toArray(),
    'query' => $request->query()->toArray(),
    'referrer' => visitor::referer(),
    'language' => visitor::acceptedLanguageCode(),
    'ua' => visitor::userAgent()
];

$firehoseClient = new \Aws\Firehose\FirehoseClient([
    // secrets
]);

// publish message to firehose delivery stream
$promise = $firehoseClient->putRecordAsync([
    'DeliveryStreamName' => 'kinesis-firehose-stream-123',
    'Record' => ['Data' => json_encode($event)]
]);

register_shutdown_function(function () use ($promise) {
    $promise->wait();
});
!30
Custom implementation for each web site
@martin_loetzsch
Good news: there is usually already code that populates the data layer
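For backends that are not PHP the same pattern is a few lines of Python; a minimal sketch using boto3's Firehose client (region, stream name and event fields are placeholders, not the actual Project A setup):

import json
from datetime import datetime, timezone

import boto3  # assumes AWS credentials are configured in the environment

firehose = boto3.client('firehose', region_name='eu-central-1')  # region is an assumption

def track_event(visitor_id: str, session_id: str, url: str, **extra):
    """Send a tracking event to a Kinesis Firehose delivery stream (fire and forget)"""
    event = {
        'visitor_id': visitor_id,
        'session_id': session_id,
        'timestamp': datetime.now(timezone.utc).isoformat(),
        'url': url,
        **extra,
    }
    firehose.put_record(
        DeliveryStreamName='kinesis-firehose-stream-123',  # placeholder stream name
        Record={'Data': (json.dumps(event) + '\n').encode('utf-8')})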
Result: enhanced with GEO information & device detection
{
"visitor_id": "10d1fa9c9fd39cff44c88bd551b1ab4dfe92b3da",
"session_id": "9tv1phlqkl5kchajmb9k2j2434",
"timestamp": "2018-12-16T16:03:04+00:00",
"url": "https://www.project-a.com/en/career/jobs/data-engineer-
data-scientist-m-f-d-4072020002?
gh_jid=4082751002&gh_src=9fcd30182&&utm_medium=social&utm_source=li
nkedin&utm_campaign=buffer",
"host": "www.project-a.com",
"path": [
"en",
"career",
"jobs",
"data-engineer-data-scientist-m-f-d-4072020002"
],
"query": [
{
"param": "gh_jid",
"value": "4082751002"
},
{
"param": "gh_src",
"value": "9fcd30182"
},
{
"param": "utm_medium",
"value": "social"
},
{
"param": "utm_source",
"value": "linkedin"
},
{
"param": "utm_campaign",
"value": "buffer"
}
],
"referrer": null,
"language": "en",
"browser_family": "Chrome",
"browser_version": "70",
"os_family": "Mac OS X",
"os_version": "10",
"device_brand": null,
"device_model": null,
"country_iso_code": "DE",
"country_name": "Germany",
"subdivisions_iso_code": "SN",
"subdivisions_name": "Saxony",
"city_name": "Dresden"
}
!31
Lambda function transforms and stores events
@martin_loetzsch
GDPR: remove IP address, user agent
Lambda function part I
import base64
import functools
import json

import geoip2.database
from google.cloud import bigquery
from ua_parser import user_agent_parser


@functools.lru_cache(maxsize=None)
def get_geo_db():
    return geoip2.database.Reader('./GeoLite2-City_20181002/GeoLite2-City.mmdb')


def extract_geo_data(ip):
    """Does a geo lookup for an IP address"""
    response = get_geo_db().city(ip)
    return {
        'country_iso_code': response.country.iso_code,
        'country_name': response.country.name,
        'subdivisions_iso_code': response.subdivisions.most_specific.iso_code,
        'subdivisions_name': response.subdivisions.most_specific.name,
        'city_name': response.city.name
    }


def parse_user_agent(user_agent_string):
    """Extracts browser, OS and device information from a user agent"""
    result = user_agent_parser.Parse(user_agent_string)
    return {
        'browser_family': result['user_agent']['family'],
        'browser_version': result['user_agent']['major'],
        'os_family': result['os']['family'],
        'os_version': result['os']['major'],
        'device_brand': result['device']['brand'],
        'device_model': result['device']['model']
    }
!32
Geo Lookup & device detection
@martin_loetzsch
Solved problem, very good open source libraries exist
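As a usage sketch, applying the two helpers above to the sample event from the earlier slide (the commented output values match the enriched event shown before, assuming the bundled GeoLite2 database resolves that IP the same way):

event = {
    'ip': '92.195.48.163',
    'ua': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_1) AppleWebKit/537.36 '
          '(KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36',
}
print(extract_geo_data(event['ip']))   # e.g. {'country_iso_code': 'DE', ..., 'city_name': 'Dresden'}
print(parse_user_agent(event['ua']))   # e.g. {'browser_family': 'Chrome', 'browser_version': '70', ...}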
Lambda function part II
def lambda_handler(event, context):
    lambda_output_records = []
    rows_for_biguery = []

    bq_client = bigquery.Client.from_service_account_json(
        'big-query-credendials.json'
    )

    for record in event['records']:
        message = json.loads(base64.b64decode(record['data']))

        # extract browser, device, os
        if message['ua']:
            message.update(parse_user_agent(message['ua']))
        del message['ua']

        # geo lookup for ip address
        message.update(extract_geo_data(message['ip']))
        del message['ip']

        # update get parameters
        if message['query']:
            message['query'] = [
                {'param': param, 'value': value}
                for param, value in message['query'].items()
            ]

        rows_for_biguery.append(message)

        lambda_output_records.append({
            'recordId': record['recordId'],
            'result': 'Ok',
            'data': base64.b64encode(
                json.dumps(message).encode('utf-8')).decode('utf-8')
        })

    errors = bq_client.insert_rows(
        bq_client.get_table(
            bq_client.dataset('server_side_tracking')
                     .table('project_a_website_events')
        ),
        rows_for_biguery)

    if errors != []:
        raise Exception(json.dumps(errors))

    return {
        "statusCode": 200,
        "records": lambda_output_records
    }
!33
Send events to BigQuery & S3
@martin_loetzsch
Also possible: send to Google Analytics, Mixpanel, Segment, Heap etc.
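For example, a hedged sketch of forwarding such an event from the Lambda function to Google Analytics via the Universal Analytics Measurement Protocol (the tracking id and the field mapping are placeholders, not the setup shown in this deck):

import requests  # would need to be bundled into the Lambda deployment package

def forward_to_google_analytics(message: dict):
    """Replay a server-side event as a pageview hit against the Measurement Protocol endpoint"""
    requests.post('https://www.google-analytics.com/collect', data={
        'v': 1,                          # protocol version
        'tid': 'UA-XXXXXXXX-1',          # placeholder tracking id
        'cid': message['visitor_id'],    # client id: reuse the visitor cookie
        't': 'pageview',
        'dh': message['host'],
        'dp': '/' + '/'.join(message['path']),
    }, timeout=2)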
!34
Thank you!
@martin_loetzsch
Questions?
!35
Bonus track: consistency & correctness
@martin_loetzsch
It’s easy to make mistakes during ETL


DROP SCHEMA IF EXISTS s CASCADE; CREATE SCHEMA s;



CREATE TABLE s.city (
city_id SMALLINT,
city_name TEXT,
country_name TEXT
);

INSERT INTO s.city VALUES
(1, 'Berlin', 'Germany'),
(2, 'Budapest', 'Hungary');



CREATE TABLE s.customer (
customer_id BIGINT,
city_fk SMALLINT
);

INSERT INTO s.customer VALUES
(1, 1),
(1, 2),
(2, 3);

Customers per country?


SELECT
country_name,
count(*) AS number_of_customers
FROM s.customer JOIN s.city 

ON customer.city_fk = s.city.city_id
GROUP BY country_name;



Back up all assumptions about data by constraints


ALTER TABLE s.city ADD PRIMARY KEY (city_id);
ALTER TABLE s.city ADD UNIQUE (city_name);
ALTER TABLE s.city ADD UNIQUE (city_name, country_name);


ALTER TABLE s.customer ADD PRIMARY KEY (customer_id);
[23505] ERROR: could not create unique index "customer_pkey"
Detail: Key (customer_id)=(1) is duplicated.

ALTER TABLE s.customer ADD FOREIGN KEY (city_fk)
REFERENCES s.city (city_id);
[23503] ERROR: insert or update on table "customer" violates
foreign key constraint "customer_city_fk_fkey"
Detail: Key (city_fk)=(3) is not present in table "city"
!36
Referential consistency
@martin_loetzsch
Only very little overhead, will save your ass
[Schema diagram: customer (customer_id, first_order_fk, favourite_product_fk, lifetime_revenue), order (order_id, processed_order_id, customer_fk, product_fk, revenue), product (product_id, revenue_last_6_months)]
Never repeat “business logic”


SELECT sum(total_price) AS revenue
FROM os_data.order
WHERE status IN ('pending', 'accepted', 'completed', 'proposal_for_change');


SELECT CASE WHEN (status <> 'started'
                  AND payment_status = 'authorised'
                  AND order_type <> 'backend')
            THEN o.order_id END AS processed_order_fk
FROM os_data.order;


SELECT (last_status = 'pending') :: INTEGER AS is_unprocessed
FROM os_data.order;

Refactor pipeline
Create separate task that computes everything we know about an order
Usually difficult in real life
Load → preprocess → transform → flatten-fact
!37
Computational consistency
@martin_loetzsch
Requires discipline
load-product load-order load-customer
preprocess-product preprocess-order preprocess-customer
transform-product transform-order transform-customer
flatten-product-fact flatten-order-fact flatten-customer-fact
CREATE FUNCTION m_tmp.normalize_utm_source(TEXT)
  RETURNS TEXT AS $$
SELECT CASE
         WHEN $1 LIKE '%.%' THEN lower($1)
         WHEN $1 = '(direct)' THEN 'Direct'
         WHEN $1 LIKE 'Untracked%' OR $1 LIKE '(%)' THEN $1
         ELSE initcap($1)
       END;
$$ LANGUAGE SQL IMMUTABLE;


CREATE FUNCTION util.norm_phone_number(phone_number TEXT)
  RETURNS TEXT AS $$
BEGIN
  phone_number := TRIM(phone_number);
  phone_number := regexp_replace(phone_number, '\(0\)', '');
  phone_number := regexp_replace(phone_number, '[^[:digit:]]', '', 'g');
  phone_number := regexp_replace(phone_number, '^(\+49|0049|49)', '0');
  phone_number := regexp_replace(phone_number, '^(00)', '');
  phone_number := COALESCE(phone_number, '');
  RETURN phone_number;
END;
$$ LANGUAGE PLPGSQL IMMUTABLE;


CREATE FUNCTION m_tmp.compute_ad_id(id BIGINT, api m_tmp.API)
  RETURNS BIGINT AS $$
-- creates a collision free ad id from an id in a source system
SELECT ((CASE api
           WHEN 'adwords' THEN 1
           WHEN 'bing' THEN 2
           WHEN 'criteo' THEN 3
           WHEN 'facebook' THEN 4
           WHEN 'backend' THEN 5
         END) * 10 ^ 18) :: BIGINT + id
$$ LANGUAGE SQL IMMUTABLE;


CREATE FUNCTION pv.date_to_supplier_period_start(INTEGER)
  RETURNS INTEGER AS $$
-- this maps all dates to either an integer which is included
-- in lieferantenrabatt.period_start or
-- null (meaning we don't have a lieferantenrabatt for it)
SELECT CASE
         WHEN $1 >= 20170501 THEN 20170501
         WHEN $1 >= 20151231 THEN 20151231
         ELSE 20151231
       END;
$$ LANGUAGE SQL IMMUTABLE;
!38
When not possible: use functions
@martin_loetzsch
Almost no performance overhead
Check for “lost” rows


SELECT util.assert_equal(
'The order items fact table should contain all order items',
'SELECT count(*) FROM os_dim.order_item',
'SELECT count(*) FROM os_dim.order_items_fact');







Check consistency across cubes / domains


SELECT util.assert_almost_equal(
'The number of first orders should be the same in '

|| 'orders and marketing touchpoints cube',
'SELECT count(net_order_id)
FROM os_dim.order
WHERE _net_order_rank = 1;',

'SELECT (SELECT sum(number_of_first_net_orders)
FROM m_dim.acquisition_performance)
/ (SELECT count(*)
FROM m_dim.performance_attribution_model)',
1.0
);

Check completeness of source data


SELECT util.assert_not_found(
'Each adwords campaign must have the attribute "Channel"',
'SELECT DISTINCT campaign_name, account_name
FROM aw_tmp.ad
JOIN aw_dim.ad_performance ON ad_fk = ad_id
WHERE attributes->>''Channel'' IS NULL
AND impressions > 0
AND _date > now() - INTERVAL ''30 days''');



Check correctness of redistribution transformations


SELECT util.assert_almost_equal_relative(
'The cost of non-converting touchpoints must match the'
|| 'redistributed customer acquisition and reactivation cost',
'SELECT sum(cost)
FROM m_tmp.cost_of_non_converting_touchpoints;',
'SELECT
(SELECT sum(cost_per_touchpoint * number_of_touchpoints)
FROM m_tmp.redistributed_customer_acquisition_cost)
+ (SELECT sum(cost_per_touchpoint * number_of_touchpoints)
FROM m_tmp.redistributed_customer_reactivation_cost);',
0.00001);
!39
Data consistency checks
@martin_loetzsch
Makes changing things easy
Execute queries and compare results

CREATE FUNCTION util.assert(description TEXT, query TEXT)
RETURNS BOOLEAN AS $$
DECLARE
succeeded BOOLEAN;
BEGIN
EXECUTE query INTO succeeded;
IF NOT succeeded THEN RAISE EXCEPTION 'assertion failed:
# % #
%', description, query;
END IF;
RETURN succeeded;
END
$$ LANGUAGE 'plpgsql';







CREATE FUNCTION util.assert_almost_equal_relative(
description TEXT, query1 TEXT,
query2 TEXT, percentage DECIMAL)
RETURNS BOOLEAN AS $$
DECLARE
result1 NUMERIC;
result2 NUMERIC;
succeeded BOOLEAN;
BEGIN
EXECUTE query1 INTO result1;
EXECUTE query2 INTO result2;
EXECUTE 'SELECT abs(' || result2 || ' - ' || result1 || ') / '
|| result1 || ' < ' || percentage INTO succeeded;
IF NOT succeeded THEN RAISE WARNING '%
assertion failed: abs(% - %) / % < %
%: (%)
%: (%)', description, result2, result1, result1, percentage,
result1, query1, result2, query2;
END IF;
RETURN succeeded;
END
$$ LANGUAGE 'plpgsql';
!40
Consistency check functions
@martin_loetzsch
Also: assert_not_found, assert_equal_table, assert_smaller_than_or_equal
Yes, unit tests
SELECT util.assert_value_equal('test_german_number_with_country_prefix', util.norm_phone_number('00491234'), '01234');
SELECT util.assert_value_equal('test_german_number_not_to_be_confused_with_country_prefix', util.norm_phone_number('0491234'), '0491234');
SELECT util.assert_value_equal('test_non_german_number_with_plus', util.norm_phone_number('+44 1234'), '441234');
SELECT util.assert_value_equal('test_german_number_with_prefix_and_additional_zero', util.norm_phone_number('+49 (0)1234'), '01234');
SELECT util.assert_value_equal('test__trim', util.norm_phone_number(' 0491234 '), '0491234');
SELECT util.assert_value_equal('test_number_with_leading_wildcard_symbol', util.norm_phone_number('*+436504834933'), '436504834933');
SELECT util.assert_value_equal('test_NULL', util.norm_phone_number(NULL), '');
SELECT util.assert_value_equal('test_empty', util.norm_phone_number(''), '');
SELECT util.assert_value_equal('test_wildcard_only', util.norm_phone_number('*'), '');
SELECT util.assert_value_equal('test_foreign_number_with_two_leading_zeroes', util.norm_phone_number('*00436769553701'), '436769553701');
SELECT util.assert_value_equal('test_domestic_number_with_trailing_letters', util.norm_phone_number('017678402HORST'), '017678402');
SELECT util.assert_value_equal('test_domestic_number_with_leading_letters', util.norm_phone_number('HORST017678402'), '017678402');
SELECT util.assert_value_equal('test_domestic_number_with_letters_in_between', util.norm_phone_number('0H1O7R6S7T8402'), '017678402');
SELECT util.assert_value_equal('test_german_number_with_country_prefix_and_leading_letters',
                               util.norm_phone_number('HORST00491234'), '01234');
SELECT util.assert_value_equal('test_german_number_not_to_be_confused_with_country_prefix_and_leading_letters',
                               util.norm_phone_number('HORST0491234'), '0491234');
SELECT util.assert_value_equal('test_non_german_number_with_plus_and_leading_letters', util.norm_phone_number('HORST+44 1234'), '441234');
SELECT util.assert_value_equal('test_german_number_with_prefix_and_additional_zero_and_leading_letters',
                               util.norm_phone_number('HORST+49 (0)1234'), '01234');
!41
Unit tests
@martin_loetzsch
People enter horrible telephone numbers into websites
Contribution margin 3a
SELECT order_item_id,
((((((COALESCE(item_net_price, 0)::REAL
+ COALESCE(net_shipping_revenue, 0)::REAL)
- ((COALESCE(item_net_purchase_price, 0)::REAL
+ COALESCE(alcohol_tax, 0)::REAL)
+ COALESCE(import_tax, 0)::REAL))
- (COALESCE(net_fulfillment_costs, 0)::REAL
+ COALESCE(net_payment_costs, 0)::REAL))
- COALESCE(net_return_costs, 0)::REAL)
- ((COALESCE(item_net_price, 0)::REAL
+ COALESCE(net_shipping_revenue, 0)::REAL)
- ((((COALESCE(item_net_price, 0)::REAL
+ COALESCE(item_tax_amount, 0)::REAL)
+ COALESCE(gross_shipping_revenue, 0)::REAL)
- COALESCE(voucher_gross_amount, 0)::REAL)
* (1 - ((COALESCE(item_tax_amount, 0)::REAL
+ (COALESCE(gross_shipping_revenue, 0)::REAL
- COALESCE(net_shipping_revenue, 0)::REAL))
/ NULLIF(((COALESCE(item_net_price, 0)::REAL
+ COALESCE(item_tax_amount, 0)::REAL)
+ COALESCE(gross_shipping_revenue,
0)::REAL), 0))))))
- COALESCE(goodie_cost_per_item, 0)::REAL) :: DOUBLE PRECISION
AS "Contribution margin 3a"
FROM dim.sales_fact;
Use schemas between reporting and database
Mondrian
LookerML
your own
Or: Pre-compute metrics in database
!42
Semantic consistency
@martin_loetzsch
Changing the meaning of metrics across all dashboards needs to be easy

More Related Content

What's hot

MySQL partitions tutorial
MySQL partitions tutorialMySQL partitions tutorial
MySQL partitions tutorialGiuseppe Maxia
 
Streaming sql and druid
Streaming sql and druid Streaming sql and druid
Streaming sql and druid arupmalakar
 
Streaming 101 Revisited: A Fresh Hot Take With Tyler Akidau and Dan Sotolongo...
Streaming 101 Revisited: A Fresh Hot Take With Tyler Akidau and Dan Sotolongo...Streaming 101 Revisited: A Fresh Hot Take With Tyler Akidau and Dan Sotolongo...
Streaming 101 Revisited: A Fresh Hot Take With Tyler Akidau and Dan Sotolongo...HostedbyConfluent
 
Data Preparation Fundamentals
Data Preparation FundamentalsData Preparation Fundamentals
Data Preparation FundamentalsDATAVERSITY
 
Snowflake Data Governance
Snowflake Data GovernanceSnowflake Data Governance
Snowflake Data Governancessuser538b022
 
Snowflake + Power BI: Cloud Analytics for Everyone
Snowflake + Power BI: Cloud Analytics for EveryoneSnowflake + Power BI: Cloud Analytics for Everyone
Snowflake + Power BI: Cloud Analytics for EveryoneAngel Abundez
 
Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...
Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...
Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...DataWorks Summit
 
BigQuery best practices and recommendations to reduce costs with BI Engine, S...
BigQuery best practices and recommendations to reduce costs with BI Engine, S...BigQuery best practices and recommendations to reduce costs with BI Engine, S...
BigQuery best practices and recommendations to reduce costs with BI Engine, S...Márton Kodok
 
Predicting Flights with Azure Databricks
Predicting Flights with Azure DatabricksPredicting Flights with Azure Databricks
Predicting Flights with Azure DatabricksSarah Dutkiewicz
 
Introduction to Impala
Introduction to ImpalaIntroduction to Impala
Introduction to Impalamarkgrover
 
Programming in Spark using PySpark
Programming in Spark using PySpark      Programming in Spark using PySpark
Programming in Spark using PySpark Mostafa
 
Snowflake: Your Data. No Limits (Session sponsored by Snowflake) - AWS Summit...
Snowflake: Your Data. No Limits (Session sponsored by Snowflake) - AWS Summit...Snowflake: Your Data. No Limits (Session sponsored by Snowflake) - AWS Summit...
Snowflake: Your Data. No Limits (Session sponsored by Snowflake) - AWS Summit...Amazon Web Services
 
How to Use a Semantic Layer to Deliver Actionable Insights at Scale
How to Use a Semantic Layer to Deliver Actionable Insights at ScaleHow to Use a Semantic Layer to Deliver Actionable Insights at Scale
How to Use a Semantic Layer to Deliver Actionable Insights at ScaleDATAVERSITY
 
Data Architecture Strategies: Data Architecture for Digital Transformation
Data Architecture Strategies: Data Architecture for Digital TransformationData Architecture Strategies: Data Architecture for Digital Transformation
Data Architecture Strategies: Data Architecture for Digital TransformationDATAVERSITY
 
Should I move my database to the cloud?
Should I move my database to the cloud?Should I move my database to the cloud?
Should I move my database to the cloud?James Serra
 
35 power bi presentations
35 power bi presentations35 power bi presentations
35 power bi presentationsSean Brady
 
MongoDB at Scale
MongoDB at ScaleMongoDB at Scale
MongoDB at ScaleMongoDB
 
Snowflake Company Presentation
Snowflake Company PresentationSnowflake Company Presentation
Snowflake Company PresentationAndrewJiang18
 
CDC patterns in Apache Kafka®
CDC patterns in Apache Kafka®CDC patterns in Apache Kafka®
CDC patterns in Apache Kafka®confluent
 

What's hot (20)

MySQL partitions tutorial
MySQL partitions tutorialMySQL partitions tutorial
MySQL partitions tutorial
 
Streaming sql and druid
Streaming sql and druid Streaming sql and druid
Streaming sql and druid
 
Streaming 101 Revisited: A Fresh Hot Take With Tyler Akidau and Dan Sotolongo...
Streaming 101 Revisited: A Fresh Hot Take With Tyler Akidau and Dan Sotolongo...Streaming 101 Revisited: A Fresh Hot Take With Tyler Akidau and Dan Sotolongo...
Streaming 101 Revisited: A Fresh Hot Take With Tyler Akidau and Dan Sotolongo...
 
Data Preparation Fundamentals
Data Preparation FundamentalsData Preparation Fundamentals
Data Preparation Fundamentals
 
Snowflake Data Governance
Snowflake Data GovernanceSnowflake Data Governance
Snowflake Data Governance
 
Snowflake + Power BI: Cloud Analytics for Everyone
Snowflake + Power BI: Cloud Analytics for EveryoneSnowflake + Power BI: Cloud Analytics for Everyone
Snowflake + Power BI: Cloud Analytics for Everyone
 
Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...
Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...
Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...
 
BigQuery best practices and recommendations to reduce costs with BI Engine, S...
BigQuery best practices and recommendations to reduce costs with BI Engine, S...BigQuery best practices and recommendations to reduce costs with BI Engine, S...
BigQuery best practices and recommendations to reduce costs with BI Engine, S...
 
Predicting Flights with Azure Databricks
Predicting Flights with Azure DatabricksPredicting Flights with Azure Databricks
Predicting Flights with Azure Databricks
 
Introduction to Impala
Introduction to ImpalaIntroduction to Impala
Introduction to Impala
 
Programming in Spark using PySpark
Programming in Spark using PySpark      Programming in Spark using PySpark
Programming in Spark using PySpark
 
Snowflake: Your Data. No Limits (Session sponsored by Snowflake) - AWS Summit...
Snowflake: Your Data. No Limits (Session sponsored by Snowflake) - AWS Summit...Snowflake: Your Data. No Limits (Session sponsored by Snowflake) - AWS Summit...
Snowflake: Your Data. No Limits (Session sponsored by Snowflake) - AWS Summit...
 
How to Use a Semantic Layer to Deliver Actionable Insights at Scale
How to Use a Semantic Layer to Deliver Actionable Insights at ScaleHow to Use a Semantic Layer to Deliver Actionable Insights at Scale
How to Use a Semantic Layer to Deliver Actionable Insights at Scale
 
Data Architecture Strategies: Data Architecture for Digital Transformation
Data Architecture Strategies: Data Architecture for Digital TransformationData Architecture Strategies: Data Architecture for Digital Transformation
Data Architecture Strategies: Data Architecture for Digital Transformation
 
Should I move my database to the cloud?
Should I move my database to the cloud?Should I move my database to the cloud?
Should I move my database to the cloud?
 
35 power bi presentations
35 power bi presentations35 power bi presentations
35 power bi presentations
 
MongoDB at Scale
MongoDB at ScaleMongoDB at Scale
MongoDB at Scale
 
Snowflake Company Presentation
Snowflake Company PresentationSnowflake Company Presentation
Snowflake Company Presentation
 
CDC patterns in Apache Kafka®
CDC patterns in Apache Kafka®CDC patterns in Apache Kafka®
CDC patterns in Apache Kafka®
 
Snowflake Overview
Snowflake OverviewSnowflake Overview
Snowflake Overview
 

Similar to Data infrastructure for the other 90% of companies

Data Warehousing with Python
Data Warehousing with PythonData Warehousing with Python
Data Warehousing with PythonMartin Loetzsch
 
Lightweight ETL pipelines with mara (PyData Berlin September Meetup)
Lightweight ETL pipelines with mara (PyData Berlin September Meetup)Lightweight ETL pipelines with mara (PyData Berlin September Meetup)
Lightweight ETL pipelines with mara (PyData Berlin September Meetup)Martin Loetzsch
 
DN 2017 | Reducing pain in data engineering | Martin Loetzsch | Project A
DN 2017 | Reducing pain in data engineering | Martin Loetzsch | Project ADN 2017 | Reducing pain in data engineering | Martin Loetzsch | Project A
DN 2017 | Reducing pain in data engineering | Martin Loetzsch | Project ADataconomy Media
 
CARTO en 5 Pasos: del Dato a la Toma de Decisiones [CARTO]
CARTO en 5 Pasos: del Dato a la Toma de Decisiones [CARTO]CARTO en 5 Pasos: del Dato a la Toma de Decisiones [CARTO]
CARTO en 5 Pasos: del Dato a la Toma de Decisiones [CARTO]CARTO
 
The Sum of our Parts: the Complete CARTO Journey [CARTO]
The Sum of our Parts: the Complete CARTO Journey [CARTO]The Sum of our Parts: the Complete CARTO Journey [CARTO]
The Sum of our Parts: the Complete CARTO Journey [CARTO]CARTO
 
Evolutionary db development
Evolutionary db development Evolutionary db development
Evolutionary db development Open Party
 
Best Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache SparkBest Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache SparkDatabricks
 
MSBI and Data WareHouse techniques by Quontra
MSBI and Data WareHouse techniques by Quontra MSBI and Data WareHouse techniques by Quontra
MSBI and Data WareHouse techniques by Quontra QUONTRASOLUTIONS
 
14 functional design
14 functional design14 functional design
14 functional designrandhirlpu
 
Overview of business intelligence
Overview of business intelligenceOverview of business intelligence
Overview of business intelligenceAhsan Kabir
 
IRMUK-SOA_for_MDM_DQ_Integration_DV_20min
IRMUK-SOA_for_MDM_DQ_Integration_DV_20minIRMUK-SOA_for_MDM_DQ_Integration_DV_20min
IRMUK-SOA_for_MDM_DQ_Integration_DV_20minDigendra Vir Singh (DV)
 
Basics of Microsoft Business Intelligence and Data Integration Techniques
Basics of Microsoft Business Intelligence and Data Integration TechniquesBasics of Microsoft Business Intelligence and Data Integration Techniques
Basics of Microsoft Business Intelligence and Data Integration TechniquesValmik Potbhare
 
Love Your Database Railsconf 2017
Love Your Database Railsconf 2017Love Your Database Railsconf 2017
Love Your Database Railsconf 2017gisborne
 
VSSML18. Data Transformations
VSSML18. Data TransformationsVSSML18. Data Transformations
VSSML18. Data TransformationsBigML, Inc
 
Analyst View of Data Virtualization: Conversations with Boulder Business Inte...
Analyst View of Data Virtualization: Conversations with Boulder Business Inte...Analyst View of Data Virtualization: Conversations with Boulder Business Inte...
Analyst View of Data Virtualization: Conversations with Boulder Business Inte...Denodo
 
Databaseconcepts
DatabaseconceptsDatabaseconcepts
Databaseconceptsdilipkkr
 
What Are the Key Steps in Scraping Product Data from Amazon India.pptx
What Are the Key Steps in Scraping Product Data from Amazon India.pptxWhat Are the Key Steps in Scraping Product Data from Amazon India.pptx
What Are the Key Steps in Scraping Product Data from Amazon India.pptxProductdata Scrape
 

Similar to Data infrastructure for the other 90% of companies (20)

Data Warehousing with Python
Data Warehousing with PythonData Warehousing with Python
Data Warehousing with Python
 
Lightweight ETL pipelines with mara (PyData Berlin September Meetup)
Lightweight ETL pipelines with mara (PyData Berlin September Meetup)Lightweight ETL pipelines with mara (PyData Berlin September Meetup)
Lightweight ETL pipelines with mara (PyData Berlin September Meetup)
 
DN 2017 | Reducing pain in data engineering | Martin Loetzsch | Project A
DN 2017 | Reducing pain in data engineering | Martin Loetzsch | Project ADN 2017 | Reducing pain in data engineering | Martin Loetzsch | Project A
DN 2017 | Reducing pain in data engineering | Martin Loetzsch | Project A
 
CARTO en 5 Pasos: del Dato a la Toma de Decisiones [CARTO]
CARTO en 5 Pasos: del Dato a la Toma de Decisiones [CARTO]CARTO en 5 Pasos: del Dato a la Toma de Decisiones [CARTO]
CARTO en 5 Pasos: del Dato a la Toma de Decisiones [CARTO]
 
The Sum of our Parts: the Complete CARTO Journey [CARTO]
The Sum of our Parts: the Complete CARTO Journey [CARTO]The Sum of our Parts: the Complete CARTO Journey [CARTO]
The Sum of our Parts: the Complete CARTO Journey [CARTO]
 
Evolutionary db development
Evolutionary db development Evolutionary db development
Evolutionary db development
 
Best Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache SparkBest Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache Spark
 
MSBI and Data WareHouse techniques by Quontra
MSBI and Data WareHouse techniques by Quontra MSBI and Data WareHouse techniques by Quontra
MSBI and Data WareHouse techniques by Quontra
 
14 functional design
14 functional design14 functional design
14 functional design
 
Overview of business intelligence
Overview of business intelligenceOverview of business intelligence
Overview of business intelligence
 
IRMUK-SOA_for_MDM_DQ_Integration_DV_20min
IRMUK-SOA_for_MDM_DQ_Integration_DV_20minIRMUK-SOA_for_MDM_DQ_Integration_DV_20min
IRMUK-SOA_for_MDM_DQ_Integration_DV_20min
 
Basics of Microsoft Business Intelligence and Data Integration Techniques
Basics of Microsoft Business Intelligence and Data Integration TechniquesBasics of Microsoft Business Intelligence and Data Integration Techniques
Basics of Microsoft Business Intelligence and Data Integration Techniques
 
MaheshCV_Yepme
MaheshCV_YepmeMaheshCV_Yepme
MaheshCV_Yepme
 
Data Warehousing
Data WarehousingData Warehousing
Data Warehousing
 
Love Your Database Railsconf 2017
Love Your Database Railsconf 2017Love Your Database Railsconf 2017
Love Your Database Railsconf 2017
 
VSSML18. Data Transformations
VSSML18. Data TransformationsVSSML18. Data Transformations
VSSML18. Data Transformations
 
ITReady DW Day2
ITReady DW Day2ITReady DW Day2
ITReady DW Day2
 
Analyst View of Data Virtualization: Conversations with Boulder Business Inte...
Analyst View of Data Virtualization: Conversations with Boulder Business Inte...Analyst View of Data Virtualization: Conversations with Boulder Business Inte...
Analyst View of Data Virtualization: Conversations with Boulder Business Inte...
 
Databaseconcepts
DatabaseconceptsDatabaseconcepts
Databaseconcepts
 
What Are the Key Steps in Scraping Product Data from Amazon India.pptx
What Are the Key Steps in Scraping Product Data from Amazon India.pptxWhat Are the Key Steps in Scraping Product Data from Amazon India.pptx
What Are the Key Steps in Scraping Product Data from Amazon India.pptx
 

Recently uploaded

(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...Christo Ananth
 
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVHARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVRajaP95
 
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSMANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSSIVASHANKAR N
 
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130Suhani Kapoor
 
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Dr.Costas Sachpazis
 
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...RajaP95
 
UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and workingUNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and workingrknatarajan
 
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur EscortsCall Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
Porous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingPorous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingrakeshbaidya232001
 
Introduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxIntroduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxupamatechverse
 
UNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its PerformanceUNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its Performancesivaprakash250
 
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...Soham Mondal
 
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordCCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordAsst.prof M.Gokilavani
 
SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )Tsuyoshi Horigome
 

Recently uploaded (20)

Roadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and RoutesRoadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and Routes
 
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
 
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
 
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVHARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
 
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSMANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
 
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
 
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
 
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
 
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
 
UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and workingUNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
 
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur EscortsCall Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
 
Porous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingPorous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writing
 
Introduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxIntroduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptx
 
UNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its PerformanceUNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its Performance
 
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
 
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordCCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
 
SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )
 

Data infrastructure for the other 90% of companies

  • 1. @martin_loetzsch Dr. Martin Loetzsch Data Council Meetup Kickoff in Berlin May 2019 Data infrastructure for the other 90% of companies
  • 3. !3 Your are not Google @martin_loetzsch You are also not Amazon, LinkedIn, Uber etc. https://www.slideshare.net/zhenxiao/real-time-analytics-at-uber- strata-data-2019 https://blog.bradfieldcs.com/you-are-not-google-84912cf44afb
  • 5. All the data of the company in one place 
 Data is the single source of truth easy to access documented embedded into the organisation Integration of different domains
 
 
 
 
 
 
 
 
 
 
 Main challenges Consistency & correctness Changeability Complexity Transparency !5 Data warehouse = integrated data @martin_loetzsch Nowadays required for running a business application databases events csv files apis reporting crm marketing … search pricing DWH orders users products price 
 histories emails clicks … … operation
 events
  • 8. Warehouse adoption from 2016 to today 
 
 
 
 
 
 
 
 
 
 
 
 Based on three years of segment.io customer adoption (https://twitter.com/segment/status/1125891660800413697)
 BigQuery When you have terabytes of stuff For raw event storage & processing 
 Snowflake When you are not able to run a database yourself When you can’t write efficient queries Redshift Can’t recommend for ETL When OLAP performance becomes a problem ClickHouse For very fast OLAP !8 If in doubt, use PostgreSQL @martin_loetzsch Boring, battle-tested work horse
  • 10. Data pipelines as code SQL files, python & shell scripts Structure & content of data warehouse are result of running code 
 Easy to debug & inspect Develop locally, test on staging system, then deploy to production Functional data engineering Reproducability, idempotency, (immutability). Running a pipeline on the same inputs will always produce the same results. No side effects. https://medium.com/@maximebeauchemin/functional-data-engineering- a-modern-paradigm-for-batch-data-processing-2327ec32c42a !10 Make changing and testing things easy @martin_loetzsch Apply standard software engineering best practices
  • 11. from execute_sql import execute_sql_file, execute_sql_statment

 # create utility functions in PostgreSQL
 execute_sql_statment('DROP SCHEMA IF EXISTS util CASCADE; CREATE SCHEMA util;')
 execute_sql_statment('CREATE EXTENSION IF NOT EXISTS pg_trgm;')
 execute_sql_file('utils/indexes_and_constraints.sql', echo_queries=False)
 execute_sql_file('utils/schema_switching.sql', echo_queries=False)
 execute_sql_file('utils/consistency_checks.sql', echo_queries=False)

 # create tmp and dim_next schema
 execute_sql_statment('DROP SCHEMA IF EXISTS c_temp CASCADE; CREATE SCHEMA c_temp;')
 execute_sql_statment('DROP SCHEMA IF EXISTS dim_next CASCADE; CREATE SCHEMA dim_next;')

 # preprocess / combine entities
 execute_sql_file('contacts/preprocess_contacts_dates.sql')
 execute_sql_file('contacts/preprocess_contacts.sql')
 execute_sql_file('organisations/preprocess_organizations.sql')
 execute_sql_file('products/preprocess_products.sql')
 execute_sql_file('deals/preprocess_dealflow.sql')
 execute_sql_file('deals/preprocess_deal.sql')
 execute_sql_file('marketing/preprocess_website_visitors.sql')
 execute_sql_file('marketing/create_contacts_performance_attribution.sql')

 # create reporting schema, establish foreign keys, transfer aggregates
 execute_sql_file('contacts/transform_contacts.sql')
 execute_sql_file('organisations/transform_organizations.sql')
 execute_sql_file('organisations/create_org_data_set.sql')
 execute_sql_file('deals/transform_deal.sql')
 execute_sql_file('deals/flatten_deal_fact.sql')
 execute_sql_file('deals/create_deal_data_set.sql')
 execute_sql_file('marketing/transform_marketing_performance.sql')
 execute_sql_file('targets/preprocess_target.sql')
 execute_sql_file('targets/transform_target.sql')
 execute_sql_file('constrain_tables.sql')

 # Consistency checks
 execute_sql_file('consistency_checks.sql')

 # replace the current version of the reporting schema with the next
 execute_sql_statment("SELECT util.replace_schema('dim', 'dim_next')")

 !11 Simple scripts
 @martin_loetzsch
 SQL, python & bash
  • 12. !12 Schedule via Jenkins, Rundeck, etc.
 @martin_loetzsch
 Such a setup is usually enough for a lot of companies
  • 13. Works well
 For simple incremental transformations on large partitioned data sets
 For moving data

 Does not work so well
 When you have complex business logic / a lot of pipeline branching
 When you can't partition your data well

 Weird workarounds needed
 For finding out when something went wrong
 For running something again (need to manually delete task instances)
 For deploying things while pipelines are running
 For dynamic DAGs (decide at run-time what to run)

 Usually: BashOperator for everything
 High devops complexity

 !13 When you have actual big data: Airflow
 @martin_loetzsch
 Caveat: I know only very few teams that are very productive in using it
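 For readers who have not run Airflow before, here is a minimal sketch of the "BashOperator for everything" pattern mentioned above, assuming Airflow 2.x; the task ids, schedule and shell commands are made up for illustration:

 # A minimal sketch of the "BashOperator for everything" pattern, assuming Airflow 2.x.
 # Task ids, schedule and shell commands are hypothetical.
 from datetime import datetime

 from airflow import DAG
 from airflow.operators.bash import BashOperator

 with DAG(dag_id="dwh_etl",
          start_date=datetime(2019, 5, 1),
          schedule_interval="0 3 * * *",  # run every night at 3am
          catchup=False) as dag:

     load_orders = BashOperator(
         task_id="load_orders",
         bash_command="python app/pipelines/load_data/load_orders.py")

     transform_orders = BashOperator(
         task_id="transform_orders",
         bash_command="psql dwh --set ON_ERROR_STOP=on -f app/pipelines/orders/transform_order.sql")

     # declare the dependency between the two tasks
     load_orders >> transform_orders

 Each BashOperator only shells out to a script, so the actual ETL logic stays in plain SQL and Python files rather than in the scheduler.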
  • 14. Jinja-templated queries, DAGs created from dependencies

 with source as (
     select * from {{ var('ticket_tags_table') }}
 ),

 renamed as (
     select
         -- ids
         {{ dbt_utils.surrogate_key('_sdc_level_0_id', '_sdc_source_key_id') }} as tag_id,
         _sdc_source_key_id as ticket_id,

         -- fields
         nullif(lower(value), '') as ticket_tag_value
     from source
 )

 select * from renamed

 Additional meta data in YAML files

 version: 2
 models:
   - name: zendesk_organizations
     columns:
       - name: organization_id
         tests:
           - unique
           - not_null
   - name: zendesk_tickets
     columns:
       - name: ticket_id
         tests:
           - unique
           - not_null
   - name: zendesk_users
     columns:
       - name: user_id
         tests:
           - unique
           - not_null

 !14 Decent: DBT (data build tool)
 @martin_loetzsch
 Prerequisite: Data already in DB, everything can be done in SQL
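 When dbt is driven from an existing scheduler, the models and YAML tests above are typically executed through the dbt CLI; a minimal hedged sketch of such a pipeline step (project directory and target name are hypothetical):

 # Minimal sketch of running dbt models and their YAML-defined tests as a pipeline step.
 # The project directory and target name are hypothetical.
 import subprocess

 def run_dbt(command: str, project_dir: str = "dwh/dbt", target: str = "prod") -> None:
     """Runs a dbt command and fails the pipeline step on a non-zero exit code."""
     subprocess.run(["dbt", command, "--project-dir", project_dir, "--target", target],
                    check=True)

 run_dbt("run")   # builds the models (e.g. the renamed ticket tags above)
 run_dbt("test")  # executes the unique / not_null tests from the YAML files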
  • 15. !15 Mara: Things that worked for us @martin_loetzsch https://github.com/mara
  • 17. Runnable app Integrates PyPI project download stats with 
 Github repo events !17 Example: Python project stats data warehouse @martin_loetzsch https://github.com/mara/mara-example-project
  • 18. Example pipeline

 pipeline = Pipeline(
     id="pypi",
     description="Builds a PyPI downloads cube using the public ..")

 # ..

 pipeline.add(
     Task(id="transform_python_version", description='..',
          commands=[
              ExecuteSQL(sql_file_name="transform_python_version.sql")
          ]),
     upstreams=['read_download_counts'])

 pipeline.add(
     ParallelExecuteSQL(
         id="transform_download_counts", description="..",
         sql_statement="SELECT pypi_tmp.insert_download_counts(@chunk@::SMALLINT);",
         parameter_function=etl_tools.utils.chunk_parameter_function,
         parameter_placeholders=["@chunk@"],
         commands_before=[
             ExecuteSQL(sql_file_name="transform_download_counts.sql")
         ]),
     upstreams=["preprocess_project_version", "transform_installer",
                "transform_python_version"])

 !18 ETL pipelines as code
 @martin_loetzsch
 Pipeline = list of tasks with dependencies between them. Task = list of commands
  • 19. Target of computation 
 CREATE TABLE m_dim_next.region (
 region_id SMALLINT PRIMARY KEY,
 region_name TEXT NOT NULL UNIQUE,
 country_id SMALLINT NOT NULL,
 country_name TEXT NOT NULL,
 _region_name TEXT NOT NULL
 );
 
 Do computation and store result in table 
 WITH raw_region AS (SELECT DISTINCT country,
 region
 FROM m_data.ga_session
 ORDER BY country, region)
 
 INSERT INTO m_dim_next.region
 SELECT row_number() OVER (ORDER BY country, region) AS region_id,
        CASE WHEN (SELECT count(DISTINCT country)
                   FROM raw_region r2
                   WHERE r2.region = r1.region) > 1
             THEN region || ' / ' || country
             ELSE region END                    AS region_name,
        dense_rank() OVER (ORDER BY country)    AS country_id,
        country                                 AS country_name,
        region                                  AS _region_name
 FROM raw_region r1;
 INSERT INTO m_dim_next.region VALUES (-1, 'Unknown', -1, 'Unknown', 'Unknown');
 Speedup subsequent transformations 
 SELECT util.add_index(
     'm_dim_next', 'region',
     column_names := ARRAY ['_region_name', 'country_name', 'region_id']);

 SELECT util.add_index(
     'm_dim_next', 'region',
     column_names := ARRAY ['country_id', 'region_id']);
 
 ANALYZE m_dim_next.region;

 !19 PostgreSQL as a data processing engine
 @martin_loetzsch
 Leave data in DB, tables as (intermediate) results of processing steps
  • 20. Execute query 
 ExecuteSQL(sql_file_name="preprocess-ad.sql")

 cat app/data_integration/pipelines/facebook/preprocess-ad.sql \
   | PGTZ=Europe/Berlin PGOPTIONS=--client-min-messages=warning \
     psql --username=mloetzsch --host=localhost --echo-all \
          --no-psqlrc --set ON_ERROR_STOP=on kfz_dwh_etl
 Read file 
 ReadFile(file_name="country_iso_code.csv",
          compression=Compression.NONE,
          target_table="os_data.country_iso_code",
          mapper_script_file_name="read-country-iso-codes.py",
          delimiter_char=";")

 cat "dwh-data/country_iso_code.csv" \
   | .venv/bin/python3.6 "app/data_integration/pipelines/load_data/read-country-iso-codes.py" \
   | PGTZ=Europe/Berlin PGOPTIONS=--client-min-messages=warning \
     psql --username=mloetzsch --host=localhost --echo-all --no-psqlrc \
          --set ON_ERROR_STOP=on kfz_dwh_etl \
          --command="COPY os_data.country_iso_code FROM STDIN WITH CSV DELIMITER AS ';'"
 Copy from other databases 
 Copy(sql_file_name="pdm/load-product.sql", source_db_alias="pdm",
      target_table="os_data.product",
      replace={"@@db@@": "K24Pdm", "@@dbschema@@": "ps",
               "@@client@@": "kfzteile24 GmbH"})

 cat app/data_integration/pipelines/load_data/pdm/load-product.sql \
   | sed "s/@@db@@/K24Pdm/g;s/@@dbschema@@/ps/g;s/@@client@@/kfzteile24 GmbH/g" \
   | sed 's/$/$/g;s/$/$/g' \
   | (cat && echo ';') | (cat && echo '; go') \
   | sqsh -U ***** -P ******* -S ******* -D K24Pdm -m csv \
   | PGTZ=Europe/Berlin PGOPTIONS=--client-min-messages=warning \
     psql --username=mloetzsch --host=localhost --echo-all --no-psqlrc \
          --set ON_ERROR_STOP=on kfz_dwh_etl \
          --command="COPY os_data.product FROM STDIN WITH CSV HEADER"

 !20 Shell commands as interface to data & DBs
 @martin_loetzsch
 Nothing is faster than a unix pipe
  • 21. Read a set of files 
 pipeline.add(
     ParallelReadFile(
         id="read_download",
         description="Loads PyPI downloads from pre_downloaded csv files",
         file_pattern="*/*/*/pypi/downloads-v1.csv.gz",
         read_mode=ReadMode.ONLY_NEW,
         compression=Compression.GZIP,
         target_table="pypi_data.download",
         delimiter_char="\t", skip_header=True, csv_format=True,
         file_dependencies=read_download_file_dependencies,
         date_regex="^(?P<year>\d{4})/(?P<month>\d{2})/(?P<day>\d{2})/",
         partition_target_table_by_day_id=True,
         timezone="UTC",
         commands_before=[
             ExecuteSQL(
                 sql_file_name="create_download_data_table.sql",
                 file_dependencies=read_download_file_dependencies)
         ]))

 Split large joins into chunks

 pipeline.add(
     ParallelExecuteSQL(
         id="transform_download",
         description="Maps downloads to their dimensions",
         sql_statement="SELECT pypi_tmp.insert_download(@chunk@::SMALLINT);",
         parameter_function=etl_tools.utils.chunk_parameter_function,
         parameter_placeholders=["@chunk@"],
         commands_before=[
             ExecuteSQL(sql_file_name="transform_download.sql")
         ]),
     upstreams=["preprocess_project_version", "transform_installer"])

 !21 Incremental & parallel processing
 @martin_loetzsch
 You can't join all clicks with all customers at once
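 The chunk_parameter_function above comes from the author's etl_tools package and its implementation is not shown on the slide; a plausible sketch of the underlying idea is below, where each parallel task receives one chunk id and the SQL side only processes rows whose key hashes to that chunk:

 # Hedged sketch of a chunking helper (not the actual etl_tools implementation):
 # each parallel task gets one chunk id, and the SQL side restricts itself to rows
 # whose key hashes to that chunk, e.g. WHERE abs(user_id) % 16 = @chunk@.
 def chunk_parameter_function(number_of_chunks: int = 16) -> list:
     """Returns one parameter tuple per chunk, substituted for @chunk@ in the SQL statement."""
     return [(chunk,) for chunk in range(number_of_chunks)]

 print(chunk_parameter_function(4))  # [(0,), (1,), (2,), (3,)]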
  • 23. Always: At least two out of
 VPN
 IP restriction
 SSH tunnel
 SSH keys
 Single sign-on

 Don't rely on passwords, they tend to be shared and are not changed when someone leaves the company

 !23 Please don't put your data on the internet
 @martin_loetzsch
 Also: GDPR
  • 24. Authenticates each incoming request against an auth provider

 We run it in front of all web interfaces
 Including Tableau, Jenkins, Metabase
 Many auth providers: Google, Azure AD, etc.
 Much better than passwords!

 !24 SSO with oauth2_proxy
 @martin_loetzsch
 https://github.com/pusher/oauth2_proxy
  • 25. Implemented using the auth_request directive in nginx

 server {
     listen 443 default ssl;
     include ssl.conf;

     location /oauth2/ {
         proxy_pass http://127.0.0.1:4180;
         proxy_set_header Host $host;
         proxy_set_header X-Real-IP $remote_addr;
         proxy_set_header X-Scheme $scheme;
         proxy_set_header X-Forwarded-Proto $scheme;
         proxy_set_header X-Auth-Request-Redirect $request_uri;
     }

     location = /oauth2/auth {
         proxy_pass http://127.0.0.1:4180;
         proxy_set_header Host $host;
         proxy_set_header X-Real-IP $remote_addr;
         proxy_set_header X-Scheme $scheme;
         proxy_set_header X-Forwarded-Proto $scheme;
         # nginx auth_request includes headers but not body
         proxy_set_header Content-Length "";
         proxy_pass_request_body off;
     }

     location / {
         auth_request /oauth2/auth;
         error_page 401 = /oauth2/sign_in;

         auth_request_set $email $upstream_http_x_auth_request_email;
         proxy_set_header X-Forwarded-Email $email;

         auth_request_set $auth_cookie $upstream_http_set_cookie;
         add_header Set-Cookie $auth_cookie;

         proxy_set_header Host $host;
         proxy_set_header X-Scheme $scheme;
         proxy_set_header X-Real-IP $remote_addr;
         proxy_set_header X-Forwarded-Proto $scheme;

         proxy_send_timeout 600;
         proxy_read_timeout 600;
         proxy_buffering off;
         send_timeout 600;

         # pass to downstreams
         proxy_pass http://127.0.0.1:81;
     }

     access_log /bi/logs/nginx/access-443-default.log;
     error_log /bi/logs/nginx/error-443-default.log;
 }

 !25 SSO with oauth2_proxy II
 @martin_loetzsch
 https://github.com/pusher/oauth2_proxy
  • 27. Ad Blockers

 Blind Spots
 Not all user interactions happen online (returns, call center requests)
 Some data can not be leaked to the pixel: segments, prices, backend logic

 Solution: server side tracking
 Collect events on the server rather than on the client

 Use cases & advantages
 Ground truth: correct metrics recorded in marketing tools
 Price: provide a cheaper alternative to GA Premium / Segment etc.
 GDPR compliance: own data, avoid black boxes
 Product analytics: understand bottlenecks in the user journey
 Unified user journey: combine events from multiple touchpoints
 Site speed: front-ends are not slowed down by analytics pixels
 SEO: measure site indexing by search engines

 !27 Pixel based tracking is dead
 @martin_loetzsch
 Server side tracking: surprisingly easy
  • 28. !28 Works: Kinesis + AWS Lambda + S3 + BigQuery
 @martin_loetzsch
 Technologies don't matter: would also work with Google Cloud & Azure

 User → Project A Website → AWS Kinesis Firehose → AWS Lambda → Amazon S3 / Google BigQuery
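 The following slides show a PHP implementation for the Kirby-based website; for a Python backend, the equivalent step of pushing a tracking event onto the Firehose delivery stream might look roughly like this (stream name and event fields are illustrative, boto3 credentials are assumed to come from the environment):

 # Hedged sketch: sending a tracking event to a Kinesis Firehose delivery stream
 # from a Python backend. Stream name and event fields are illustrative.
 import json
 from datetime import datetime, timezone

 import boto3

 firehose = boto3.client("firehose")  # credentials come from the environment / IAM role

 event = {
     "visitor_id": "10d1fa9c9fd39cff44c88bd551b1ab4dfe92b3da",
     "session_id": "9tv1phlqkl5kchajmb9k2j2434",
     "timestamp": datetime.now(timezone.utc).isoformat(),
     "url": "https://www.example.com/some/page",
     "referrer": None,
     "language": "en",
 }

 firehose.put_record(
     DeliveryStreamName="kinesis-firehose-stream-123",  # hypothetical stream name
     Record={"Data": (json.dumps(event) + "\n").encode("utf-8")})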
  • 29. Everything the backend knows about the user, the context, the content

 {
   "visitor_id": "10d1fa9c9fd39cff44c88bd551b1ab4dfe92b3da",
   "session_id": "9tv1phlqkl5kchajmb9k2j2434",
   "timestamp": "2018-12-16T16:03:04+00:00",
   "ip": "92.195.48.163",
   "url": "https://www.project-a.com/en/career/jobs/data-engineer-data-scientist-m-f-d-4072020002?gh_jid=4082751002&gh_src=9fcd30182&&utm_medium=social&u",
   "host": "www.project-a.com",
   "path": ["en", "career", "jobs", "data-engineer-data-scientist-m-f-d-4072020002"],
   "query": {
     "gh_jid": "4082751002",
     "gh_src": "9fcd30182",
     "utm_medium": "social",
     "utm_source": "linkedin",
     "utm_campaign": "buffer"
   },
   "referrer": null,
   "language": "en",
   "ua": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36"
 }

 !29 Backend collects user interaction data
 @martin_loetzsch
 And sends it asynchronously to a queue
  • 30. Kirby tracking plugin for Project A web site

 <?php
 require 'vendor/autoload.php';

 // this cookie is set when not present
 $cookieName = 'visitor';

 // retrieve visitor id from cookie
 $visitorId = array_key_exists($cookieName, $_COOKIE) ? $_COOKIE[$cookieName] : null;

 if (!$visitorId) {
     // visitor cookie not set. Use session id as visitor ID
     $visitorId = sha1(uniqid(s::id(), true));
     setcookie($cookieName, $visitorId, time() + (2 * 365 * 24 * 60 * 60),
               '/', 'project-a.com', false, true);
 }

 $request = kirby()->request();

 // the payload to log
 $event = [
     'visitor_id' => $visitorId,
     'session_id' => s::id(),
     'timestamp'  => date('c'),
     'ip'         => $request->__debuginfo()['ip'],
     'url'        => $request->url(),
     'host'       => $_SERVER['SERVER_NAME'],
     'path'       => $request->path()->toArray(),
     'query'      => $request->query()->toArray(),
     'referrer'   => visitor::referer(),
     'language'   => visitor::acceptedLanguageCode(),
     'ua'         => visitor::userAgent()
 ];

 $firehoseClient = new Aws\Firehose\FirehoseClient([ /* secrets */ ]);

 // publish message to firehose delivery stream
 $promise = $firehoseClient->putRecordAsync([
     'DeliveryStreamName' => 'kinesis-firehose-stream-123',
     'Record' => ['Data' => json_encode($event)]
 ]);

 register_shutdown_function(function () use ($promise) {
     $promise->wait();
 });

 !30 Custom implementation for each web site
 @martin_loetzsch
 Good news: there is usually already code that populates the data layer
  • 31. Result: enhanced with GEO information & device detection

 {
   "visitor_id": "10d1fa9c9fd39cff44c88bd551b1ab4dfe92b3da",
   "session_id": "9tv1phlqkl5kchajmb9k2j2434",
   "timestamp": "2018-12-16T16:03:04+00:00",
   "url": "https://www.project-a.com/en/career/jobs/data-engineer-data-scientist-m-f-d-4072020002?gh_jid=4082751002&gh_src=9fcd30182&&utm_medium=social&utm_source=linkedin&utm_campaign=buffer",
   "host": "www.project-a.com",
   "path": ["en", "career", "jobs", "data-engineer-data-scientist-m-f-d-4072020002"],
   "query": [
     {"param": "gh_jid", "value": "4082751002"},
     {"param": "gh_src", "value": "9fcd30182"},
     {"param": "utm_medium", "value": "social"},
     {"param": "utm_source", "value": "linkedin"},
     {"param": "utm_campaign", "value": "buffer"}
   ],
   "referrer": null,
   "language": "en",
   "browser_family": "Chrome",
   "browser_version": "70",
   "os_family": "Mac OS X",
   "os_version": "10",
   "device_brand": null,
   "device_model": null,
   "country_iso_code": "DE",
   "country_name": "Germany",
   "subdivisions_iso_code": "SN",
   "subdivisions_name": "Saxony",
   "city_name": "Dresden"
 }

 !31 Lambda function transforms and stores events
 @martin_loetzsch
 GDPR: remove IP address, user agent
  • 32. Lambda function part I

 import base64
 import functools
 import json

 import geoip2.database
 from google.cloud import bigquery
 from ua_parser import user_agent_parser

 @functools.lru_cache(maxsize=None)
 def get_geo_db():
     return geoip2.database.Reader('./GeoLite2-City_20181002/GeoLite2-City.mmdb')

 def extract_geo_data(ip):
     """Does a geo lookup for an IP address"""
     response = get_geo_db().city(ip)
     return {
         'country_iso_code': response.country.iso_code,
         'country_name': response.country.name,
         'subdivisions_iso_code': response.subdivisions.most_specific.iso_code,
         'subdivisions_name': response.subdivisions.most_specific.name,
         'city_name': response.city.name
     }

 def parse_user_agent(user_agent_string):
     """Extracts browser, OS and device information from a user agent"""
     result = user_agent_parser.Parse(user_agent_string)
     return {
         'browser_family': result['user_agent']['family'],
         'browser_version': result['user_agent']['major'],
         'os_family': result['os']['family'],
         'os_version': result['os']['major'],
         'device_brand': result['device']['brand'],
         'device_model': result['device']['model']
     }

 !32 Geo Lookup & device detection
 @martin_loetzsch
 Solved problem, very good open source libraries exist
  • 33. Lambda function part II

 def lambda_handler(event, context):
     lambda_output_records = []
     rows_for_biguery = []

     bq_client = bigquery.Client.from_service_account_json(
         'big-query-credendials.json')

     for record in event['records']:
         message = json.loads(base64.b64decode(record['data']))

         # extract browser, device, os
         if message['ua']:
             message.update(parse_user_agent(message['ua']))
         del message['ua']

         # geo lookup for ip address
         message.update(extract_geo_data(message['ip']))
         del message['ip']

         # update get parameters
         if message['query']:
             message['query'] = [
                 {'param': param, 'value': value}
                 for param, value in message['query'].items()
             ]

         rows_for_biguery.append(message)

         lambda_output_records.append({
             'recordId': record['recordId'],
             'result': 'Ok',
             'data': base64.b64encode(
                 json.dumps(message).encode('utf-8')).decode('utf-8')
         })

     errors = bq_client.insert_rows(
         bq_client.get_table(
             bq_client.dataset('server_side_tracking')
                 .table('project_a_website_events')),
         rows_for_biguery)

     if errors != []:
         raise Exception(json.dumps(errors))

     return {"statusCode": 200, "records": lambda_output_records}

 !33 Send events to BigQuery & S3
 @martin_loetzsch
 Also possible: send to Google Analytics, Mixpanel, Segment, Heap etc.
  • 35. !35 Bonus track: consistency & correctness @martin_loetzsch
  • 36. It’s easy to make mistakes during ETL 
 DROP SCHEMA IF EXISTS s CASCADE; CREATE SCHEMA s;
 
 CREATE TABLE s.city ( city_id SMALLINT, city_name TEXT, country_name TEXT );
 INSERT INTO s.city VALUES (1, 'Berlin', 'Germany'), (2, 'Budapest', 'Hungary');
 
 CREATE TABLE s.customer ( customer_id BIGINT, city_fk SMALLINT );
 INSERT INTO s.customer VALUES (1, 1), (1, 2), (2, 3);
 Customers per country? 
 SELECT country_name, count(*) AS number_of_customers
 FROM s.customer
 JOIN s.city ON customer.city_fk = s.city.city_id
 GROUP BY country_name;
 
 Back up all assumptions about data by constraints 
 ALTER TABLE s.city ADD PRIMARY KEY (city_id);
 ALTER TABLE s.city ADD UNIQUE (city_name);
 ALTER TABLE s.city ADD UNIQUE (city_name, country_name);

 ALTER TABLE s.customer ADD PRIMARY KEY (customer_id);
 [23505] ERROR: could not create unique index "customer_pkey"
 Detail: Key (customer_id)=(1) is duplicated.

 ALTER TABLE s.customer ADD FOREIGN KEY (city_fk) REFERENCES s.city (city_id);
 [23503] ERROR: insert or update on table "customer" violates foreign key constraint "customer_city_fk_fkey"
 Detail: Key (city_fk)=(3) is not present in table "city"

 !36 Referential consistency
 @martin_loetzsch
 Only very little overhead, will save your ass
  • 37. (schema diagram, 2017-10-18-dwh-schema-pav.svg: customer (customer_id, first_order_fk, favourite_product_fk, lifetime_revenue),
 product (product_id, revenue_last_6_months),
 order (order_id, processed_order_id, customer_fk, product_fk, revenue))

 Never repeat "business logic"

 SELECT sum(total_price) AS revenue
 FROM os_data.order
 WHERE status IN ('pending', 'accepted', 'completed', 'proposal_for_change');

 SELECT CASE WHEN (status <> 'started'
                   AND payment_status = 'authorised'
                   AND order_type <> 'backend')
             THEN o.order_id END AS processed_order_fk
 FROM os_data.order;

 SELECT (last_status = 'pending') :: INTEGER AS is_unprocessed
 FROM os_data.order;

 Refactor pipeline
 Create separate task that computes everything we know about an order
 Usually difficult in real life

 Load → preprocess → transform → flatten-fact
 (load-product, load-order, load-customer → preprocess-product, preprocess-order, preprocess-customer
 → transform-product, transform-order, transform-customer → flatten-product-fact, flatten-order-fact, flatten-customer-fact)

 !37 Computational consistency
 @martin_loetzsch
 Requires discipline
  • 38. CREATE FUNCTION m_tmp.normalize_utm_source(TEXT)
   RETURNS TEXT AS $$
 SELECT CASE
          WHEN $1 LIKE '%.%' THEN lower($1)
          WHEN $1 = '(direct)' THEN 'Direct'
          WHEN $1 LIKE 'Untracked%' OR $1 LIKE '(%)' THEN $1
          ELSE initcap($1) END;
 $$ LANGUAGE SQL IMMUTABLE;

 CREATE FUNCTION util.norm_phone_number(phone_number TEXT)
   RETURNS TEXT AS $$
 BEGIN
   phone_number := TRIM(phone_number);
   phone_number := regexp_replace(phone_number, '\(0\)', '');
   phone_number := regexp_replace(phone_number, '[^[:digit:]]', '', 'g');
   phone_number := regexp_replace(phone_number, '^(\+49|0049|49)', '0');
   phone_number := regexp_replace(phone_number, '^(00)', '');
   phone_number := COALESCE(phone_number, '');
   RETURN phone_number;
 END;
 $$ LANGUAGE PLPGSQL IMMUTABLE;

 CREATE FUNCTION m_tmp.compute_ad_id(id BIGINT, api m_tmp.API)
   RETURNS BIGINT AS $$
 -- creates a collision free ad id from an id in a source system
 SELECT ((CASE api
            WHEN 'adwords' THEN 1
            WHEN 'bing' THEN 2
            WHEN 'criteo' THEN 3
            WHEN 'facebook' THEN 4
            WHEN 'backend' THEN 5
          END) * 10 ^ 18) :: BIGINT + id
 $$ LANGUAGE SQL IMMUTABLE;

 CREATE FUNCTION pv.date_to_supplier_period_start(INTEGER)
   RETURNS INTEGER AS $$
 -- this maps all dates to either an integer which is included
 -- in lieferantenrabatt.period_start or
 -- null (meaning we don't have a lieferantenrabatt for it)
 SELECT CASE
          WHEN $1 >= 20170501 THEN 20170501
          WHEN $1 >= 20151231 THEN 20151231
          ELSE 20151231 END;
 $$ LANGUAGE SQL IMMUTABLE;

 !38 When not possible: use functions
 @martin_loetzsch
 Almost no performance overhead
  • 39. Check for "lost" rows

 SELECT util.assert_equal(
     'The order items fact table should contain all order items',
     'SELECT count(*) FROM os_dim.order_item',
     'SELECT count(*) FROM os_dim.order_items_fact');

 Check consistency across cubes / domains

 SELECT util.assert_almost_equal(
     'The number of first orders should be the same in orders and marketing touchpoints cube',
     'SELECT count(net_order_id) FROM os_dim.order WHERE _net_order_rank = 1;',
     'SELECT (SELECT sum(number_of_first_net_orders) FROM m_dim.acquisition_performance)
             / (SELECT count(*) FROM m_dim.performance_attribution_model)',
     1.0);

 Check completeness of source data

 SELECT util.assert_not_found(
     'Each adwords campaign must have the attribute "Channel"',
     'SELECT DISTINCT campaign_name, account_name
      FROM aw_tmp.ad
      JOIN aw_dim.ad_performance ON ad_fk = ad_id
      WHERE attributes->>''Channel'' IS NULL
        AND impressions > 0
        AND _date > now() - INTERVAL ''30 days''');

 Check correctness of redistribution transformations

 SELECT util.assert_almost_equal_relative(
     'The cost of non-converting touchpoints must match the redistributed customer acquisition and reactivation cost',
     'SELECT sum(cost) FROM m_tmp.cost_of_non_converting_touchpoints;',
     'SELECT (SELECT sum(cost_per_touchpoint * number_of_touchpoints)
              FROM m_tmp.redistributed_customer_acquisition_cost)
             + (SELECT sum(cost_per_touchpoint * number_of_touchpoints)
                FROM m_tmp.redistributed_customer_reactivation_cost);',
     0.00001);

 !39 Data consistency checks
 @martin_loetzsch
 Makes changing things easy
  • 40. Execute queries and compare results
 CREATE FUNCTION util.assert(description TEXT, query TEXT)
   RETURNS BOOLEAN AS $$
 DECLARE
   succeeded BOOLEAN;
 BEGIN
   EXECUTE query INTO succeeded;
   IF NOT succeeded THEN
     RAISE EXCEPTION 'assertion failed: # % # %', description, query;
   END IF;
   RETURN succeeded;
 END
 $$ LANGUAGE 'plpgsql';

 CREATE FUNCTION util.assert_almost_equal_relative(
   description TEXT, query1 TEXT, query2 TEXT, percentage DECIMAL)
   RETURNS BOOLEAN AS $$
 DECLARE
   result1   NUMERIC;
   result2   NUMERIC;
   succeeded BOOLEAN;
 BEGIN
   EXECUTE query1 INTO result1;
   EXECUTE query2 INTO result2;
   EXECUTE 'SELECT abs(' || result2 || ' - ' || result1 || ') / ' || result1
           || ' < ' || percentage INTO succeeded;
   IF NOT succeeded THEN
     RAISE WARNING '% assertion failed: abs(% - %) / % < % %: (%) %: (%)',
       description, result2, result1, result1, percentage, result1, query1, result2, query2;
   END IF;
   RETURN succeeded;
 END
 $$ LANGUAGE 'plpgsql';

 !40 Consistency check functions
 @martin_loetzsch
 Also: assert_not_found, assert_equal_table, assert_smaller_than_or_equal
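 In a pipeline like the ones shown earlier, such checks can simply run as another task so that a failing assertion aborts the run; a hedged sketch using psycopg2 (connection string and the particular check are illustrative):

 # Hedged sketch: running the consistency checks as a pipeline step. A failing
 # util.assert* function raises an exception inside PostgreSQL, which surfaces
 # here as an error and fails the ETL run. Connection string and check are illustrative.
 import psycopg2

 def run_consistency_checks(dsn: str = "dbname=dwh") -> None:
     with psycopg2.connect(dsn) as connection:
         with connection.cursor() as cursor:
             cursor.execute("""
                 SELECT util.assert_equal(
                     'The order items fact table should contain all order items',
                     'SELECT count(*) FROM os_dim.order_item',
                     'SELECT count(*) FROM os_dim.order_items_fact');""")

 if __name__ == "__main__":
     run_consistency_checks()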
  • 41. Yes, unit tests

 SELECT util.assert_value_equal('test_german_number_with_country_prefix', util.norm_phone_number('00491234'), '01234');
 SELECT util.assert_value_equal('test_german_number_not_to_be_confused_with_country_prefix', util.norm_phone_number('0491234'), '0491234');
 SELECT util.assert_value_equal('test_non_german_number_with_plus', util.norm_phone_number('+44 1234'), '441234');
 SELECT util.assert_value_equal('test_german_number_with_prefix_and_additional_zero', util.norm_phone_number('+49 (0)1234'), '01234');
 SELECT util.assert_value_equal('test__trim', util.norm_phone_number(' 0491234 '), '0491234');
 SELECT util.assert_value_equal('test_number_with_leading_wildcard_symbol', util.norm_phone_number('*+436504834933'), '436504834933');
 SELECT util.assert_value_equal('test_NULL', util.norm_phone_number(NULL), '');
 SELECT util.assert_value_equal('test_empty', util.norm_phone_number(''), '');
 SELECT util.assert_value_equal('test_wildcard_only', util.norm_phone_number('*'), '');
 SELECT util.assert_value_equal('test_foreign_number_with_two_leading_zeroes', util.norm_phone_number('*00436769553701'), '436769553701');
 SELECT util.assert_value_equal('test_domestic_number_with_trailing_letters', util.norm_phone_number('017678402HORST'), '017678402');
 SELECT util.assert_value_equal('test_domestic_number_with_leading_letters', util.norm_phone_number('HORST017678402'), '017678402');
 SELECT util.assert_value_equal('test_domestic_number_with_letters_in_between', util.norm_phone_number('0H1O7R6S7T8402'), '017678402');
 SELECT util.assert_value_equal('test_german_number_with_country_prefix_and_leading_letters', util.norm_phone_number('HORST00491234'), '01234');
 SELECT util.assert_value_equal('test_german_number_not_to_be_confused_with_country_prefix_and_leading_letters', util.norm_phone_number('HORST0491234'), '0491234');
 SELECT util.assert_value_equal('test_non_german_number_with_plus_and_leading_letters', util.norm_phone_number('HORST+44 1234'), '441234');
 SELECT util.assert_value_equal('test_german_number_with_prefix_and_additional_zero_and_leading_letters', util.norm_phone_number('HORST+49 (0)1234'), '01234');

 !41 Unit tests
 @martin_loetzsch
 People enter horrible telephone numbers into websites
  • 42. Contribution margin 3a

 SELECT
   order_item_id,
   ((((((COALESCE(item_net_price, 0)::REAL + COALESCE(net_shipping_revenue, 0)::REAL)
        - ((COALESCE(item_net_purchase_price, 0)::REAL + COALESCE(alcohol_tax, 0)::REAL) + COALESCE(import_tax, 0)::REAL))
       - (COALESCE(net_fulfillment_costs, 0)::REAL + COALESCE(net_payment_costs, 0)::REAL))
      - COALESCE(net_return_costs, 0)::REAL)
     - ((COALESCE(item_net_price, 0)::REAL + COALESCE(net_shipping_revenue, 0)::REAL)
        - ((((COALESCE(item_net_price, 0)::REAL + COALESCE(item_tax_amount, 0)::REAL) + COALESCE(gross_shipping_revenue, 0)::REAL)
            - COALESCE(voucher_gross_amount, 0)::REAL)
           * (1 - ((COALESCE(item_tax_amount, 0)::REAL + (COALESCE(gross_shipping_revenue, 0)::REAL - COALESCE(net_shipping_revenue, 0)::REAL))
                   / NULLIF(((COALESCE(item_net_price, 0)::REAL + COALESCE(item_tax_amount, 0)::REAL) + COALESCE(gross_shipping_revenue, 0)::REAL), 0))))))
    - COALESCE(goodie_cost_per_item, 0)::REAL) :: DOUBLE PRECISION AS "Contribution margin 3a"
 FROM dim.sales_fact;

 Use schemas between reporting and database
 Mondrian
 LookML
 your own

 Or: Pre-compute metrics in database

 !42 Semantic consistency
 @martin_loetzsch
 Changing the meaning of metrics across all dashboards needs to be easy