A talk by Sebastian Herold & Dr. Arif Wider at TDWI 2018 Munich.
Abstract:
More and more companies migrate their monolithic applications to a microservices architecture. However, this has made maintaining a consistent and usable data landscape more challenging: huge amounts of structured and unstructured data, and hundreds of data sources.
Furthermore, data-driven product development multiplies the analytics requirements: every product team needs constantly updated and specially tailored metrics, which often combine product-specific data with company-wide data.
Having a centralized data team does not scale in this setting, as it becomes the bottleneck between data producers and data consumers.
We created a Manifesto based on five general themes which break with the traditional separation of roles and show a way to deal with distributed data in a federated and scalable fashion. This leads to DataDev: a culture shift similar to DevOps in which application developers own their data and take over responsibilities for data & analytics.
Learn about our experiences and best practices in facilitating this cultural transformation at Zalando, one of Europe's largest online fashion platforms.
DataDevOps: A Manifesto for a DevOps-like Culture Shift in Data & Analytics
1. #DataDevOps
A MANIFESTO FOR A DEVOPS-LIKE CULTURE SHIFT IN DATA & ANALYTICS
SEBASTIAN HEROLD
DR. ARIF WIDER
2018-06-26
MUNICH
2. 2
Sebastian Herold
Big Data Architect @ Zalando
@heroldamus
Previously 7 years @ Scout24
TDWI’18 Munich - DataDevOps - Sebastian Herold & Arif Wider
4. 4
Data Challenges
Data Manifesto
AI Empowerment
Data Architecture
Data-Driven Company
AGENDA
DataDevOps Culture
5. 5
ZALANDO IN NUMBERS
~4.5 billion EURO revenue 2017
> 300,000 product choices as at June 2018
> 75% of visits via mobile devices
> 200 million visits per month
> 23 million active customers
> 15,000 employees in Europe
17 countries
~ 2,000 brands
6. 6
ZALANDO IN NUMBERS
> 2 GB/s on Kafka read
> 2000 People in Tech
> 250 Dev Teams
> 2000 MSTR Users
> 260 AWS Accounts
> 150 Data Scientists
15. 15
AI SKILLS SHIFT
[Figure: axes AI MATURITY (BEGINNERS to EXPERTS) and TIME; series 2017, 2018, 2019]
16. 16
FACETS OF INSTITUTIONALISING
Processes Infrastructure Data Quality
Education Marketing Serving Events Data
Metadata Sharability Compliance Connectivity Guilds
17. 17
DIFFERENT OFFERS FOR DIFFERENT PEOPLE & STEPS
DATA PRODUCT JOURNEY: Learn, Define, Explore, Extract, Model, Serve, Observe
TEAM AI MATURITY: “LEVEL ZERO”, ANALYTICS & REPORTING, AI EXPERTS
Basic Training Offers, BI Consulting, AI Consulting, AI Literacy Training, Expert Training, Data Science Guild
MSTR, MS Excel, Data Catalog incl. metadata, SQL Engine / SuperSet, Kafka, Jupyter Notebook Hub, ETL, RStudio, Shiny, Spark
18. 18
WHERE WE CAME FROM: DISTRIBUTED DATA PLATFORMS
ZALON’S DATA PLATFORM
FASHION STORE’S DATA PLATFORM
OTHER BUSINESS UNIT’S DATA PLATFORM
BI PLATFORM
OTHER BUSINESS UNIT’S DATA PLATFORM
CENTRAL DATA PLATFORM
19. 19
WHERE WE WANT TO GO: INTEGRATED DATA PLATFORM
FASHION STORE’S DATA PLATFORM
OTHER BUSINESS UNIT’S DATA PLATFORM
OTHER BUSINESS UNIT’S DATA PLATFORM
CENTRAL DATA PLATFORM
ZALON’S DATA PLATFORM
20. 20
DATA PLATFORM DESIGN PRINCIPLES
CLOUD FIRST: INNOVATION, SCALABILITY, FLEXIBILITY
DATA FRESHNESS & QUALITY: STREAMING, MICRO-BATCHING
MERGE ANALYTICS AND DATA SCIENCE: BI, AI
EMPOWER CONSUMERS AND PRODUCERS: SELF-SERVICE, RESPONSIBILITIES, TOOLING, ROLES, METADATA, PROCESSES
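The design principles slide names micro-batching next to streaming as a way to keep data fresh. As a minimal sketch of the general idea only (not the platform's actual implementation), the following groups an incoming stream of events into small fixed-size batches. The function name `micro_batches` and the `batch_size` parameter are hypothetical and chosen for illustration.

```python
from itertools import islice


def micro_batches(events, batch_size):
    """Yield consecutive lists of at most `batch_size` events.

    Hypothetical helper: illustrates micro-batching in general,
    not any specific component of the platform described in the talk.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(events)
    while True:
        # Take the next chunk from the stream; an empty chunk means the stream ended.
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Seven events processed in batches of three; the last batch is smaller.
batches = list(micro_batches(range(7), batch_size=3))
```

Smaller batches shorten the delay until data is available downstream, at the cost of more processing runs.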
24. 24
DATA PLATFORM ARCHITECTURE
Layers: Data Sources, Ingestion, Storage, Processing
Data Sources: DBs, MicroServices, GAP/Others
Components: Event-Bus; Batch or Delta Loads + CDC; Connectors; Data Storage; Data Catalog; Orchestration; Batch Process.; Acceleration Layer; SQL Engine; Stream Process.; Data Gateway; Model Repo; Model Serving
Legend: Metadata Flow, Data Flow
25. 25
DATA PLATFORM ARCHITECTURE
Layers: Data Sources, Ingestion, Storage, Processing, Access
Data Sources: DBs, MicroServices, GAP/Others
Components: Event-Bus; Batch or Delta Loads + CDC; Connectors; Data Storage; Data Catalog; Orchestration; Batch Process.; Acceleration Layer; SQL Engine; Stream Process.; Data Gateway; Model Repo; Model Serving
Access: BI Tools, SQL/Apps, Notebooks, Data Catalog UI
Legend: Metadata Flow, Data Flow
26. 26
DATA PLATFORM ARCHITECTURE
Governance: Processes & Glossary
Layers: Data Sources, Ingestion, Storage, Processing, Access
Data Sources: DBs, MicroServices, GAP/Others
Components: Event-Bus; Batch or Delta Loads + CDC; Connectors; Data Storage; Data Catalog; Orchestration; Batch Process.; Acceleration Layer; SQL Engine; Stream Process.; Data Gateway; Model Repo; Model Serving
Access: BI Tools, SQL/Apps, Notebooks, Data Catalog UI
Legend: Metadata Flow, Data Flow
27. 27
MERGE OF BI AND DATA SCIENCE JOURNEY
BI Product Journey ≈ AI Product Journey
Learn Define Explore Extract Model Serve Observe
28. 28
MERGE OF BI AND DATA SCIENCE JOURNEY
BI Product Journey ≈ AI Product Journey
Learn Define Explore Extract Model Serve Observe
LET’S FOCUS ON THE TECHNICAL PART!
41. 41
THE DATA DRIVEN COMPANY
“The McKinsey Global Institute indicates that data driven organizations
are 23 times more likely to acquire customers, 6 times as likely to retain
those customers, and 19 times as likely to be profitable as a result.”
What does “data driven” mean?
● Data is a key asset of the company
● All decisions in the company (products and processes) are data-driven, i.e. based on objective data insights
● Data Analytics and Data Science are commonplace in the company
● A company-wide data architecture is in place
● Company-wide data governance rules are in place
Source: https://www.mckinsey.com/business-functions/marketing-and-sales/our-insights/five-facts-how-customer-analytics-boosts-corporate-performance
42. 42
DATA MANIFESTO
THEMES FOR A DATA-DRIVEN COMPANY AT SCALE
50. 50
ROLES & RESPONSIBILITIES
AWS
CENTRAL DATA LAKE ON S3
DATA CATALOG
DATA INFRA
PRODUCER: CHECKOUT SERVICE
CONSUMER: SPECIAL OFFER SERVICE
51. 51
ROLES & RESPONSIBILITIES
AWS
CENTRAL DATA LAKE ON S3
DATA CATALOG
DATA INFRA
ORDER EVENTS
EVENT METADATA
PRODUCER: CHECKOUT SERVICE
CONSUMER: SPECIAL OFFER SERVICE
52. 52
ROLES & RESPONSIBILITIES
AWS
CENTRAL DATA LAKE ON S3
DATA CATALOG
DATA INFRA
ORDER EVENTS
EVENT METADATA
INGESTION TEMPLATE
PRODUCER: CHECKOUT SERVICE
CONSUMER: SPECIAL OFFER SERVICE
53. 53
ROLES & RESPONSIBILITIES
AWS
CENTRAL DATA LAKE ON S3
DATA CATALOG
DATA INFRA
ORDER EVENTS
EVENT METADATA
INGESTION TEMPLATE
VIEW: ORDER HISTORY BY USER
PRODUCER: CHECKOUT SERVICE
CONSUMER: SPECIAL OFFER SERVICE
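The roles on these slides can be illustrated with a small sketch: a producer (the checkout service) publishes order events together with event metadata, and a consumer (the special offer service) derives a view such as "order history by user". Everything below is an assumption made for illustration: the `OrderEvent` fields, the metadata keys, and the function `order_history_by_user` are hypothetical and do not come from the talk.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class OrderEvent:
    """Hypothetical shape of an order event published by the producer."""
    order_id: str
    user_id: str
    occurred_at: str  # ISO 8601 timestamp, all in the same format and time zone


# Hypothetical event metadata the producer would register in the data catalog.
ORDER_EVENT_METADATA = {
    "name": "order-events",
    "owner": "checkout-service",
    "fields": ["order_id", "user_id", "occurred_at"],
}


def order_history_by_user(events):
    """Consumer-side view: order ids per user, oldest first."""
    history = defaultdict(list)
    # Same-format ISO 8601 timestamps sort chronologically as strings.
    for event in sorted(events, key=lambda e: e.occurred_at):
        history[event.user_id].append(event.order_id)
    return dict(history)


events = [
    OrderEvent("o-2", "alice", "2018-06-02T10:00:00Z"),
    OrderEvent("o-1", "alice", "2018-06-01T09:00:00Z"),
    OrderEvent("o-3", "bob", "2018-06-03T12:00:00Z"),
]
view = order_history_by_user(events)
```

The point of the sketch is the division of work: the producer owns the event and its metadata, and the consumer can build its own view without going through a central data team.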
55. 55
DATADEVOPS
WHAT IS DEVOPS?
Distributed Ops skills
Shared Ops responsibilities
Self-service platforms
Cross-functional dev teams
56. 56
DATADEVOPS
WHAT IS DATADEVOPS?
Distributed Data skills
Shared Data responsibilities
Self-service Data platform
Cross-functional product teams
57. 57
Consequences for Product Teams
‣ Think about data & reporting
‣ Deliver your data to the lake
‣ Provide metadata
‣ Eat your own dog food: Consume your own data
58. 58
Benefits for Product Teams
‣ Independently work with data
‣ No dependencies on data teams
‣ It’s easy to consume data produced by other teams
‣ Faster product & measurement iterations