Data pipelines observability: OpenLineage & Marquez


Julien Le Dem

Data driven jan 2021

Data Pipelines Observability
OpenLineage & Marquez
Julien Le Dem
CTO & Co-Founder Datakin
@J_

AGENDA
● The need for metadata
● OpenLineage: open standard for metadata and lineage collection
● Marquez: a reference implementation

Building a healthy data ecosystem
Team A Team B
Team C

Today: Limited context
● What is the data source?
● What is the schema?
● Who is the owner?
● How often is it updated?
● Where is it coming from?
● Who is using the data?
● What has changed?

Maslow’s Data hierarchy of needs
New Business Opportunities
Business Optimization
Data Quality
Data Freshness
Data Availability

OpenLineage contributors
Creators and contributors from major open source projects involved

Purpose
Define an open standard for metadata and lineage collection by instrumenting data pipelines as they are running.

Problem
Before:
● Duplication of effort: each project has to instrument all jobs
● Integrations are external and can break with new versions
With OpenLineage:
● The integration effort is shared
● Integrations can be pushed into each project: no need to play catch-up

OpenLineage scope
[Diagram. Labels: Integrations (Warehouse, Schedulers, ...); Metadata and lineage collection standard; HTTP client; Backend (Kafka topic, Graph db); Consumers (Kafka client, GraphDB client, ...). Part of the diagram is marked "Not in scope".]

Core Model
● JSONSchema spec
● Consistent naming:
○ Jobs:
Example: scheduler.job.task
○ Datasets:
Example: instance.schema.table
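
The naming convention above can be sketched in a few lines of Python. This is only an illustration of the dotted patterns shown on the slide (scheduler.job.task and instance.schema.table); the helper names `qualified_name`, `job_name` and `dataset_name`, and the sample values, are hypothetical and not part of the spec.

```python
def qualified_name(*parts: str) -> str:
    """Join name parts with dots, rejecting empty parts."""
    if not parts or any(not part for part in parts):
        raise ValueError("every name part must be a non-empty string")
    return ".".join(parts)


def job_name(scheduler: str, job: str, task: str) -> str:
    """Build a job name following the scheduler.job.task pattern."""
    return qualified_name(scheduler, job, task)


def dataset_name(instance: str, schema: str, table: str) -> str:
    """Build a dataset name following the instance.schema.table pattern."""
    return qualified_name(instance, schema, table)


# Sample values are made up for the illustration.
example_job = job_name("airflow", "daily_orders", "load")
example_dataset = dataset_name("warehouse", "public", "orders")
print(example_job)
print(example_dataset)
```

Building every name through one helper means two integrations that observe the same table or task produce the same string, which is what makes the events joinable later.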

Protocol
● Asynchronous events: a unique run id identifies a run and correlates its events
○ Run Start event
■ source code version
■ run parameters
○ Run Complete event
■ input dataset
■ output dataset version and schema
● Configurable backend
○ Kafka
○ HTTP
○ ...
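
A minimal sketch of the protocol described above, assuming a pluggable backend and two events that share one run id. The field names (`eventType`, `runId`, `sourceCodeVersion`, and so on), the `Backend` interface and the `InMemoryBackend` class are assumptions made for this illustration; they are not taken from the OpenLineage spec, and a real deployment would send to Kafka or HTTP instead of a list.

```python
import json
import uuid
from datetime import datetime, timezone
from typing import Protocol


class Backend(Protocol):
    """Anything that can receive an event (hypothetical interface)."""

    def emit(self, event: dict) -> None: ...


class InMemoryBackend:
    """Stand-in for a Kafka or HTTP backend; it only records serialized events."""

    def __init__(self):
        self.sent = []

    def emit(self, event: dict) -> None:
        self.sent.append(json.dumps(event))


def _event(event_type: str, run_id: str, job: str, **details) -> dict:
    # Every event carries the run id so a consumer can correlate them later.
    return {
        "eventType": event_type,
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "runId": run_id,
        "job": job,
        **details,
    }


def run_start(backend: Backend, job: str, source_version: str, parameters: dict) -> str:
    """Emit the start event and return the newly generated run id."""
    run_id = str(uuid.uuid4())
    backend.emit(_event("START", run_id, job,
                        sourceCodeVersion=source_version, parameters=parameters))
    return run_id


def run_complete(backend: Backend, run_id: str, job: str, inputs: list, outputs: list) -> None:
    """Emit the completion event for an existing run id."""
    backend.emit(_event("COMPLETE", run_id, job, inputs=inputs, outputs=outputs))


backend = InMemoryBackend()
run_id = run_start(backend, "airflow.daily_orders.load", "abc123", {"date": "2021-01-01"})
run_complete(
    backend, run_id, "airflow.daily_orders.load",
    inputs=["warehouse.public.orders"],
    outputs=[{"name": "warehouse.public.daily_orders", "version": "v2",
              "schema": [{"name": "order_id", "type": "BIGINT"}]}],
)
events = [json.loads(raw) for raw in backend.sent]
```

Because the two events are sent independently, the consumer never needs them in one request; it groups on the run id to reconstruct the run.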

Facets
● Extensible:
Facets are atomic pieces of metadata identified by a unique name that can be
attached to the core entities.
● Decentralized:
Prefixes in facet names allow the definition of custom facets that can be
promoted to the spec at a later point.
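
One way to picture facets is as named entries in a dictionary attached to an entity. The sketch below assumes a custom facet is named `<prefix>_<name>` and that promotion simply drops the prefix; both conventions, along with the `Entity` class and the helper names, are hypothetical and only illustrate the idea on the slide.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class Entity:
    """A core entity (job, dataset or run) carrying facets keyed by unique name."""
    name: str
    facets: Dict[str, Dict[str, Any]] = field(default_factory=dict)


def attach_facet(entity: Entity, facet_name: str, payload: Dict[str, Any]) -> None:
    # A facet is atomic: attaching under an existing name replaces it as a whole.
    entity.facets[facet_name] = payload


def custom_facet_name(prefix: str, name: str) -> str:
    """Name a custom facet with its owner's prefix (assumed convention)."""
    return f"{prefix}_{name}"


def promote(entity: Entity, prefix: str, name: str) -> None:
    """Rename a prefixed custom facet to its unprefixed, standard name."""
    entity.facets[name] = entity.facets.pop(custom_facet_name(prefix, name))


orders = Entity("warehouse.public.orders")
attach_facet(orders, "schema", {"fields": [{"name": "order_id", "type": "BIGINT"}]})
attach_facet(orders, custom_facet_name("acme", "stats"), {"rowCount": 1200})
facet_names_before = sorted(orders.facets)
promote(orders, "acme", "stats")
facet_names_after = sorted(orders.facets)
```

The prefix keeps independently defined facets from colliding, so a team can ship a custom facet without waiting for agreement on the shared spec.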

● Data Platform built around Marquez
● Integrations
○ Ingest
○ Storage
○ Compute
[Diagram labels: Metadata; Ingest, Storage, Compute; Streaming, Batch/ML; Flink, Airflow, Kafka, Iceberg / S3, BI; OpenLineage]

Marquez: Data model
[Diagram: the entities Source, Dataset, Dataset Version, Job, Job Version and Run, connected by one-to-many (1 to *) relationships.]
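
The entities named in the data model can be sketched as plain Python dataclasses. The slide text only preserves the entity names and the fact that they are linked by 1-to-* relationships, so the direction of each relationship below (a source has many datasets, a dataset has many versions, a job has many versions, a job version has many runs) is an assumption of this sketch, as are all field names.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DatasetVersion:
    version: str


@dataclass
class Dataset:
    name: str
    versions: List[DatasetVersion] = field(default_factory=list)  # assumed 1 to *


@dataclass
class Source:
    name: str
    datasets: List[Dataset] = field(default_factory=list)  # assumed 1 to *


@dataclass
class Run:
    run_id: str


@dataclass
class JobVersion:
    version: str
    runs: List[Run] = field(default_factory=list)  # assumed 1 to *


@dataclass
class Job:
    name: str
    versions: List[JobVersion] = field(default_factory=list)  # assumed 1 to *


# Hypothetical sample data wiring the entities together.
orders = Dataset("warehouse.public.orders", [DatasetVersion("v1"), DatasetVersion("v2")])
warehouse = Source("warehouse", [orders])
load_job = Job("airflow.daily_orders.load", [JobVersion("abc123", [Run("run-1"), Run("run-2")])])
total_runs = sum(len(job_version.runs) for job_version in load_job.versions)
```

Versioning both jobs and datasets is what lets a later question such as "which code version produced this dataset version?" be answered from metadata alone.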

API
● OpenLineage and Marquez standardize metadata collection
○ Job runs
○ Parameters
○ Version
○ Inputs / outputs
● Datakin enables
○ Understanding operational dependencies
○ Impact analysis
○ Troubleshooting: What has changed since the last time it worked?
Datakin leverages Marquez metadata
[Diagram labels: Lineage analysis; Graph; Integrations]
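
Impact analysis over collected lineage can be illustrated as a graph traversal: given edges from inputs to jobs and from jobs to outputs, everything reachable downstream of a changed dataset is potentially affected. This is a generic breadth-first search written for illustration, not a description of how Datakin or Marquez implement it; the function name `downstream` and the sample edges are hypothetical.

```python
from collections import defaultdict, deque
from typing import Iterable, List, Tuple


def downstream(edges: Iterable[Tuple[str, str]], start: str) -> List[str]:
    """Return every node reachable from `start`, in breadth-first order."""
    graph = defaultdict(list)
    for source, target in edges:
        graph[source].append(target)

    visited = {start}
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in visited:
                visited.add(neighbour)
                order.append(neighbour)
                queue.append(neighbour)
    return order


# Edges alternate dataset -> job -> dataset, using the dotted naming scheme.
lineage = [
    ("db.public.orders", "airflow.etl.clean"),
    ("airflow.etl.clean", "db.public.clean_orders"),
    ("db.public.clean_orders", "airflow.report.daily"),
    ("db.public.customers", "airflow.report.daily"),
    ("airflow.report.daily", "db.public.daily_report"),
]
impacted = downstream(lineage, "db.public.orders")
print(impacted)
```

Running the same traversal on reversed edges gives the upstream view, which is the starting point for asking what changed since a job last worked.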

Join the conversation
OpenLineage:
Github: github.com/OpenLineage
Slack: OpenLineage.slack.com
Twitter: @OpenLineage
Email: groups.google.com/g/openlineage
Marquez:
Github: github.com/MarquezProject/marquez
Slack: MarquezProject.slack.com
Twitter: @MarquezProject

Thank You
*we’re hiring! jobs@datakin.com

Data pipelines observability: OpenLineage & Marquez

1. Data Pipelines Observability OpenLineage & Marquez Julien Le Dem CTO & Co-Founder Datakin @J_

2. AGENDA ● The need for metadata ● OpenLineage: open standard for metadata and lineage collection ● Marquez: a reference implementation

3. The need for Metadata

4. Building a healthy data ecosystem Team A Team B Team C

5. ● What is the data source? ● What is the schema? ● Who is the owner? ● How often is it updated? ● Where is it coming from? ● Who is using the data? ● What has changed? Today: Limited context

6. Maslow’s Data hierarchy of needs New Business Opportunities Business Optimization Data Quality Data Freshness Data Availability

7. OpenLineage

8. OpenLineage contributors Creators and contributors from major open source projects involved

9. Purpose Define an open standard for metadata and lineage collection by instrumenting data pipelines as they are running.

10. Purpose: EXIF for data pipelines

11. Problem Before: ● Duplication of effort: each project has to instrument all jobs ● Integrations are external and can break with new versions With OpenLineage: ● The integration effort is shared ● Integrations can be pushed into each project: no need to play catch-up

12. Open Lineage scope Not in scope Backend Integrations Metadata and lineage collection standard Warehouse Schedulers ... Kafka topic Graph db HTTP client Consumers Kafka client GraphDB client ...

13. Core Model ● JSONSchema spec ● Consistent naming: ○ Jobs: Example: scheduler.job.task ○ Datasets: Example: instance.schema.table

14. Protocol ● Asynchronous events: a unique run id identifies a run and correlates its events ○ Run Start event ■ source code version ■ run parameters ○ Run Complete event ■ input dataset ■ output dataset version and schema ● Configurable backend ○ Kafka ○ HTTP ○ ...

15. Facets ● Extensible: Facets are atomic pieces of metadata identified by a unique name that can be attached to the core entities. ● Decentralized: Prefixes in facet names allow the definition of Custom facets that can be promoted to the spec at a later point.

16. Facet examples Dataset: - Stats - Schema - Version - Column level lineage Job: - Source code - Dependencies - params - Source control - Query plan - Query profile Run: - Schedule time - Batch id

18. Metadata: Ingest Storage Compute Streaming Batch/ML ● Data Platform built around Marquez ● Integrations ○ Ingest ○ Storage ○ Compute Flink Airflow Kafka Iceberg / S3 BI OpenLineage

19. Marquez: Data model [Diagram: the entities Source, Dataset, Dataset Version, Job, Job Version and Run, connected by one-to-many (1 to *) relationships.]

20. API ● Open Lineage and Marquez standardize metadata collection ○ Job runs ○ Parameters ○ Version ○ Inputs / outputs ● Datakin enables ○ Understanding operational dependencies ○ Impact analysis ○ Troubleshooting: What has changed since the last time it worked? Datakin leverages Marquez metadata Lineage analysis Graph Integrations

21. Join the conversation OpenLineage: Github: github.com/OpenLineage Slack: OpenLineage.slack.com Twitter: @OpenLineage Email: groups.google.com/g/openlineage Marquez: Github: github.com/MarquezProject/marquez Slack: MarquezProject.slack.com Twitter: @MarquezProject

22. Thank You *we’re hiring! jobs@datakin.com
