Atlan <> Airflow
Integration
Case Study
Created by: Subrat Kumar Dash
TABLE OF CONTENTS
INTRODUCTION TO OPPORTUNITY
1. ATLAN
2. AIRFLOW
3. OPENLINEAGE
4. PARTNERSHIPS WITH ASTRONOMER
5. REFERENCE LINKS
INTRODUCTION TO OPPORTUNITY
Overview
This quarter, Atlan is looking to deeply integrate with Airflow using OpenLineage.
At Atlan, we strongly believe in communicating our ideas across teams asynchronously via a Notion page
(typically 500-1,500 words). This marks the first step towards kicking off new initiatives and
opens a space for conversation, questions, and comments from the team. The team can also
learn about each initiative through the additional resources and reading materials available
on the page.
Your task
Write a document explaining why building a deeper integration between Atlan and Airflow is
important, what this would mean for our product, and how it will create impact for customers
and, eventually, the business. Which use cases should we focus on to deliver impact, and what
is our path forward?
From a platform perspective, we’ll use the OpenLineage framework to integrate with Airflow.
OpenLineage
Imagine you had to explain OpenLineage to an internal team of product managers, engineers, and
leaders across Sales Engineering, Product, Engineering, and Partnerships. Create some
material to onboard everyone on the OpenLineage concept.
What, Why & Impact
1. What does the ideal Airflow integration look like for Atlan?
2. Why is the integration important?
3. What is the market asking for with respect to Airflow, and what are the biggest gaps that a product
like Atlan can fill?
4. What are the use cases and user stories for an Airflow integration for Atlan? Bonus if you
can find real-world examples.
5. Who will be the personas in the data team that will use such an integration?
6. What would this mean for our product, and how will it create an impact on customers?
Partnerships
Astronomer provides a hosted Airflow product. Think through how Atlan can run a partnership
with them. What would the benefits of the partnership be for both companies? What would the ideal
product integration look like? Please add any other relevant pointers you would consider if you were
driving this partnership as a product person at Atlan.
1. ATLAN
Atlan: The Active Metadata Platform for Modern Data Teams
What is Atlan?
The Netflix for Data Teams
• Atlan is an active metadata platform that helps modern data teams discover,
understand, trust, and collaborate on data assets.
• It provides a single source of truth for all of your data, and it makes it easy to find, share,
and understand data.
• Atlan also integrates with a wide range of other tools, so you can use it to manage your
entire data stack.
Why use Atlan?
Atlan helps you to:
• Search across your data universe using natural language, business context, or SQL syntax.
• Visualize column-level relationships from your data sources to your BI dashboards.
• Build trust in data through effortless compliance, context-rich trust signals, and role-
based access controls.
• Empower self-service by connecting the metrics and data that matter to your business.
• Bring context where your humans of data work — in BI dashboards, Github, Slack, Jira,
and more.
• Free up your day with automated metadata enrichment across your data landscape.
How does Atlan work?
• Atlan ingests metadata from your data sources and stores it in a central repository.
• You can then use Atlan's search and discovery features to find data assets, and you can
use Atlan's lineage tracking and auditing features to understand how data is used.
• Atlan also allows you to collaborate with your team on data assets, and it allows you to
automate your data workflows.
Who uses Atlan?
• Atlan is used by a variety of companies, including Flipkart, Ola, and Zomato.
• It is also used by several government agencies and educational institutions.
• Other customers include WeWork, Unilever, Nasdaq, Juniper, and more.
Atlan Product List
• Discovery & Catalog
• Data Asset 360
• Column-level Lineage
• Active Data Governance
• Embedded Collaboration
• Glossary
New in Atlan
Atlan AI: Ushering in the Future of Working with Data
• Atlan AI is a new product from Atlan that uses artificial intelligence to help data teams
discover, understand, and use data more effectively.
• Atlan AI uses a variety of AI techniques, including natural language processing, machine
learning, and graph analysis, to extract insights from data.
• Atlan AI can help data teams to:
o Identify and understand the relationships between different data sets
o Discover hidden patterns and trends in data
o Generate insights that can be used to make better business decisions
• Atlan AI is still in beta, but it has the potential to revolutionize the way that data teams
work.
2. AIRFLOW
What is Airflow?
Apache Airflow™ is an open-source platform to programmatically author, schedule, and
monitor workflows. These workflows essentially represent your data pipelines.
Why do we need Airflow?
(Example of ELT process)
Imagine you have the following data pipeline with three tasks: extract, load, and transform,
running every day at a specific time. At each step, you interact with an external tool or system,
such as an API for extract, Snowflake for load, and dbt for transform.
Now, consider the scenarios where the API is no longer available, there's an error in Snowflake,
or you made a mistake in your DBT transformations. As you can see, any of these steps could
result in a failure, and you need a tool to manage these potential issues.
Additionally, what if you have not just one data pipeline but a hundred of them? Managing this
manually would be a nightmare.
That's where Airflow comes in. With Airflow, you can automatically handle failures, even when
dealing with hundreds of data pipelines and millions of tasks.
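To make this concrete, here is a minimal sketch of the ELT pipeline above as an Airflow DAG (assuming Airflow 2.x; the task bodies, names, and schedule are illustrative placeholders):

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    ...  # call the source API here


def load():
    ...  # load the extracted data into Snowflake here


def transform():
    ...  # trigger the dbt transformations here


with DAG(
    dag_id="daily_sales_elt",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",  # run every day
    catchup=False,
    default_args={
        "retries": 3,  # retry failed tasks automatically
        "retry_delay": timedelta(minutes=5),
    },
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)

    # Linear dependency chain: extract -> load -> transform
    extract_task >> load_task >> transform_task
```

If any task fails, Airflow retries it on the configured policy and surfaces the failure in the UI, which is what makes managing hundreds of such pipelines tractable.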
What are the benefits of using Airflow?
Airflow offers numerous benefits that make it a powerful tool for orchestrating data pipelines.
1. Dynamic and Pythonic: Airflow's workflows are coded in Python, making it dynamic and
flexible. Unlike static languages like XML used in tools like Oozie, Airflow leverages the
power and versatility of Python for defining workflows.
2. Scalability: With Airflow, you can execute as many tasks as needed, enabling scalability
to handle large and complex data pipelines.
3. User Interface: Airflow provides a visually appealing and intuitive user interface, allowing
users to monitor tasks and data pipelines easily.
4. Extensibility: Airflow is highly extensible, enabling users to add their own plugins and
custom functionalities without waiting for external updates. This flexibility empowers
users to tailor Airflow to their specific needs.
5. Orchestration: As an orchestrator, Airflow ensures that tasks are executed in the right
order and at the right time, maintaining the flow of data through the pipeline effectively.
What is Airflow not?
Airflow is neither a data streaming solution nor a data processing framework. Heavy data
processing should not happen inside Airflow itself; instead, your tasks should delegate
processing to external systems through the appropriate operators.
Core Components of Airflow
1. DAG (Directed Acyclic Graph): The fundamental concept and core building block
used to define and manage workflows. A DAG represents a collection of tasks and their
dependencies, and it ensures tasks are executed in order without forming
loops, allowing efficient scheduling and automation of workflows.
2. The Airflow scheduler parses DAGs, verifies their schedule interval, and if the schedule
is due, it schedules the tasks within the DAG for execution by passing them to the
Airflow workers.
3. The Airflow workers pick up tasks scheduled for execution and carry out the actual
execution, making them responsible for "doing the work."
4. The Airflow webserver visualizes the DAGs parsed by the scheduler and serves as the
main interface for users to monitor DAG runs and their outcomes.
5. The metadata store in Airflow is a database compatible with SQLAlchemy. This
database stores metadata related to your data pipelines, tasks, Airflow users, and other
relevant information.
Schematic overview of the process involved in developing and executing pipelines as DAGs using Airflow:
1. A data engineer writes the workflow as a DAG in a Python file and places it in the DAG folder/directory.
2. The Airflow scheduler parses the DAG files (tasks, dependencies, and schedule interval) in a loop, waiting a few seconds between scans. When a DAG's schedule is due, it checks dependencies and adds the DAG's tasks to the execution queue.
3. Airflow workers pick up scheduled tasks from the execution queue, execute them, and report the task results.
4. The Airflow metastore (a SQLAlchemy-compatible database) stores the serialized DAGs and the task results.
5. The Airflow webserver retrieves the DAGs and task results from the metastore and provides the UI that users rely on to monitor execution.
Growth of Airflow and Opportunity for Atlan
(Figure: growth of Airflow and its popularity.)
A few critical points to consider from Atlan's point of view, based on the Airflow User Survey 2022:
1. User Roles: More than half of Airflow users are Data Engineers (54%). Other active users
include Solutions Architects (13%), Developers (12%), DevOps (6%), and Data Scientists
(4%).
So what? - The user roles described in the Airflow User Survey align with the potential
user personas at Atlan.
2. Company Size: Airflow is popular in larger companies, with 64% of users working in
companies with 200+ employees, showing an 11% increase from 2020.
So what? - Atlan can cater to the data management requirements of these bigger
companies, increasing its market reach and attracting potential customers who are
already invested in Airflow.
3. Willingness to Recommend: 65.9% of surveyed Airflow users are willing to recommend
Apache Airflow, continuing the strong willingness to recommend seen in previous years
(85.7% in 2019, 92% in 2020).
So what? For Atlan, this information is crucial as it indicates that integrating with Airflow
can potentially expose Atlan to a user base that already finds value in Airflow. Atlan can
tap into this positive sentiment and further increase the likelihood of recommendation
among its users.
4. Deployments: Most Airflow users (85%) have between 1 and 7 active Airflow instances,
and 62.5% have between 11 and 250 DAGs in their largest Airflow instance.
So what? By catering to users who manage multiple instances and DAGs, Atlan's
integration can enhance data orchestration capabilities, and improve efficiency for
users who deal with large-scale data operations.
5. Airflow Versions: Close to 85% of users use one of the Airflow 2 versions. While 9.2%
still use version 1.10.15, most users on Airflow 1 are planning migration to Airflow 2
soon.
So what? - Integrating with the latest Airflow 2 versions ensures that Atlan can address
most of the market with less integration effort.
6. Monitoring and Executors: Users show growing interest in monitoring, particularly in
external monitoring services (40.7%) and in information from the Airflow metadata
database (35.7%). The most common executors used are Celery (52.7%) and Kubernetes (39.4%).
So what? - By integrating as an external data catalog and surfacing metadata from the
Airflow metadata database, Atlan can empower users with comprehensive
insights into the performance and health of their data workflows.
7. Usage and Lineage: 81.3% of Airflow users run Airflow without any customization.
XCom (69.8%) is the most popular method for passing inputs and outputs between
tasks (see the sketch below). Lineage is a new topic for many users: 47.5% are interested if
supported by Airflow, 29% are not familiar with it, and 13% are not concerned about data lineage.
So what? By offering robust data lineage capabilities as part of the integration, Atlan can
meet the demands of users interested in data lineage and encourage adoption among
those who may not be familiar with the concept.
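For readers unfamiliar with XCom, here is a hedged sketch of how small values pass between tasks using Airflow 2's TaskFlow API (the DAG and task names are hypothetical):

```python
from datetime import datetime

from airflow.decorators import dag, task


@dag(start_date=datetime(2023, 1, 1), schedule_interval="@daily", catchup=False)
def xcom_example():
    @task
    def extract_row_count() -> int:
        return 42  # the return value is pushed to XCom automatically

    @task
    def report(count: int) -> None:
        print(f"rows extracted: {count}")  # the argument is pulled from XCom

    report(extract_row_count())  # also declares the task dependency


xcom_example()
```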
Limitations of Airflow
1. High Complexity and Non-Intuitive: Airflow's "pipeline/workflow as code" approach is
powerful but can lead to complex, non-intuitive development. Creating tasks in
DAGs and sequencing them into workflows requires decent Python programming skills.
Some users question the need to express workflows as code instead of configuration, as
coded workflows can make debugging in production counterintuitive.
2. Adoption & Steep learning curve: Airflow's adoption has a steep learning curve for data
engineers due to its complex framework, requiring them to understand the scheduler,
executor, and workers. Additionally, the extensive list of configuration parameters, while
offering flexibility, demands significant time investment.
3. No pipeline versioning: Airflow lacks pipeline versioning, so changes to workflow
metadata are not stored. This makes it challenging to track workflow history and perform
pipeline rollbacks, especially in production environments.
4. Difficult debugging: Debugging Airflow can be a nightmare as workflows become complex,
with numerous interdependencies and potential issues. The use of operators and templates
in DAGs, their interchangeability, and the mix of executing code and orchestrating workflows
add further complexity. Local and remote executors pose their own challenges, making root
cause analysis daunting in such intricate environments.
5. Rudimentary data lineage: Data lineage in Airflow is rudimentary and marked as "experimental -
subject to change" in the documentation (see the sketch below). As pipelines evolve, it requires
rechecking and possibly reworking the DAG. Airflow lacks native information about jobs, owners,
or repositories, making it challenging to understand downstream impacts from upstream changes.
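For context, this is roughly what Airflow's native, experimental lineage support looks like: inlets and outlets are declared by hand on each operator and carry no job, owner, or repository context (a hedged sketch; the DAG and S3 paths are hypothetical):

```python
from datetime import datetime

from airflow import DAG
from airflow.lineage.entities import File
from airflow.operators.bash import BashOperator

raw_orders = File(url="s3://sales-metrics/orders.csv")
clean_orders = File(url="s3://sales-metrics/orders_clean.csv")

with DAG(dag_id="lineage_demo", start_date=datetime(2023, 1, 1), schedule_interval=None):
    transform_orders = BashOperator(
        task_id="transform_orders",
        bash_command="echo 'transform orders'",
        inlets=[raw_orders],     # declared manually, so the declarations can
        outlets=[clean_orders],  # drift from what the task actually touches
    )
```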
How does Atlan fill the Airflow gap without diverting from its strategy?
Scenario:
An executive at a large e-commerce company relies on a real-time sales dashboard to monitor
daily sales performance across different product categories and regions. One morning, while
reviewing the dashboard, the executive notices a sudden drop in sales for a specific product
category in a particular region. Concerned about the potential impact on revenue, the executive
immediately raises the issue with the data team to investigate and resolve the problem.
Before Atlan Integration:
The data team, consisting of data analysts and engineers, starts investigating the issue. They
use Apache Airflow to manage the data pipelines responsible for fetching, processing, and
loading sales data into the dashboard. However, tracking the issue becomes challenging due
to the complex workflow and lack of pipeline context. The data team spends several hours
manually debugging the pipelines and pulling logs from Airflow, trying to pinpoint the exact
cause of the sales drop.
Solution with Atlan Integration:
Dashboard Alert and Pipeline Context: With Atlan's integration with Apache Airflow, the data
team has set up dashboard alerts that automatically signal any issues in the pipelines. As
soon as the sales drop is detected, an alert appears on the dashboard, notifying the executive
of the anomaly.
Pipeline Status: The executive opens the Atlan data asset profile associated with the sales
dashboard. Within the profile, the executive can quickly check the pipeline status and metrics,
including data freshness and run schedule. This information provides immediate context
about the pipelines' health and helps the executive understand if the issue is due to delayed
data updates or other pipeline-related problems.
Data Lineage Visualization: The data team leverages Atlan's data lineage visualization feature
to track the sales data flow from source systems to the dashboard. They quickly identify that a
change in the data transformation pipeline caused the sales drop in the specific region. With
this insight, the data team can focus their efforts on debugging and resolving the issue more
effectively.
Collaborative Data Analysis: Atlan's integration allows the data team to collaborate more
efficiently. Data analysts, scientists, and business users can access pipeline context and data
lineage information within Atlan, enabling them to understand the issue in detail and contribute
to the resolution.
Data Security and PII Data Masking: With Atlan's data security and governance features, the
data team ensures that sensitive Personally Identifiable Information (PII) data is securely
handled throughout the data pipeline. PII data masking is applied to protect customer privacy
and comply with data regulations. The executive can rest assured that data security measures
are in place to safeguard sensitive information.
Atlan and Airflow together create an ecosystem of trust and transparency for data teams,
fostering better data engineering experiences focused on building knowledge and confidence
in their data.
Through this integration, Atlan enriches data assets with essential pipeline context. Metadata
from Airflow pipelines can be seamlessly shared with Atlan data asset profiles, granting
access to data analysts, scientists, and business users. This transparency enables data teams
and consumers to stay informed about the status of each data asset's associated pipeline.
(Figure: Integration between Atlan and Apache Airflow. Metadata flows from upstream sources such as CRM, ERP, Billing, and Marketing systems, through the data warehouse and the Airflow pipeline, to downstream Reporting, Analytics, and Data Mining. Metadata from the Airflow pipeline is the current gap in Airflow that can be filled by the Atlan integration.)
Who will be the personas in the data team that’ll use such an integration?
• Data Engineers
• Data Analysts
• Data Scientists
• Business Executives (Business Users and Stakeholders)
• Data Steward Team
Persona: Data Engineer
Job: As a Data Engineer, the primary responsibility is to design, build, and maintain data
pipelines and infrastructure to ensure the efficient and reliable flow of data within the organization.
Goal: The main goal of a Data Engineer is to create scalable and robust data pipelines that
support the organization's data-driven initiatives. They aim to deliver high-quality data that is
accurate, consistent, and accessible for analytics and business decision-making.
Problem: Data Engineers often face challenges in managing complex data workflows and
maintaining data quality. Common problems include:
• Data Quality and Integrity: Ensuring data quality and integrity throughout the pipeline is
crucial. Inaccurate or incomplete data can lead to incorrect analyses and decisions.
• Monitoring and Debugging: Identifying and resolving issues in data pipelines quickly is
essential to minimize downtime and maintain data reliability.
Without data lineage functionality: Without a robust solution in place, Data Engineers may
struggle to meet their goals effectively. Manual monitoring, debugging, and handling of data
quality issues are time-consuming and prone to human error. Lack of transparency and
visibility into pipeline health can delay issue detection and resolution.
With Atlan's data lineage functionality: With Atlan's integration, the Data Engineer gains
immediate visibility into pipeline context and status. They access the data asset profile
associated with the pipeline, which shows relevant metrics, data freshness, and lineage
information. Leveraging data lineage visualization, the Data Engineer quickly pinpoints the
source of a slowdown to a specific data source; for example, the issue is traced to a schema
change in the data source that impacts the data transformation process.
3. OPENLINEAGE
What Is Data Lineage?
Data Lineage is the process of tracking and documenting relationships among datasets within
an organization's ecosystem. It traces data flow from source to destination, establishing
parent/child relationships. This transparency aids data governance, compliance, quality
management, and issue troubleshooting, ensuring data reliability and accuracy.
Why do we need data lineage?
The data hierarchy of needs
The data hierarchy starts with gaining access to interesting data; the focus then shifts to real-
time updates. Availability and freshness of data become essential, followed by ensuring data
quality. Consistently supplied with good data, an organization can optimize and improve
processes, identify anomalies, and drive business benefits. Data lineage is vital in achieving
these higher-order benefits.
(Figure: the data hierarchy of needs pyramid, from the base up: data availability, data freshness, data quality, business opportunity, and new opportunity.)
Misunderstanding of Data Democratization
We are in the era of data democratization. Democratizing data production and consumption
empowers individuals in the organization to work with data independently, but without the
right foundations it leads to a fragmented and chaotic data ecosystem. This fragmentation
makes it hard to understand critical information about the data, such as its source, schema,
ownership, creation date, and update frequency. Misunderstanding what data democratization
actually requires turns data into a "black box," making it difficult to trace data origins,
transformations, and usage. While the democratization of data is beneficial for fostering a
healthy data ecosystem, the absence of data lineage can hinder data transparency and impact
decision-making processes.
Principle of Data Democratization
https://www.linkedin.com/feed/update/urn:li:activity:7078983264836747264/
What is OpenLineage?
OpenLineage is a framework that enables transparent and standardized data lineage tracking,
providing clear insights into data origins, transformations, and dependencies throughout the
data ecosystem.
OpenLineage is a community-driven initiative with diverse stakeholders interested in data
lineage. Instead of being restricted to a single vendor, it aims to establish an open standard for
data lineage across the industry. The goal is to create a common language for discussing data
movement across various tools, benefiting everyone involved in the ecosystem. Its foremost
purpose is to define an open standard for collecting lineage information collaboratively.
How did OpenLineage start?
• In 2017, Julien Le Dem, one of the co-founders of OpenLineage, faced challenges at
WeWork in understanding data pipeline dependencies. To address this, he developed
Marquez, aiming to create a "Google Maps" for data pipelines. Marquez was later open
sourced by WeWork to benefit other companies.
• Marquez is an open-source metadata service for tracking the data lineage of datasets.
Marquez aims to provide a central metadata repository and API to track the provenance
and lineage of datasets, making it easier for data engineers and data scientists to
understand and trace data flow across systems.
• In 2020, the team behind Marquez shifted focus to data observability and launched
OpenLineage, an independent open-source initiative focused on providing a common
data lineage standard and framework for the industry. It aims to establish a
standardized way of collecting, storing, and querying data lineage information, enabling
seamless integration across various data tools and frameworks.
How Is Data Lineage Solved?
Traditional/Google Approach: In this approach, the source code repository is scanned, and
queries are extracted from the code. The analysis then predicts the potential data relationships
and interactions if these queries were to run. This method is beneficial for design-time lineage,
helping understand data structures during application development and predicting data
footprints. However, it does not provide observational lineage since it does not track actual
data execution or runtime information.
Passive Data Lineage: This approach involves integrating with the data store or warehouse
to regularly analyze queries, logs, and access information. The data lineage relationships are
extracted from the data source and stored in a central metadata repository. This method is
commonly used for governance purposes.
(Figure: in the traditional approach, all source code is crawled and parsed; in the passive approach, the data store is scanned; both feed a central data lineage repository.)
Active Data Lineage: Instead of directly accessing the data source, this approach
integrates with the data orchestration system. It observes job executions and tracks their
impact on data. Like passive lineage, the information is reported to a central metadata
repository.
Both approaches have a central repository for data lineage metadata, with the key difference
being where the lineage information is collected from - either from the data source (passive) or
the data orchestration system (active). The choice depends on specific scenarios and
requirements, and both approaches offer valuable insights for data lineage analysis and
management.
How does OpenLineage benefit the ecosystem?
Before OpenLineage:
• Each project needs to create custom metadata collection integrations, leading to
duplicated efforts.
• External integrations can break with updates to the scheduler or data processing
framework, requiring backward compatibility checks.
https://openlineage.io/docs/
With OpenLineage:
• Integration efforts are shared across projects, reducing duplication.
• Integrations can be seamlessly integrated into the underlying scheduler and/or data
processing framework, eliminating compatibility concerns and the need for constant
catch-up.
https://openlineage.io/docs/
How does OpenLineage fit into the ecosystem?
A simpler way to look at the OpenLineage architecture is to break it down into three essential
components.
Producers: The components that capture and send metadata about data pipelines and their
dependencies to the OpenLineage server.
Backend: The infrastructure that helps you store, process, and analyze metadata.
Consumers: The components that fetch metadata from the server and feed it to web-based
interfaces so that you can visualize the data lineage mapping.
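As a sketch of the consumer side, a client might query a Marquez backend's lineage API like this (a hedged example assuming Marquez runs locally on port 5000; the namespace and dataset names are hypothetical):

```python
import requests

MARQUEZ_URL = "http://localhost:5000"
# Node IDs follow the pattern "dataset:<namespace>:<name>" or "job:<namespace>:<name>"
node_id = "dataset:postgres://db.foo.com:metrics.sales"

resp = requests.get(
    f"{MARQUEZ_URL}/api/v1/lineage",
    params={"nodeId": node_id, "depth": 2},  # walk two hops in each direction
    timeout=10,
)
resp.raise_for_status()

# The response is a graph of JOB and DATASET nodes with in/out edges
for node in resp.json()["graph"]:
    print(node["type"], node["id"])
```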
Tech Stack
The OpenLineage tech stack comprises:
• Metadata Collection APIs: Integrates with diverse data sources and data pipeline tools
such as Spark, Airflow, and dbt.
• Lineage Server (Marquez): Serves as the central repository for storing metadata related
to lineage events.
• Web-based UI: Enables visualization of data lineage and dependencies.
• REST APIs: Facilitate capturing and querying metadata.
• Libraries: Offers libraries for common programming languages to facilitate integration
and usage within different environments.
The data model of OpenLineage revolves around the life cycle of a job. Every job run undergoes
a series of events that provide observational lineage updates. Here is a breakdown of the life
cycle and the corresponding events:
• Job Start: When a job starts, an observational lineage update is sent with an event type
of "start." This update includes the start time and the producer's information.
• Job Running/Abort/Failure: Throughout the job's execution, additional metadata about input
datasets can be sent along the life cycle to augment the observational lineage updates. If the
run is aborted or fails, a corresponding "abort" or "fail" event is sent instead of a completion event.
• Job Completion: When the job is complete, a "complete" event is sent, indicating the
end time of the job run.
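A minimal sketch of this life cycle using the openlineage-python client, assuming a Marquez backend on localhost:5000 (the namespace, job, and dataset names are illustrative):

```python
from datetime import datetime, timezone
from uuid import uuid4

from openlineage.client import OpenLineageClient
from openlineage.client.run import Dataset, Job, Run, RunEvent, RunState

client = OpenLineageClient(url="http://localhost:5000")
PRODUCER = "https://example.com/my-pipeline"  # identifies who emitted the event

run = Run(runId=str(uuid4()))
job = Job(namespace="my-namespace", name="daily_sales_elt")
output = Dataset(namespace="postgres://db.foo.com", name="metrics.sales")

# "start" event when the job run begins
client.emit(RunEvent(
    eventType=RunState.START,
    eventTime=datetime.now(timezone.utc).isoformat(),
    run=run,
    job=job,
    producer=PRODUCER,
))

# "complete" event, reporting the output dataset, when the run finishes
client.emit(RunEvent(
    eventType=RunState.COMPLETE,
    eventTime=datetime.now(timezone.utc).isoformat(),
    run=run,
    job=job,
    outputs=[output],
    producer=PRODUCER,
))
```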
Standardized Entities: The core model defines generic entities such as datasets, jobs, and runs,
which are uniquely identified using consistent naming strategies. These entities form the
foundation for capturing lineage information across the data ecosystem.
Facets: A facet is an atomic piece of metadata attached to one of the core entities. Facets
enable entity enrichment, making the core model highly extensible. Users can define and
include custom facets to capture additional metadata specific to their use case.
Examples of facets:
• For a dataset: stats, version, schema
• For a job: source code, dependencies, source control, query plan
• For a run: scheduled time, batch ID
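For illustration, a standard schema facet can be attached to a dataset like this (a sketch using openlineage-python's facet classes; the dataset and field names are hypothetical):

```python
from openlineage.client.facet import SchemaDatasetFacet, SchemaField
from openlineage.client.run import Dataset

orders = Dataset(
    namespace="s3://sales-metrics",
    name="orders.csv",
    facets={
        # the "schema" facet enriches the dataset with its column structure
        "schema": SchemaDatasetFacet(fields=[
            SchemaField(name="order_id", type="VARCHAR"),
            SchemaField(name="amount", type="DECIMAL"),
        ]),
    },
)
```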
Data model
In the OpenLineage data model:
Dataset: Represents a unit of data, like a table in a database or an object in cloud storage.
Changes to datasets occur when the corresponding job completes its writing process.
Job: Refers to a data pipeline process responsible for creating or consuming datasets. Jobs
may undergo evolution, and documenting these changes is vital for understanding the
pipeline's functioning.
Run: Signifies an executed instance of a job. It contains essential information, including the
start and completion (or failure) time of the job. Runs facilitate observation of job changes
over time.
How does it create lineage?
OpenLineage builds lineage based on correlations between jobs and datasets. When the first job
starts, OpenLineage observes the metadata about that job and the datasets it interacts with.
As subsequent jobs run, they may reference the same datasets, establishing correlations.
Lineage is established by linking jobs to the datasets they create or consume. The correlations
are based on dataset names, which makes well-defined and consistent naming conventions
essential for effective lineage tracking.
(Figure: an example lineage graph in which job runs such as prod.customer_segmentation and dev.run_unit_tests, e.g. run ID 1c0386aa-0979-41e3-9861-3a330623effa, are observed and correlated through shared dataset names such as postgres://db.foo.com/metrics.sales, orders, and s3://sales-metrics/orders.csv.)
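The correlation mechanism can be sketched in a few lines (names are illustrative):

```python
from openlineage.client.run import Dataset

# Both jobs refer to the *same* dataset identity (namespace + name).
orders = Dataset(namespace="postgres://db.foo.com", name="metrics.orders")

# Job A emits a COMPLETE event with outputs=[orders]; job B later emits a
# START event with inputs=[orders]. Because the namespace/name pair matches
# exactly, the backend draws the lineage edge: job A -> metrics.orders -> job B.
# A typo in either name would silently split the graph, which is why
# consistent naming conventions matter so much.
```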
OpenLineage features
OpenLineage offers a comprehensive set of features to collect, track, and analyze lineage
within your data ecosystem. While still in development, some key features include:
Metadata Capture:
• Technical metadata for datasets and jobs, including schema, version, ownership details,
run time, and job dependencies.
• Dataset-level and column-level data quality metrics, such as row count, byte size, null
count, distinct count, average, min, max, and quantiles.
• Column-level lineage tracing for Apache Spark data.
Seamless Integration:
• Integration with various ETL frameworks, data orchestration engines, data quality
engines, and data lineage tools.
• Support for popular data stores and warehouses, including Amazon S3, Amazon
Redshift, HDFS, Google BigQuery, PostgreSQL, Azure Synapse, and Snowflake.
• Integration with data orchestration tools like Apache Airflow, Apache Spark, and dbt.
• Compatibility with data lineage tools like Egeria.
These features empower data teams with enhanced visibility and control over their data
pipelines, enabling efficient data management and governance throughout the data lifecycle.
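As a hedged sketch of what enabling the Airflow integration involves (package and variable names as documented by OpenLineage at the time of writing; verify against your Airflow version):

```python
# Lineage capture in Airflow is configuration, not DAG code. With the
# openlineage-airflow package installed and two environment variables set,
# existing DAGs emit OpenLineage events automatically:
#
#   pip install openlineage-airflow
#   export OPENLINEAGE_URL=http://localhost:5000   # e.g. a Marquez backend
#   export OPENLINEAGE_NAMESPACE=my-namespace
#
# On Airflow 2.7+, the apache-airflow-providers-openlineage package plays the
# same role, configured through the [openlineage] section of airflow.cfg.
```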
Endless use cases of OpenLineage
1. Gain visibility into the dependencies between datasets and jobs; understanding how
data flows across the entire ecosystem makes data observability easy.
2. Quickly identify the root cause of data-related issues or errors by tracing lineage and
understanding the flow of data.
3. Map the impact of changes in datasets or jobs across the data ecosystem, enabling
proactive decision-making.
4. Perform precise backfills by understanding the lineage of affected datasets and jobs,
ensuring data consistency.
5. Track changes to datasets and jobs over time, facilitating effective change
management and version control.
6. Analyze historical data lineage to gain insights into data evolution and usage patterns.
7. Conduct automated audits of data pipelines, ensuring compliance and data governance
standards are met.
Adoption and Additional Info
• Amazon: https://aws.amazon.com/blogs/big-data/automate-data-lineage-on-amazon-mwaa-with-openlineage/
• Manta: https://openlineage.io/blog/manta-integration/
• "Berlin Buzzwords 2023: Column-level lineage is coming to the rescue”
• Getting Started: https://openlineage.io/getting-started/
• The lineage API: https://openlineage.io/blog/explore-lineage-api/
• Facets: https://openlineage.io/blog/dataquality_expectations_facet/
• Spark example: https://openlineage.io/blog/openlineage-spark/
4. PARTNERSHIPS WITH ASTRONOMER
What problems does Astronomer solve?
Astronomer solves several problems related to managing and operating Apache Airflow, a
popular open-source data orchestration tool.
• Infrastructure Management: Setting up, managing, and maintaining the underlying
infrastructure for Airflow can be complex and time-consuming. Astronomer provides a
managed service that takes care of infrastructure provisioning, ensuring that data
teams can focus on building and managing data pipelines rather than dealing with
infrastructure tasks.
• Scalability: As data pipelines and workflows grow in complexity and scale, managing
the scalability of Airflow becomes crucial. Astronomer deploys Airflow on Kubernetes,
which offers scalable and flexible infrastructure, allowing Airflow instances to handle
large volumes of data and complex workflows.
• Deployment and Setup: Astronomer offers a simplified and streamlined deployment
process for Airflow instances. With single-click provisioning, users can quickly set up
and get started with Airflow, reducing the time and effort required to configure and
launch Airflow environments.
• Monitoring and Support: Astronomer provides monitoring and support services to
ensure the smooth running of Airflow instances. This includes performance monitoring,
troubleshooting assistance, and timely support to address any issues that may arise
during pipeline execution.
• Security and Authentication: Astronomer provisions auxiliary services, including logging
and authentication, to enhance the security of Airflow instances and data processing.
This helps data teams ensure that their data pipelines are secure and compliant with
data privacy regulations.
• Integration with Data Tools: Astronomer integrates with various data sources, data
stores, and other tools commonly used in data engineering workflows. This seamless
integration simplifies the process of building data pipelines and allows data teams to
leverage their existing data ecosystem.
In summary, Astronomer solves the challenges associated with managing, scaling, and
operating Airflow, enabling data teams to focus on building efficient and reliable data pipelines
and workflows without the complexities of infrastructure management.
About Astronomer:
Astronomer is a global remote-first company with teams around the world and offices in
Cincinnati, Ohio; New York; San Francisco; and San Jose, Calif. Customers in more than 35
countries trust Astronomer as their partner for data orchestration.
Customer List
• Demand for Astronomer is growing rapidly as more teams seek to understand
their data landscape and harness the potential of data orchestration.
• Astronomer serves a diverse customer base, ranging from innovative startups to major
retailers worldwide.
• A Fortune 10 customer achieved a remarkable $300 million increase in sales by
leveraging Astronomer's capabilities.
• The platform is trusted by nine out of the top 12 gaming companies and three of the top
10 Wall Street banks.
• G2 rating: 4.7 out of 5.
• Gartner, a leading research and advisory company, recognized Astronomer as a Cool
Vendor in Data for Artificial Intelligence and Machine Learning in 2021, validating its
effectiveness and potential.
• Astronomer's team plays a crucial role in the development of Airflow, with 16 of the top
25 all-time contributors being from Astronomer, influencing the platform's direction
since 2018.
• With the acquisition of Datakin, Astronomer enhances its offering by providing data
teams with native operational lineage capabilities integrated into their orchestration
platform. Astronomer aims to set up end-to-end lineage so that its customers can
observe and add context to disparate pipelines.
• Astro, the fully managed data orchestration platform designed and operated by the core
developers behind Apache Airflow and OpenLineage, is now available on Amazon Web
Services (AWS), Microsoft Azure, and Google Cloud in 47 regions across six continents.
Partnership with Astronomer
Benefits for Atlan
• Increased visibility into Airflow pipelines. Atlan could integrate with Astronomer to
provide a holistic view of all Airflow pipelines, including their dependencies, lineage, and
metrics. This would give Atlan users a better understanding of how their data is flowing
through their pipelines and help them to identify any potential problems.
• Increased adoption of Atlan. By partnering with Astronomer, Atlan could reach a wider
audience of Airflow users. This would help to increase the adoption of Atlan, and make
it the go-to tool for managing Airflow pipelines.
• Additional changes in Airflow: Astronomer's team already has a strong relationship with
the Airflow community, so this partnership would give Atlan greater visibility and credibility
within the Airflow ecosystem. It would create opportunities for Atlan to engage with the
Airflow community, gather feedback, and contribute back to the open-source project,
further solidifying its position as a leading player in the data orchestration space.
• With Astronomer's acquisition of Datakin and the addition of native operational lineage
capabilities to their orchestration platform, Atlan can benefit from a seamless
integration of data lineage within Airflow.
Benefits for Astronomer
• Simplified Data Discovery: Atlan's data catalog can integrate with Astronomer to provide
a unified view of datasets and their lineage. Data engineers using Astronomer can
easily discover and access relevant datasets from Atlan's catalog, streamlining their
workflow.
• Enhanced Data Governance: Atlan's data governance capabilities can complement
Astronomer's managed Airflow service by providing data cataloging, data lineage, and
access control features. This integration ensures data pipelines are well-documented
and governed, enhancing data quality and compliance.
• Product Roadmap Alignment: Atlan's expertise in data management trends and
technology advancements can ensure that Astronomer remains at the forefront of
delivering cutting-edge solutions.
• Joint Workshops and Webinars: Atlan and Astronomer can organize joint workshops,
webinars, and knowledge-sharing events to educate the data community about the
benefits of combining data orchestration and operational lineage. These events could
showcase the power of their combined offerings and attract new customers.
The ideal product integration would allow users to seamlessly switch between Atlan and
Astronomer. This would make it easy for users to manage their Airflow pipelines, track the
lineage of their data, and drill directly into pipeline issues.
Thank You
Subrat Kumar Dash
5. REFERENCE LINKS
1. What is Airflow™? — Airflow Documentation (apache.org)
2. https://www.astronomer.io/airflow/
3. https://www.linkedin.com/pulse/apache-airflow-challenges-ranganath-s/
4. https://openlineage.io/
5. https://atlan.com/open-lineage/#openlineage-architecture
6. https://www.youtube.com/watch?v=2s013GQy1Sw&ab_channel=Astronomer
 
Constitution of Company Article of Association
Constitution of Company Article of AssociationConstitution of Company Article of Association
Constitution of Company Article of Associationseri bangash
 
Raising Seed Capital by Steve Schlafman at RRE Ventures
Raising Seed Capital by Steve Schlafman at RRE VenturesRaising Seed Capital by Steve Schlafman at RRE Ventures
Raising Seed Capital by Steve Schlafman at RRE VenturesAlejandro Cremades
 
Unleash Data Power with EnFuse Solutions' Comprehensive Data Management Servi...
Unleash Data Power with EnFuse Solutions' Comprehensive Data Management Servi...Unleash Data Power with EnFuse Solutions' Comprehensive Data Management Servi...
Unleash Data Power with EnFuse Solutions' Comprehensive Data Management Servi...Rahul Bedi
 
Team-Spandex-Northern University-CS1035.
Team-Spandex-Northern University-CS1035.Team-Spandex-Northern University-CS1035.
Team-Spandex-Northern University-CS1035.smalmahmud11
 
Hyundai capital 2024 1q Earnings release
Hyundai capital 2024 1q Earnings releaseHyundai capital 2024 1q Earnings release
Hyundai capital 2024 1q Earnings releaseirhcs
 
Revolutionizing Industries: The Power of Carbon Components
Revolutionizing Industries: The Power of Carbon ComponentsRevolutionizing Industries: The Power of Carbon Components
Revolutionizing Industries: The Power of Carbon ComponentsConnova AG
 
Toyota Kata Coaching for Agile Teams & Transformations
Toyota Kata Coaching for Agile Teams & TransformationsToyota Kata Coaching for Agile Teams & Transformations
Toyota Kata Coaching for Agile Teams & TransformationsStefan Wolpers
 
Event Report - IBM Think 2024 - It is all about AI and hybrid
Event Report - IBM Think 2024 - It is all about AI and hybridEvent Report - IBM Think 2024 - It is all about AI and hybrid
Event Report - IBM Think 2024 - It is all about AI and hybridHolger Mueller
 
zidauu _business communication.pptx /pdf
zidauu _business  communication.pptx /pdfzidauu _business  communication.pptx /pdf
zidauu _business communication.pptx /pdfzukhrafshabbir
 
Powers and Functions of CPCB - The Water Act 1974.pdf
Powers and Functions of CPCB - The Water Act 1974.pdfPowers and Functions of CPCB - The Water Act 1974.pdf
Powers and Functions of CPCB - The Water Act 1974.pdflinciy03
 
Unveiling Gemini: Traits and Personality of the Twins
Unveiling Gemini: Traits and Personality of the TwinsUnveiling Gemini: Traits and Personality of the Twins
Unveiling Gemini: Traits and Personality of the Twinsmy Pandit
 
Pitch Deck Teardown: Terra One's $7.5m Seed deck
Pitch Deck Teardown: Terra One's $7.5m Seed deckPitch Deck Teardown: Terra One's $7.5m Seed deck
Pitch Deck Teardown: Terra One's $7.5m Seed deckHajeJanKamps
 
Falcon Invoice Discounting Setup for Small Businesses
Falcon Invoice Discounting Setup for Small BusinessesFalcon Invoice Discounting Setup for Small Businesses
Falcon Invoice Discounting Setup for Small BusinessesFalcon investment
 
MichaelStarkes_UncutGemsProjectSummary.pdf
MichaelStarkes_UncutGemsProjectSummary.pdfMichaelStarkes_UncutGemsProjectSummary.pdf
MichaelStarkes_UncutGemsProjectSummary.pdfmstarkes24
 
Sedex Members Ethical Trade Audit (SMETA) Measurement Criteria
Sedex Members Ethical Trade Audit (SMETA) Measurement CriteriaSedex Members Ethical Trade Audit (SMETA) Measurement Criteria
Sedex Members Ethical Trade Audit (SMETA) Measurement Criteriamilos639
 
Engagement Rings vs Promise Rings | Detailed Guide
Engagement Rings vs Promise Rings | Detailed GuideEngagement Rings vs Promise Rings | Detailed Guide
Engagement Rings vs Promise Rings | Detailed GuideCharleston Alexander
 
The Truth About Dinesh Bafna's Situation.pdf
The Truth About Dinesh Bafna's Situation.pdfThe Truth About Dinesh Bafna's Situation.pdf
The Truth About Dinesh Bafna's Situation.pdfMont Surfaces
 
NewBase 24 May 2024 Energy News issue - 1727 by Khaled Al Awadi_compresse...
NewBase   24 May  2024  Energy News issue - 1727 by Khaled Al Awadi_compresse...NewBase   24 May  2024  Energy News issue - 1727 by Khaled Al Awadi_compresse...
NewBase 24 May 2024 Energy News issue - 1727 by Khaled Al Awadi_compresse...Khaled Al Awadi
 

Recently uploaded (20)

Inside the Black Box of Venture Capital (VC)
Inside the Black Box of Venture Capital (VC)Inside the Black Box of Venture Capital (VC)
Inside the Black Box of Venture Capital (VC)
 
بروفايل شركة ميار الخليج للاستشارات الهندسية.pdf
بروفايل شركة ميار الخليج للاستشارات الهندسية.pdfبروفايل شركة ميار الخليج للاستشارات الهندسية.pdf
بروفايل شركة ميار الخليج للاستشارات الهندسية.pdf
 
Constitution of Company Article of Association
Constitution of Company Article of AssociationConstitution of Company Article of Association
Constitution of Company Article of Association
 
Raising Seed Capital by Steve Schlafman at RRE Ventures
Raising Seed Capital by Steve Schlafman at RRE VenturesRaising Seed Capital by Steve Schlafman at RRE Ventures
Raising Seed Capital by Steve Schlafman at RRE Ventures
 
Unleash Data Power with EnFuse Solutions' Comprehensive Data Management Servi...
Unleash Data Power with EnFuse Solutions' Comprehensive Data Management Servi...Unleash Data Power with EnFuse Solutions' Comprehensive Data Management Servi...
Unleash Data Power with EnFuse Solutions' Comprehensive Data Management Servi...
 
Team-Spandex-Northern University-CS1035.
Team-Spandex-Northern University-CS1035.Team-Spandex-Northern University-CS1035.
Team-Spandex-Northern University-CS1035.
 
Hyundai capital 2024 1q Earnings release
Hyundai capital 2024 1q Earnings releaseHyundai capital 2024 1q Earnings release
Hyundai capital 2024 1q Earnings release
 
Revolutionizing Industries: The Power of Carbon Components
Revolutionizing Industries: The Power of Carbon ComponentsRevolutionizing Industries: The Power of Carbon Components
Revolutionizing Industries: The Power of Carbon Components
 
Toyota Kata Coaching for Agile Teams & Transformations
Toyota Kata Coaching for Agile Teams & TransformationsToyota Kata Coaching for Agile Teams & Transformations
Toyota Kata Coaching for Agile Teams & Transformations
 
Event Report - IBM Think 2024 - It is all about AI and hybrid
Event Report - IBM Think 2024 - It is all about AI and hybridEvent Report - IBM Think 2024 - It is all about AI and hybrid
Event Report - IBM Think 2024 - It is all about AI and hybrid
 
zidauu _business communication.pptx /pdf
zidauu _business  communication.pptx /pdfzidauu _business  communication.pptx /pdf
zidauu _business communication.pptx /pdf
 
Powers and Functions of CPCB - The Water Act 1974.pdf
Powers and Functions of CPCB - The Water Act 1974.pdfPowers and Functions of CPCB - The Water Act 1974.pdf
Powers and Functions of CPCB - The Water Act 1974.pdf
 
Unveiling Gemini: Traits and Personality of the Twins
Unveiling Gemini: Traits and Personality of the TwinsUnveiling Gemini: Traits and Personality of the Twins
Unveiling Gemini: Traits and Personality of the Twins
 
Pitch Deck Teardown: Terra One's $7.5m Seed deck
Pitch Deck Teardown: Terra One's $7.5m Seed deckPitch Deck Teardown: Terra One's $7.5m Seed deck
Pitch Deck Teardown: Terra One's $7.5m Seed deck
 
Falcon Invoice Discounting Setup for Small Businesses
Falcon Invoice Discounting Setup for Small BusinessesFalcon Invoice Discounting Setup for Small Businesses
Falcon Invoice Discounting Setup for Small Businesses
 
MichaelStarkes_UncutGemsProjectSummary.pdf
MichaelStarkes_UncutGemsProjectSummary.pdfMichaelStarkes_UncutGemsProjectSummary.pdf
MichaelStarkes_UncutGemsProjectSummary.pdf
 
Sedex Members Ethical Trade Audit (SMETA) Measurement Criteria
Sedex Members Ethical Trade Audit (SMETA) Measurement CriteriaSedex Members Ethical Trade Audit (SMETA) Measurement Criteria
Sedex Members Ethical Trade Audit (SMETA) Measurement Criteria
 
Engagement Rings vs Promise Rings | Detailed Guide
Engagement Rings vs Promise Rings | Detailed GuideEngagement Rings vs Promise Rings | Detailed Guide
Engagement Rings vs Promise Rings | Detailed Guide
 
The Truth About Dinesh Bafna's Situation.pdf
The Truth About Dinesh Bafna's Situation.pdfThe Truth About Dinesh Bafna's Situation.pdf
The Truth About Dinesh Bafna's Situation.pdf
 
NewBase 24 May 2024 Energy News issue - 1727 by Khaled Al Awadi_compresse...
NewBase   24 May  2024  Energy News issue - 1727 by Khaled Al Awadi_compresse...NewBase   24 May  2024  Energy News issue - 1727 by Khaled Al Awadi_compresse...
NewBase 24 May 2024 Energy News issue - 1727 by Khaled Al Awadi_compresse...
 

Atlan to Airflow integration.pdf

  • 1. Atlan <> Airflow Integration Case Study Created by: Subrat Kumar Dash
  • 2. ii TABLE OF CONTENTS Introduction TO OPPORTUNITY ........................3 1. ATLAN..........................................................4 2. AIRFLOW......................................................6 3. OPENLINEAGE ..........................................15 4. Partnerships with Astronomer.................26 5. REFERENCE LINKS ...................................30
  • 3. 3 | P a g e INTRODUCTION TO OPPORTUNITY Overview This quarter, Atlan is looking to deeply integrate with Airflow using OpenLineage At Atlan, we strongly believe in communicating our ideas across team async via a notion page (typically 500-1500 words). This marks the first step towards kicking off new initiatives and opens a space for conversation, questions, and comments from the team. The team is also able to learn about the initiatives through additional resources and reading materials available on the page. Your task Write a document explaining why building a deeper integration between Atlan and Airflow is important, what this would mean for our product, and how this will create an impact for customers, and eventually business. What use cases should we focus on to deliver impact and our path forward? From a platform perspective, we’ll use the OpenLineage framework to integrate with Airflow. Open lineage If you had to explain and onboard an internal team of product managers, engineers, and leaders across Sales Engineering, Product, Engineering, and Partnerships.Create some material to onboard everyone on the OpenLineage concept What, Why & Impact 1. What does the ideal Airflow integration look like for Atlan 2. Why is integration important 3. What is the market asking wrt to Airflow, and what are the biggest gaps that a product like Atlan can fill? 4. What are the use cases & user stories for an Airflow integration for Atlan? Bonus if you can find real-world examples 5. Who will be the personas in the data team that’ll use such an integration 6. What this would mean for our product, how this will create an impact on customers Partnerships Astronomer provides a hosted Airflow product. Think through how Atlan can run a partnership with them. What will be the benefits of the partnerships for both companies? What will the ideal product integration look like? Please add more relevant pointers if you were driving this partnership as a product person in Atlan.
How does Atlan work?
• Atlan ingests metadata from your data sources and stores it in a central repository.
• You can then use Atlan's search and discovery features to find data assets, and its lineage tracking and auditing features to understand how data is used.
• Atlan also lets you collaborate with your team on data assets and automate your data workflows.

Who uses Atlan?
• Atlan is used by a variety of companies, including Flipkart, Ola, and Zomato.
• It is also used by several government agencies and educational institutions.
• Other customers include WeWork, Unilever, Nasdaq, Juniper, and more.
Atlan Product List
• Discovery & Catalog
• Data Asset 360
• Column-level Lineage
• Active Data Governance
• Embedded Collaboration
• Glossary

New in Atlan
Atlan AI: Ushering in the Future of Working with Data
• Atlan AI is a new product from Atlan that uses artificial intelligence to help data teams discover, understand, and use data more effectively.
• It applies a range of AI techniques, including natural language processing, machine learning, and graph analysis, to extract insights from data.
• Atlan AI can help data teams to:
  o Identify and understand the relationships between different datasets
  o Discover hidden patterns and trends in data
  o Generate insights that can be used to make better business decisions
• Atlan AI is still in beta, but it has the potential to change the way data teams work.
2. AIRFLOW

What is Airflow?
Apache Airflow™ is an open-source platform to programmatically author, schedule, and monitor workflows. These workflows essentially represent your data pipelines.

Why do we need Airflow? (Example of an ELT process)
Imagine a data pipeline with three tasks, extract, load, and transform, running every day at a specific time. At each step, you interact with an external tool or system, such as an API for extract, Snowflake for load, and dbt for transform.

Now consider the scenarios where the API is no longer available, there is an error in Snowflake, or you made a mistake in your dbt transformations. Any of these steps could fail, and you need a tool to manage these potential issues. And what if you have not one data pipeline but a hundred? Managing this manually would be a nightmare. That is where Airflow comes in: it can automatically handle failures, even across hundreds of data pipelines and millions of tasks. A minimal sketch of such a pipeline as an Airflow DAG follows below.
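To make this concrete, here is a minimal sketch of the daily extract, load, transform pipeline described above, written as an Airflow DAG. The task bodies are placeholders and the DAG id is illustrative; in a real pipeline the callables would talk to the source API, Snowflake, and dbt.

```python
# A minimal daily ELT pipeline, sketched with the classic PythonOperator style.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pull raw data from the source API")

def load():
    print("load the raw data into Snowflake")

def transform():
    print("run dbt transformations on the loaded data")

with DAG(
    dag_id="daily_elt",                      # illustrative name
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",              # run once a day
    catchup=False,
    default_args={
        "retries": 2,                        # automatic failure handling:
        "retry_delay": timedelta(minutes=5), # retry failed tasks before failing the run
    },
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_load = PythonOperator(task_id="load", python_callable=load)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)

    # Linear dependency: extract, then load, then transform.
    t_extract >> t_load >> t_transform
```

The `retries` and `retry_delay` arguments are what give Airflow its automatic failure handling: if the API call fails, the task is retried before the run is marked as failed.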
What are the benefits of using Airflow?
Airflow offers numerous benefits that make it a powerful tool for orchestrating data pipelines.
1. Dynamic and Pythonic: Airflow workflows are coded in Python, making them dynamic and flexible. Unlike static languages such as the XML used in tools like Oozie, Airflow leverages the power and versatility of Python for defining workflows.
2. Scalability: With Airflow, you can execute as many tasks as needed, enabling it to handle large and complex data pipelines.
3. User interface: Airflow provides a visually appealing and intuitive user interface, allowing users to monitor tasks and data pipelines easily.
4. Extensibility: Airflow is highly extensible, enabling users to add their own plugins and custom functionality without waiting for external updates. This flexibility lets users tailor Airflow to their specific needs.
5. Orchestration: As an orchestrator, Airflow ensures that tasks are executed in the right order and at the right time, keeping data flowing through the pipeline effectively.

What Airflow is not
Airflow is neither a data streaming solution nor a data processing framework. Airflow does not process your data itself; data processing should occur within your tasks, using appropriate operators.

Core components of Airflow
1. DAG (Directed Acyclic Graph): the fundamental concept and core building block used to define and manage workflows. A DAG represents a collection of tasks and their dependencies, and ensures tasks execute in a defined order without forming loops, allowing efficient scheduling and automation of workflows.
2. The Airflow scheduler parses DAGs, checks their schedule interval, and, when a schedule is due, passes the tasks within the DAG to the Airflow workers for execution.
3. The Airflow workers pick up scheduled tasks and carry out the actual execution; they are responsible for "doing the work."
4. The Airflow webserver visualizes the DAGs parsed by the scheduler and serves as the main interface for users to monitor DAG runs and their outcomes.
5. The metadata store is a SQLAlchemy-compatible database that stores metadata about your data pipelines, tasks, Airflow users, and other relevant information.

Schematic overview of developing and executing pipelines as DAGs with Airflow:
1. A data engineer writes the workflow as a DAG (a Python file placed in the DAG folder).
2. The scheduler reads the DAG files (tasks, dependencies, and schedule interval) and, when a schedule has passed, checks dependencies and adds due tasks to the execution queue.
3. The workers retrieve tasks from the queue, execute them, and store the task results.
4. The metastore (any SQLAlchemy-compatible database) holds the serialized DAGs and task results.
5. The webserver retrieves DAGs and task results from the metastore so users can monitor execution through the UI.
Growth of Airflow and the Opportunity for Atlan

[Figure: growth of Airflow and its popularity over time.]

A few critical points to consider, from Atlan's point of view, in the Airflow Survey 2022:

1. User roles: More than half of Airflow users are Data Engineers (54%). Other active users include Solutions Architects (13%), Developers (12%), DevOps engineers (6%), and Data Scientists (4%).
So what? The user roles described in the Airflow User Survey align with the potential user personas at Atlan.

2. Company size: Airflow is popular in larger companies, with 64% of users working in companies with 200+ employees, an 11% increase from 2020.
So what? Atlan can cater to the data management requirements of these bigger companies, increasing its market reach and attracting potential customers who are already invested in Airflow.

3. Willingness to recommend: 65.9% of surveyed Airflow users are willing to recommend Apache Airflow (85.7% in 2019, 92% in 2020).
So what? Integrating with Airflow can expose Atlan to a user base that already finds value in Airflow. Atlan can tap into this sentiment and increase the likelihood of recommendation among its own users.

4. Deployments: Most Airflow users (85%) have between 1 and 7 active Airflow instances, and 62.5% have between 11 and 250 DAGs in their largest Airflow instance.
So what? By catering to users who manage multiple instances and DAGs, Atlan's integration can enhance data orchestration visibility and improve efficiency for users who run large-scale data operations.

5. Airflow versions: Close to 85% of users run one of the Airflow 2 versions. While 9.2% still use version 1.10.15, most users on Airflow 1 plan to migrate to Airflow 2 soon.
So what? Integrating with the latest Airflow 2 versions lets Atlan reach most of the market with less effort.

6. Monitoring and executors: Users show growing interest in monitoring, particularly external monitoring services (40.7%) and information from the metadata database (35.7%). The most common executors are Celery (52.7%) and Kubernetes (39.4%).
So what? By integrating with an external data catalog and surfacing information from the metadata database, Atlan can give users comprehensive insight into the performance and health of their data workflows.

7. Usage and lineage: 81.3% of Airflow users run Airflow without any customization. XCom (69.8%) is the most popular method for passing inputs and outputs between tasks (see the sketch below). Lineage is a new topic for many users: 47.5% are interested if supported by Airflow, 29% are not familiar with it, and 13% are not concerned about data lineage.
So what? By offering robust data lineage capabilities as part of the integration, Atlan can meet the demands of users interested in data lineage and encourage adoption among those not yet familiar with the concept.
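Since XCom is the dominant mechanism for passing inputs and outputs between tasks, here is a minimal sketch of how it works; the task ids and payload are illustrative.

```python
# XCom ("cross-communication") passes small pieces of data between tasks.
# A task's return value is pushed to XCom automatically; downstream tasks
# pull it by task_id.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    # The return value is stored in XCom under the key "return_value".
    return {"rows": 1200, "source": "orders_api"}

def load(ti):
    # Pull the upstream task's return value from XCom.
    payload = ti.xcom_pull(task_ids="extract")
    print(f"loading {payload['rows']} rows from {payload['source']}")

with DAG(dag_id="xcom_demo", start_date=datetime(2023, 1, 1), schedule_interval=None) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_load = PythonOperator(task_id="load", python_callable=load)
    t_extract >> t_load
```

Return values are serialized into the metadata store, so XCom suits small payloads (ids, counts, file paths) rather than the data itself.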
Limitations of Airflow
1. High complexity and non-intuitive development: Airflow's "pipeline/workflow as code" approach is powerful but can lead to complexity. Creating tasks in DAGs and sequencing them into workflows requires decent Python programming skills, and some users question the need for workflows as code instead of configuration, since coded workflows can make debugging in production counterintuitive.
2. Steep learning curve: Adopting Airflow demands that data engineers understand a complex framework, including the scheduler, executor, and workers. The extensive list of configuration parameters, while offering flexibility, requires a significant time investment.
3. No pipeline versioning: Airflow does not store changes to workflow metadata, which makes it hard to track workflow history and perform pipeline rollbacks, especially in production environments.
4. Hard to debug: As workflows become complex, the many interdependencies make debugging a nightmare. The use of operators and templates in DAGs, their interchangeability, and the ability to both execute code and orchestrate workflows add further complexity; local and remote executors pose their own challenges, making root cause analysis daunting in such intricate environments.
5. Rudimentary data lineage: Lineage in Airflow is marked "experimental - subject to change" in the documentation. As pipelines evolve, the DAG must be rechecked and possibly reworked, and Airflow lacks native information about jobs, owners, or repositories, making it hard to understand downstream impacts of upstream changes.

How does Atlan fill the Airflow gap without diverting from its strategy?
Scenario: An executive at a large e-commerce company relies on a real-time sales dashboard to monitor daily sales performance across product categories and regions. One morning, while reviewing the dashboard, the executive notices a sudden drop in sales for a specific product category in a particular region. Concerned about the potential impact on revenue, the executive immediately raises the issue with the data team to investigate and resolve the problem.
Before the Atlan integration: The data team, consisting of data analysts and engineers, starts investigating. They use Apache Airflow to manage the pipelines responsible for fetching, processing, and loading sales data into the dashboard, but tracking the issue is hard because of the complex workflow and lack of pipeline context. The team spends several hours manually debugging the pipelines and pulling logs from Airflow, trying to pinpoint the exact cause of the sales drop.

With the Atlan integration:
• Dashboard alerts and pipeline context: With Atlan integrated with Apache Airflow, the data team has set up dashboard alerts that automatically flag pipeline issues. As soon as the sales drop is detected, an alert appears on the dashboard, notifying the executive of the anomaly.
• Pipeline status: The executive opens the Atlan data asset profile associated with the sales dashboard and can quickly check pipeline status and metrics, including data freshness and run schedule. This immediately shows whether the issue is due to delayed data updates or other pipeline-related problems.
• Data lineage visualization: The data team uses Atlan's lineage visualization to trace the sales data flow from source systems to the dashboard. They quickly identify that a change in the data transformation pipeline caused the sales drop in that region, and can focus their debugging and resolution efforts accordingly.
• Collaborative data analysis: Data analysts, scientists, and business users can all access pipeline context and lineage information within Atlan, enabling them to understand the issue in detail and contribute to the resolution.
• Data security and PII masking: With Atlan's security and governance features, sensitive Personally Identifiable Information (PII) is handled securely throughout the pipeline. PII masking protects customer privacy and supports compliance with data regulations, so the executive can rest assured that sensitive information is safeguarded.
Atlan and Airflow together create an ecosystem of trust and transparency for data teams, fostering a better data engineering experience focused on building knowledge and confidence in data. Through this integration, Atlan enriches data assets with essential pipeline context: metadata from Airflow pipelines is shared with Atlan data asset profiles, where data analysts, scientists, and business users can access it. This transparency keeps data teams and consumers informed about the status of the pipeline behind each data asset.

[Figure: integration between Atlan and Apache Airflow. Metadata flows into Atlan from upstream sources (CRM, ERP, Billing, Marketing), from the data warehouse, from the Airflow pipelines on either side of it, and from downstream consumers (Reporting, Analytics, Data Mining). The Airflow pipeline metadata is the current gap that the Atlan integration fills.]

Who will be the personas in the data team that'll use such an integration?
• Data Engineers
• Data Analysts
• Data Scientists
• Business Executives (business users and stakeholders)
• Data Stewards

Persona: Data Engineer
Job: Design, build, and maintain data pipelines and infrastructure to ensure the efficient and reliable flow of data within the organization.
Goal: Create scalable and robust data pipelines that support the organization's data-driven initiatives, delivering high-quality data that is accurate, consistent, and accessible for analytics and business decision-making.
Problems: Data Engineers often face challenges in managing complex data workflows and maintaining data quality, including:
• Data quality and integrity: Ensuring data quality and integrity throughout the pipeline is crucial; inaccurate or incomplete data leads to incorrect analyses and decisions.
• Monitoring and debugging: Issues in data pipelines must be identified and resolved quickly to minimize downtime and maintain data reliability.
Without data lineage functionality: Data Engineers struggle to meet these goals effectively. Manual monitoring, debugging, and data quality triage are time-consuming and prone to human error, and the lack of transparency into pipeline health delays issue detection and resolution.
With Atlan's data lineage functionality: The Data Engineer gains immediate visibility into pipeline context and status through the data asset profile associated with the pipeline, which shows relevant metrics, data freshness, and lineage information. Using lineage visualization, the engineer quickly pinpoints the source of a slowdown to a specific data source, tracing the issue to a schema change that impacted the data transformation process.
3. OPENLINEAGE

What is data lineage?
Data lineage is the process of tracking and documenting relationships among datasets within an organization's ecosystem. It traces data flow from source to destination, establishing parent/child relationships. This transparency aids data governance, compliance, quality management, and issue troubleshooting, ensuring data reliability and accuracy.

Why do we need data lineage? The hierarchy of data needs
The hierarchy starts with gaining access to interesting data; the focus then shifts to real-time updates, so availability and freshness of data become essential, followed by data quality. Consistently supplied with good data, an organization can optimize and improve processes, identify anomalies, and drive business benefits. Data lineage is vital to achieving these higher-order benefits.

[Figure: hierarchy of data needs, from Data Availability at the base, through Data Freshness and Data Quality, up to Business Opportunity and New Opportunity.]
Misunderstanding of data democratization
We are in the era of data democratization. Democratizing data production and consumption empowers individuals across the organization to work with data independently, but without shared context it leads to a fragmented and chaotic data ecosystem. This fragmentation makes it hard to understand critical information about the data, such as its source, schema, ownership, creation date, and update frequency. Without that understanding, data becomes a "black box" whose origins, transformations, and usage are difficult to trace. Democratization is beneficial for fostering a healthy data ecosystem, but the absence of data lineage hinders data transparency and hurts decision-making.

Principle of Data Democratization: https://www.linkedin.com/feed/update/urn:li:activity:7078983264836747264/
What is OpenLineage?
OpenLineage is a framework for transparent, standardized data lineage tracking, providing clear insight into data origins, transformations, and dependencies throughout the data ecosystem.

OpenLineage is a community-driven initiative with diverse stakeholders interested in data lineage. Instead of being tied to a single vendor, it aims to establish an open, industry-wide standard for data lineage: a common language for discussing data movement across tools that benefits everyone in the ecosystem. Its foremost purpose is to collaboratively define an open standard for collecting lineage information.

How OpenLineage started
• In 2017, Julien Le Dem, one of the co-founders of OpenLineage, faced challenges at WeWork in understanding data pipeline dependencies. To address this, he developed Marquez, aiming to create a "Google Maps" for data pipelines; WeWork later open-sourced Marquez so other companies could benefit.
• Marquez is an open-source metadata service for tracking the lineage of datasets. It provides a central metadata repository and API to track the provenance and lineage of datasets, making it easier for data engineers and data scientists to understand and trace data flow across systems.
• In 2020, the team behind Marquez shifted focus to data observability and launched OpenLineage, an independent open-source initiative focused on providing a common
data lineage standard and framework for the industry. It aims to establish a standardized way of collecting, storing, and querying data lineage information, enabling seamless integration across data tools and frameworks.

How is data lineage solved today?
Traditional ("Google") approach: The source code repository is scanned, and queries are extracted from the code. Analysis then predicts the data relationships and interactions that would result if these queries ran. This method is useful for design-time lineage, helping teams understand data structures during application development and predict data footprints. However, it provides no observational lineage, since it does not track actual data execution or runtime information.

Passive data lineage: This approach integrates with the data store or warehouse and regularly analyzes queries, logs, and access information. Lineage relationships are extracted from the data source and stored in a central metadata repository. This method is commonly used for governance purposes.

[Figures: in both approaches, the extracted relationships are written to a central data lineage repository; in the traditional approach, all source code is crawled and parsed first.]
Active data lineage: Instead of directly accessing the data source, this approach integrates with the data orchestration system. It observes job executions and tracks their impact on data; as with passive lineage, the information is reported to a central metadata repository.

Both approaches share a central repository for lineage metadata. The key difference is where the lineage information is collected from: the data source (passive) or the data orchestration system (active). The choice depends on the specific scenario and requirements, and both approaches offer valuable insights for lineage analysis and management.

How does OpenLineage benefit the ecosystem?
Before OpenLineage:
• Each project had to create custom metadata collection integrations, leading to duplicated effort.
• External integrations could break with updates to the scheduler or data processing framework, requiring constant backward compatibility checks.
(Diagrams: https://openlineage.io/docs/)

With OpenLineage:
• Integration efforts are shared across projects, reducing duplication.
• Integrations can be built into the underlying scheduler and/or data processing framework, eliminating compatibility concerns and the need for constant catch-up.

In practice, for Airflow this means lineage collection is configured once per deployment rather than coded into each pipeline; a sketch follows below.
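To illustrate the shared-integration point for Airflow specifically, here is roughly what enabling OpenLineage on an Airflow deployment looks like. This is a sketch assuming the openlineage-airflow package and a Marquez-compatible backend; the endpoint and namespace values are illustrative, and exact configuration options vary by Airflow and provider version.

```python
# Enabling OpenLineage for Airflow is a deployment-level concern; individual
# DAG files do not need any lineage-specific code.
#
#   pip install openlineage-airflow
#
# Then point the integration at a lineage backend via environment variables:
#
#   export OPENLINEAGE_URL=http://marquez:5000   # where lineage events are sent (illustrative)
#   export OPENLINEAGE_NAMESPACE=analytics       # logical namespace for emitted jobs (illustrative)
#
# From then on, every DAG run emits lineage events (start, complete, fail)
# describing the job, its run, and the datasets it reads and writes.
```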
How does OpenLineage fit into the ecosystem?
A simple way to look at the OpenLineage architecture is to break it down into three essential components:
• Producers: the components that capture metadata about data pipelines and their dependencies and send it to the OpenLineage server.
• Backend: the infrastructure that stores, processes, and analyzes the metadata.
• Consumers: the components that fetch metadata from the server and feed it to web-based interfaces, so you can visualize the data lineage mapping (see the query sketch after the tech stack list).

Tech stack
The OpenLineage tech stack comprises:
• Metadata collection APIs: integrate with diverse data sources and data pipeline tools such as Spark, Airflow, and dbt.
• Lineage server (Marquez): serves as the central repository for storing metadata related to lineage events.
• Web-based UI: enables visualization of data lineage and dependencies; REST APIs facilitate capturing and querying metadata.
• Libraries: client libraries for common programming languages ease integration and usage within different environments.
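On the consumer side, here is a sketch of fetching a lineage graph from a Marquez backend over its REST API. The endpoint shape follows Marquez's documented lineage API, but the host, namespace, and dataset names are illustrative, and the response format may differ across Marquez versions.

```python
# Query a Marquez backend (the reference OpenLineage server) for the lineage
# graph around one dataset. Marquez identifies graph nodes with ids like
# "dataset:<namespace>:<name>" or "job:<namespace>:<name>".
import requests

MARQUEZ_URL = "http://localhost:5000"                      # illustrative endpoint
node_id = "dataset:food_delivery:public.delivery_7_days"   # illustrative dataset node

resp = requests.get(
    f"{MARQUEZ_URL}/api/v1/lineage",
    params={"nodeId": node_id, "depth": 2},  # how many hops of the graph to return
    timeout=10,
)
resp.raise_for_status()

# Print the JOB and DATASET nodes surrounding our dataset.
for node in resp.json()["graph"]:
    print(node["type"], node["id"])
```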
The data model of OpenLineage revolves around the life cycle of a job. Every job run produces a series of events that provide observational lineage updates:
• Job start: When a run starts, an event of type START is sent, carrying the start time and the producer's information.
• While the job runs: Additional events can be sent along the life cycle, augmenting the lineage with metadata about input and output datasets. Runs that end abnormally are reported with ABORT or FAIL events.
• Job completion: When the run finishes successfully, a COMPLETE event is sent, recording the end time of the job run.

Standardized entities: The core model defines generic entities such as datasets, jobs, and runs, which are uniquely identified using consistent naming strategies. These entities form the foundation for capturing lineage information across the data ecosystem.

Facets: A facet is an atomic piece of metadata attached to one of the core entities. Facets enable entity enrichment, making the core model highly extensible. Users can define and include custom facets to capture additional metadata specific to their use case.

Examples of facets:
• For a dataset: stats, version, schema
• For a job: source code, dependencies, source control, query plan
• For a run: scheduled time, batch ID

A short sketch of attaching a facet to a dataset follows below.
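For instance, a dataset can carry a schema facet. Here is a minimal sketch using the openlineage-python client; module paths and class names follow the 0.x client and may differ across versions, and the dataset namespace, name, and fields are illustrative.

```python
# Attach a standard schema facet to a dataset entity. Facets are extra
# metadata keyed by name on the core entities (dataset / job / run).
from openlineage.client.run import Dataset
from openlineage.client.facet import SchemaDatasetFacet, SchemaField

orders = Dataset(
    namespace="s3://sales-metrics",   # where the dataset lives
    name="orders.csv",                # dataset name within the namespace
    facets={
        "schema": SchemaDatasetFacet(
            fields=[
                SchemaField(name="order_id", type="INTEGER"),
                SchemaField(name="amount", type="DECIMAL"),
            ]
        )
    },
)
```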
Data model
In the OpenLineage data model:
• Dataset: a unit of data, like a table in a database or an object in cloud storage. Changes to a dataset occur when the job writing to it completes.
• Job: a data pipeline process responsible for creating or consuming datasets. Jobs may evolve over time, and documenting these changes is vital for understanding the pipeline's functioning.
• Run: an executed instance of a job. It contains essential information, including the start and completion (or failure) time of the job run. Runs facilitate the observation of job changes over time.

Entities are identified by consistent names, for example a run ID such as 1c0386aa-0979-41e3-9861-3a330623effa, job names such as prod.customer_segmentation or dev.run_unit_tests, and dataset names such as postgres://db.foo.com/metrics.sales.orders or s3://sales-metrics/orders.csv.

How does it create lineage?
OpenLineage builds lineage from correlations between jobs and datasets. When the first job starts, OpenLineage observes the metadata about that job and the datasets it interacts with. As subsequent jobs run, they may reference the same datasets, establishing correlations. Lineage is established by linking jobs to the datasets they create or consume; the sketch below shows a producer emitting these events.
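Putting the pieces together, here is a sketch of a producer emitting the run life cycle events for one job with the openlineage-python client. As before, module paths follow the 0.x client and may differ across versions; the backend URL, namespace, and dataset names are illustrative.

```python
# A producer emits observational lineage as a pair of run life cycle events:
# START when the job run begins, COMPLETE (or FAIL/ABORT) when it ends.
from datetime import datetime, timezone
from uuid import uuid4

from openlineage.client import OpenLineageClient
from openlineage.client.run import RunEvent, RunState, Run, Job, Dataset

PRODUCER = "https://example.com/my-pipeline-runner"      # identifies the emitter (illustrative)

client = OpenLineageClient(url="http://localhost:5000")  # e.g. a Marquez instance

run_id = str(uuid4())                                    # correlates the two events below
job = Job(namespace="analytics", name="prod.customer_segmentation")

# Announce that the run has started.
client.emit(RunEvent(
    eventType=RunState.START,
    eventTime=datetime.now(timezone.utc).isoformat(),
    run=Run(runId=run_id),
    job=job,
    producer=PRODUCER,
))

# ... the job does its actual work here ...

# Announce completion, including the datasets read and written. Lineage is
# built by correlating these dataset names across jobs.
client.emit(RunEvent(
    eventType=RunState.COMPLETE,
    eventTime=datetime.now(timezone.utc).isoformat(),
    run=Run(runId=run_id),
    job=job,
    producer=PRODUCER,
    inputs=[Dataset(namespace="postgres://db.foo.com", name="metrics.sales.orders")],
    outputs=[Dataset(namespace="s3://sales-metrics", name="orders.csv")],
))
```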
These correlations are based on dataset names, which makes well-defined, consistent naming conventions essential for effective lineage tracking.

OpenLineage features
OpenLineage offers a comprehensive set of features to collect, track, and analyze lineage within your data ecosystem. While still in development, key features include:

Metadata capture:
• Technical metadata for datasets and jobs, including schema, version, ownership details, run time, and job dependencies.
• Dataset-level and column-level data quality metrics, such as row count, byte size, null count, distinct count, average, min, max, and quantiles.
• Column-level lineage tracing for Apache Spark data.

Seamless integration:
• Integration with various ETL frameworks, data orchestration engines, data quality engines, and data lineage tools.
• Support for popular data stores and warehouses, including Amazon S3, Amazon Redshift, HDFS, Google BigQuery, PostgreSQL, Azure Synapse, and Snowflake.
• Integration with data orchestration tools like Apache Airflow, Apache Spark, and dbt.
• Compatibility with data lineage tools like Egeria.

These features give data teams enhanced visibility and control over their data pipelines, enabling efficient data management and governance throughout the data lifecycle.

Use cases of OpenLineage
1. Gain visibility into the dependencies between datasets and jobs, understanding how data flows across the entire ecosystem and making data observability easy.
2. Quickly identify the root cause of data-related issues or errors by tracing lineage and understanding the flow of data.
3. Map the impact of changes in datasets or jobs across the data ecosystem, enabling proactive decision-making.
4. Perform precise backfills by understanding the lineage of affected datasets and jobs, ensuring data consistency.
5. Track changes to datasets and jobs over time, facilitating effective change management and version control.
6. Analyze historical data lineage to gain insight into data evolution and usage patterns.
7. Conduct automated audits of data pipelines, ensuring compliance and data governance standards are met.

Adoption and additional information
• Amazon: https://aws.amazon.com/blogs/big-data/automate-data-lineage-on-amazon-mwaa-with-openlineage/
• Manta: https://openlineage.io/blog/manta-integration/
• "Berlin Buzzwords 2023: Column-level lineage is coming to the rescue"
• Getting started: https://openlineage.io/getting-started/
• The lineage API: https://openlineage.io/blog/explore-lineage-api/
• Facets: https://openlineage.io/blog/dataquality_expectations_facet/
• Spark example: https://openlineage.io/blog/openlineage-spark/
4. PARTNERSHIPS WITH ASTRONOMER

What problems does Astronomer solve?
Astronomer solves several problems related to managing and operating Apache Airflow, the popular open-source data orchestration tool.
• Infrastructure management: Setting up, managing, and maintaining the underlying infrastructure for Airflow can be complex and time-consuming. Astronomer provides a managed service that takes care of infrastructure provisioning, so data teams can focus on building and managing data pipelines rather than infrastructure tasks.
• Scalability: As data pipelines and workflows grow in complexity and scale, managing Airflow's scalability becomes crucial. Astronomer deploys Airflow on Kubernetes, which offers scalable and flexible infrastructure, allowing Airflow instances to handle large volumes of data and complex workflows.
• Deployment and setup: Astronomer offers a simplified, streamlined deployment process for Airflow instances. With single-click provisioning, users can quickly set up and get started with Airflow, reducing the time and effort required to configure and launch environments.
• Monitoring and support: Astronomer provides monitoring and support services to keep Airflow instances running smoothly, including performance monitoring, troubleshooting assistance, and timely support for issues that arise during pipeline execution.
• Security and authentication: Astronomer provisions auxiliary services, including logging and authentication, to strengthen the security of Airflow instances and data processing, helping data teams keep their pipelines secure and compliant with data privacy regulations.
• Integration with data tools: Astronomer integrates with various data sources, data stores, and other tools commonly used in data engineering workflows. This seamless
integration simplifies the process of building data pipelines and lets data teams leverage their existing data ecosystem.

In summary, Astronomer removes the complexity of managing, scaling, and operating Airflow, so data teams can focus on building efficient and reliable data pipelines and workflows.

About Astronomer
Astronomer is a global remote-first company with teams around the world and offices in Cincinnati, Ohio; New York; San Francisco; and San Jose, Calif. Customers in more than 35 countries trust Astronomer as their partner for data orchestration.

Customer list
• Demand for Astronomer is growing rapidly as more teams seek to understand their data landscape and harness the potential of data orchestration.
• Astronomer serves a diverse customer base, ranging from innovative startups to major retailers worldwide.
• A Fortune 10 customer achieved a remarkable $300 million increase in sales by leveraging Astronomer's capabilities.
• The platform is trusted by nine of the top 12 gaming companies and three of the top 10 Wall Street banks.
• G2 rating: 4.7 out of 5.
• Gartner, a leading research and advisory company, recognized Astronomer as a Cool Vendor in Data for Artificial Intelligence and Machine Learning in 2021, validating its effectiveness and potential.
• Astronomer's team plays a crucial role in the development of Airflow itself: 16 of the top 25 all-time contributors are from Astronomer, and they have influenced the platform's direction since 2018.
• With the acquisition of Datakin, Astronomer enhances its offering by giving data teams native operational lineage capabilities integrated into their orchestration platform. Astronomer aims to provide end-to-end lineage so its customers can observe and add context to disparate pipelines.
• Astro, the fully managed data orchestration platform designed and operated by the core developers behind Apache Airflow and OpenLineage, is available on Amazon Web Services (AWS), Microsoft Azure, and Google Cloud in 47 regions across six continents.

Partnership with Astronomer

Benefits for Atlan
• Increased visibility into Airflow pipelines: Atlan could integrate with Astronomer to provide a holistic view of all Airflow pipelines, including their dependencies, lineage, and metrics. This would give Atlan users a better understanding of how data flows through their pipelines and help them identify potential problems.
• Increased adoption of Atlan: By partnering with Astronomer, Atlan could reach a wider audience of Airflow users, increasing adoption and making Atlan the go-to tool for managing Airflow pipelines.
• Influence on Airflow's direction: Astronomer's team already has a strong relationship with the Airflow community, so this partnership would give Atlan greater visibility and credibility within the Airflow ecosystem. It would create opportunities for Atlan to engage with the
Airflow community, gather feedback, and contribute back to the open-source project, further solidifying its position as a leading player in the data orchestration space.
• With Astronomer's acquisition of Datakin and the addition of native operational lineage to its orchestration platform, Atlan can benefit from a seamless integration of data lineage within Airflow.

Benefits for Astronomer
• Simplified data discovery: Atlan's data catalog can integrate with Astronomer to provide a unified view of datasets and their lineage. Data engineers using Astronomer can easily discover and access relevant datasets from Atlan's catalog, streamlining their workflow.
• Enhanced data governance: Atlan's governance capabilities complement Astronomer's managed Airflow service with data cataloging, data lineage, and access control features. This integration ensures data pipelines are well documented and governed, improving data quality and compliance.
• Product roadmap alignment: Atlan's expertise in data management trends and technology advancements can help Astronomer stay at the forefront of delivering cutting-edge solutions.
• Joint workshops and webinars: Atlan and Astronomer can organize joint workshops, webinars, and knowledge-sharing events to educate the data community about the benefits of combining data orchestration and operational lineage. These events could showcase the power of the combined offering and attract new customers.

The ideal product integration would let users move seamlessly between Atlan and Astronomer, making it easy to manage Airflow pipelines, track the lineage of their data, and check pipeline issues directly.

Thank you
Subrat Kumar Dash
5. REFERENCE LINKS
1. What is Airflow™? - Airflow Documentation (apache.org)
2. https://www.astronomer.io/airflow/
3. https://www.linkedin.com/pulse/apache-airflow-challenges-ranganath-s/?trackingId=yEYBjnRBQBq3b3FGDwsahw%3D%3D
4. https://openlineage.io/
5. https://atlan.com/open-lineage/#openlineage-architecture
6. https://www.youtube.com/watch?v=2s013GQy1Sw&ab_channel=Astronomer