SlideShare a Scribd company logo
1 of 44
Download to read offline
The Future Roadmap for the
Composable Data Stack
Wes McKinney
Data Council Austin 2024
⌵
⌵
Agenda
● My background
● Posit and Me
● Composable data systems: an overview
● Active growth areas in 2024
● Opportunities and Predictions
⌵
⌵
Me
Principal
Architect
General Partner
Co-founder,
Advisor
Ibis
Creator Co-creator
Creator Advisor,
Investor
Author
⌵
⌵
Voltron Data (2021 — )
● Unlocking the potential of GPU
accelerated analytics for large
scale workloads
● Enterprise support for Apache
Arrow
● Open source partnerships
(Meta, Snowflake, others)
● $115M raised, 130 headcount
and growing
⌵
⌵
⌵
⌵
Posit PBC
● Founded 2009, originally as RStudio
● 300-person, remote-first company
● Open source software for code-first data science and
technical communication
● Certified B corp, no plans to go public or be acquired
● Designing for long-term resiliency, creating a 100-year
company
⌵
⌵
How Posit Makes Money
● Making open source work in the enterprise
● Products for managing the data science lifecycle
○ Posit Workbench: Managed IDEs and Jupyter
Notebooks with enterprise-grade security & compliance
○ Posit Connect: Publish and share interactive data apps,
notebooks, and reports
○ Posit Package Manager: securely manage deploy open
source and proprietary Python and R packages
⌵
⌵
My History with Posit
● 2016: Hadley Wickham and I make Feather file format
● 2018: I partner with RStudio to create non-profit Ursa Labs
for Arrow development
● 2020: Ursa team spins out into startup, becomes Voltron
Data in 2021
● 2022: RStudio becomes Posit, embracing polyglot future
● 2023: I rejoin Posit as a principal architect
⌵
⌵
https://ursalabs.org/blog/announcing-ursalabs/
The reality is that Hadley and I think the “language wars” are stupid when the
real problem we are solving is human user interface design for data analysis.
… The programming languages are our medium for crafting accessible and
productive tools. It has long been a frustration of mine that it isn’t easier to
share code and systems between R and Python.
…
I found that we share a passion for the long-term vision of empowering data
scientists and building a positive relationship with the open source user community.
Critically, RStudio has avoided the “startup trap” and managed to build a sustainable
business while still investing the vast majority of its engineering resources in open
source development.
⌵
⌵
https://ursalabs.org/blog/announcing-ursalabs/
The reality is that Hadley and I think the “language wars” are stupid when the real
problem we are solving is human user interface design for data analysis. … The
programming languages are our medium for crafting accessible and productive tools.
It has long been a frustration of mine that it isn’t easier to share code and systems
between R and Python.
…
I found that we share a passion for the long-term vision of empowering data
scientists and building a positive relationship with the open source user
community. Critically, RStudio has avoided the “startup trap” and managed to
build a sustainable business while still investing the vast majority of its
engineering resources in open source development.
⌵
⌵
Creative Solutions for Funding OSS Work
Sponsors
⌵
⌵
VLDB 2023
⌵
⌵
What is a “Composable Data System”?
● Builds with open standards and protocols
● Designed around modularity, reuse, and interoperability
with other systems that use shared interfaces
● Resists “vertical integration”, builds in a virtual cycle with
relevant open source ecosystem projects
⌵
⌵
“We envision that by decomposing data management systems
into a more modular stack of reusable components, the
development of new engines can be streamlined, while reducing
maintenance costs and ultimately providing a more consistent
user experience. By clearly outlining APIs and encapsulating
responsibilities, data management software could more easily be
adapted, for example, to leverage novel devices and accelerators, as
the underlying hardware evolves. By relying on a modular stack that
reuses execution engine and language frontend, data systems code
could provide a more consistent experience and semantics to users,
from transactional to analytic systems, from stream processing to
machine learning workloads."
Pedrera et al. 2023
⌵
⌵
“We envision that by decomposing data management systems into a
more modular stack of reusable components, the development of new
engines can be streamlined, while reducing maintenance costs and
ultimately providing a more consistent user experience. By clearly
outlining APIs and encapsulating responsibilities, data
management software could more easily be adapted, for example,
to leverage novel devices and accelerators, as the underlying
hardware evolves. By relying on a modular stack that reuses
execution engine and language frontend, data systems code could
provide a more consistent experience and semantics to users, from
transactional to analytic systems, from stream processing to machine
learning workloads."
Pedrera et al. 2023
⌵
⌵
“We envision that by decomposing data management systems into a
more modular stack of reusable components, the development of new
engines can be streamlined, while reducing maintenance costs and
ultimately providing a more consistent user experience. By clearly
outlining APIs and encapsulating responsibilities, data management
software could more easily be adapted, for example, to leverage novel
devices and accelerators, as the underlying hardware evolves. By
relying on a modular stack that reuses execution engine and
language frontend, data systems code could provide a more
consistent experience and semantics to users, from transactional
to analytic systems, from stream processing to machine learning
workloads."
Pedrera et al. 2023
⌵
⌵
Why now?
● 1st-Gen Big Data / Hadoop Era: 2006 - 2013
○ MapReduce popularizes disaggregated storage + compute
● 2nd-Gen 2013 - 2021
○ Vendors shift from proprietary software to delivery of services
○ Open source standards emerge and are popularized
○ Rapid progress in storage, networking, computing performance
● 3rd-Gen 2021 - … ?
○ Open standards for composability become widely accepted
○ Next-gen components emerge, existing systems start retrofitting with
composable pieces
⌵
⌵
Why now?
“...We foresee that composability is soon to cause
another major disruption to how data management
systems are designed. We foresee that monolithic
systems will become obsolete, and give space to a
new composable era for data management.”
Pedrera et al. 2023
⌵
⌵
NYC R Conference 2015
https://www.youtube.com/watch?v=stlxbC7uIzM&t=264s
⌵
⌵
McSherry, Isard, Murray 2015
⌵
⌵
“The published work on big data systems… detail[s]
their systems’ impressive scalability, [but] few directly
evaluate their absolute performance against
reasonable benchmarks. To what degree are these
systems truly improving performance, as opposed to
parallelizing overheads that they themselves
introduce?”
- McSherry, Isard, Murray 2015
⌵
⌵
Glimpse of the Composable Landscape
Engines
Substrait
ADBC, Flight
Storage
Protocols
Query Interface
Optimization
…?
⌵
⌵
Missing pieces
● Distributed computing (Spark, Dask, Ray, “Serverless”)
● ETL modeling (dbt and others)
● Workflow / Infra Orchestration (Airflow, Dagster, Flyte, …)
● Development Environments
● Metadata Catalogues
⌵
⌵
2016: A Cross-Language Fast
In-Memory Data Format
⌵
⌵
https://www.slideshare.net/wesm/data-science-without-borders-jupytercon-2017
2017
⌵
⌵
2017
⌵
⌵
Databases Equipped for the Composable Era
● ADBC : Arrow DataBase Connectivity
● FlightSQL: An Arrow-native wire protocol for SQL systems
⌵
⌵
⌵
⌵
2018
⌵
⌵
Layers of the Composable Cake
⌵
⌵
Modular Execution Engines
● Apache DataFusion (Rust)
● DuckDB (C++)
● Velox (C++)
● Theseus (C++ / CUDA)
⌵
⌵
Connecting Execution with Frontend + Optimizer
● Intermediate Representation (IR) for Queries
Substrait
https://substrait.io/
⌵
⌵
Modular Acceleration Projects
● Prestissimo (Velox in Presto)
● Apache DataFusion Comet (DataFusion in Spark)
● Apache Gluten (incubating) (Velox in Spark)
⌵
⌵
A Multi-Engine Data Stack?
● Why be locked into one full-stack execution engine (like
Spark)?
● Cost/performance/latency varies greatly across workload
shapes and sizes
● You can do a lot with DuckDB, but actual big data does still
exist
HT to Julien Hurault ! https://juhache.substack.com/p/multi-engine-data-stack-v0
⌵
⌵
⌵
⌵
Barriers to a multi-engine data stack
● SQL dialects are non-portable
and feature a wide spectrum of
supported features
○ sqlglot is helping with this!
● Deciding which engine-to-use is
non-trivial
● Diverse orchestration and
infrastructure requirements
⌵
⌵
CIDR 2021
⌵
⌵
⌵
⌵
Enter the Bird
● An effort to harmonize the best of modern SQL with
Pythonic fluent data frames
● Fixing a long list of shortcomings in the pandas API
● Leverage the benefits of a modern PL when creating
complex analytical SQL queries
● Bringing portability (> 20 backends supported) to provide a
unified Python API for a multi-engine data stack
ibis-project.org
⌵
⌵
https://ibis-project.org/posts/bigquery-arrays/
⌵
⌵
Other Ibis Notes
● Created in 2015 at the same time as Arrow
● Strong focus on deep SQL support
● Facilitates dev to prod with few code changes
● Integration with many third party libraries (streamlit, altair,
others)
● Helps automating the drudgery of tedious multi-step DDL
workflows in databases
⌵
⌵
Why this matters https://voltrondata.com/theseus
⌵
⌵
A Future Multi-Engine Stack
● An execution engine tailored for different data scales
○ <= 1TB: DuckDB and Friends
○ 1 - 10TB: Spark, Dask, Ray, etc.
○ > 10TB: Hardware-accelerated Processing (e.g. Theseus)
● A portable language front end (Ibis, Malloy, PRQL, or a
standard transpilable SQL)
● Arrow-native API and wire transport
Closing Thoughts

More Related Content

What's hot

Introduction to GCP (Google Cloud Platform)
Introduction to GCP (Google Cloud Platform)Introduction to GCP (Google Cloud Platform)
Introduction to GCP (Google Cloud Platform)Pulkit Gupta
 
はじめよう FinOps クラウドコスト最適化への第一歩とは 日本IBMカスタマーサクセスチーム
はじめよう FinOps クラウドコスト最適化への第一歩とは 日本IBMカスタマーサクセスチームはじめよう FinOps クラウドコスト最適化への第一歩とは 日本IBMカスタマーサクセスチーム
はじめよう FinOps クラウドコスト最適化への第一歩とは 日本IBMカスタマーサクセスチーム勇 黒沢
 
All about paas_iaas_saas_29.01.2015
All about paas_iaas_saas_29.01.2015All about paas_iaas_saas_29.01.2015
All about paas_iaas_saas_29.01.2015mihaiburada
 
Data librarian: helping researchers with data issues
Data librarian: helping researchers with data issues Data librarian: helping researchers with data issues
Data librarian: helping researchers with data issues Mari Elisa Kuusniemi
 
How Delft University's Engineering Students Make Their EV Formula-Style Race ...
How Delft University's Engineering Students Make Their EV Formula-Style Race ...How Delft University's Engineering Students Make Their EV Formula-Style Race ...
How Delft University's Engineering Students Make Their EV Formula-Style Race ...InfluxData
 
Social Media Analytics using Amazon QuickSight - AWS Online Tech Talks
Social Media Analytics using Amazon QuickSight - AWS Online Tech TalksSocial Media Analytics using Amazon QuickSight - AWS Online Tech Talks
Social Media Analytics using Amazon QuickSight - AWS Online Tech TalksAmazon Web Services
 
Agile Estimating & Planning by Amaad Qureshi
Agile Estimating & Planning by Amaad QureshiAgile Estimating & Planning by Amaad Qureshi
Agile Estimating & Planning by Amaad QureshiAmaad Qureshi
 
Azure fundamentals-170910113238
Azure fundamentals-170910113238Azure fundamentals-170910113238
Azure fundamentals-170910113238ScottSmith574468
 
Getting Started with Amazon QuickSight
Getting Started with Amazon QuickSightGetting Started with Amazon QuickSight
Getting Started with Amazon QuickSightAmazon Web Services
 
Research in Cloud Computing
Research in Cloud ComputingResearch in Cloud Computing
Research in Cloud ComputingRajshri Mohan
 
Storwize v5000
Storwize v5000Storwize v5000
Storwize v5000liembkeb
 
HubSpot Partner Program Launch Webinar
HubSpot Partner Program Launch WebinarHubSpot Partner Program Launch Webinar
HubSpot Partner Program Launch WebinarHubSpot
 

What's hot (20)

Introduction to GCP (Google Cloud Platform)
Introduction to GCP (Google Cloud Platform)Introduction to GCP (Google Cloud Platform)
Introduction to GCP (Google Cloud Platform)
 
はじめよう FinOps クラウドコスト最適化への第一歩とは 日本IBMカスタマーサクセスチーム
はじめよう FinOps クラウドコスト最適化への第一歩とは 日本IBMカスタマーサクセスチームはじめよう FinOps クラウドコスト最適化への第一歩とは 日本IBMカスタマーサクセスチーム
はじめよう FinOps クラウドコスト最適化への第一歩とは 日本IBMカスタマーサクセスチーム
 
All about paas_iaas_saas_29.01.2015
All about paas_iaas_saas_29.01.2015All about paas_iaas_saas_29.01.2015
All about paas_iaas_saas_29.01.2015
 
Data librarian: helping researchers with data issues
Data librarian: helping researchers with data issues Data librarian: helping researchers with data issues
Data librarian: helping researchers with data issues
 
How Delft University's Engineering Students Make Their EV Formula-Style Race ...
How Delft University's Engineering Students Make Their EV Formula-Style Race ...How Delft University's Engineering Students Make Their EV Formula-Style Race ...
How Delft University's Engineering Students Make Their EV Formula-Style Race ...
 
Social Media Analytics using Amazon QuickSight - AWS Online Tech Talks
Social Media Analytics using Amazon QuickSight - AWS Online Tech TalksSocial Media Analytics using Amazon QuickSight - AWS Online Tech Talks
Social Media Analytics using Amazon QuickSight - AWS Online Tech Talks
 
Agile Estimating & Planning by Amaad Qureshi
Agile Estimating & Planning by Amaad QureshiAgile Estimating & Planning by Amaad Qureshi
Agile Estimating & Planning by Amaad Qureshi
 
Azure dev ops
Azure dev opsAzure dev ops
Azure dev ops
 
Azure fundamentals-170910113238
Azure fundamentals-170910113238Azure fundamentals-170910113238
Azure fundamentals-170910113238
 
DevOps seminar ppt
DevOps seminar ppt DevOps seminar ppt
DevOps seminar ppt
 
Getting Started with Amazon QuickSight
Getting Started with Amazon QuickSightGetting Started with Amazon QuickSight
Getting Started with Amazon QuickSight
 
Research in Cloud Computing
Research in Cloud ComputingResearch in Cloud Computing
Research in Cloud Computing
 
Cloud Computing with AWS & Other Cloud Platforms
Cloud Computing with AWS & Other Cloud PlatformsCloud Computing with AWS & Other Cloud Platforms
Cloud Computing with AWS & Other Cloud Platforms
 
Storwize v5000
Storwize v5000Storwize v5000
Storwize v5000
 
Azure Stack Overview
Azure Stack OverviewAzure Stack Overview
Azure Stack Overview
 
cloud computing ppt
cloud computing pptcloud computing ppt
cloud computing ppt
 
Microsoft azure
Microsoft azureMicrosoft azure
Microsoft azure
 
Whoop Company Presentation
Whoop Company PresentationWhoop Company Presentation
Whoop Company Presentation
 
HubSpot Partner Program Launch Webinar
HubSpot Partner Program Launch WebinarHubSpot Partner Program Launch Webinar
HubSpot Partner Program Launch Webinar
 
Cloud Computing Basics
Cloud Computing BasicsCloud Computing Basics
Cloud Computing Basics
 

Similar to The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Council Austin 2024

The Future of Data Science
The Future of Data ScienceThe Future of Data Science
The Future of Data ScienceDataWorks Summit
 
Tools and techniques for data science
Tools and techniques for data scienceTools and techniques for data science
Tools and techniques for data scienceAjay Ohri
 
IRJET- A Workflow Management System for Scalable Data Mining on Clouds
IRJET- A Workflow Management System for Scalable Data Mining on CloudsIRJET- A Workflow Management System for Scalable Data Mining on Clouds
IRJET- A Workflow Management System for Scalable Data Mining on CloudsIRJET Journal
 
Enabling Data centric Teams
Enabling Data centric TeamsEnabling Data centric Teams
Enabling Data centric TeamsData Con LA
 
Coding‌ ‌Software‌ ‌and‌ ‌Tools‌ ‌used‌ ‌for‌ ‌Data‌ ‌Science‌ ‌Management‌ ‌...
Coding‌ ‌Software‌ ‌and‌ ‌Tools‌ ‌used‌ ‌for‌ ‌Data‌ ‌Science‌ ‌Management‌ ‌...Coding‌ ‌Software‌ ‌and‌ ‌Tools‌ ‌used‌ ‌for‌ ‌Data‌ ‌Science‌ ‌Management‌ ‌...
Coding‌ ‌Software‌ ‌and‌ ‌Tools‌ ‌used‌ ‌for‌ ‌Data‌ ‌Science‌ ‌Management‌ ‌...phdAssistance1
 
Coding software and tools used for data science management - Phdassistance
Coding software and tools used for data science management - PhdassistanceCoding software and tools used for data science management - Phdassistance
Coding software and tools used for data science management - PhdassistancephdAssistance1
 
2019 DSA 105 Introduction to Data Science Week 4
2019 DSA 105 Introduction to Data Science Week 42019 DSA 105 Introduction to Data Science Week 4
2019 DSA 105 Introduction to Data Science Week 4Ferdin Joe John Joseph PhD
 
Making advanced analytics accessible to more companies
Making advanced analytics accessible to more companiesMaking advanced analytics accessible to more companies
Making advanced analytics accessible to more companiesMárton Kodok
 
MSPInspire - Azure ML and Power BI
MSPInspire - Azure ML and Power BIMSPInspire - Azure ML and Power BI
MSPInspire - Azure ML and Power BIOrlando Mariano
 
Consulting Profile_Victor_Torres_2016-VVCS
Consulting Profile_Victor_Torres_2016-VVCSConsulting Profile_Victor_Torres_2016-VVCS
Consulting Profile_Victor_Torres_2016-VVCSVictor M Torres
 
SKOS as the focal point of linked data strategies
SKOS as the focal point of linked data strategiesSKOS as the focal point of linked data strategies
SKOS as the focal point of linked data strategiesSemantic Web Company
 
Top 10 Data analytics tools to look for in 2021
Top 10 Data analytics tools to look for in 2021Top 10 Data analytics tools to look for in 2021
Top 10 Data analytics tools to look for in 2021Mobcoder
 
How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...
How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...
How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...Denodo
 
Enterprise linked data - open or closed, Andreas Blumauer, Keynote SMWCon 2014
Enterprise linked data - open or closed, Andreas Blumauer, Keynote SMWCon 2014Enterprise linked data - open or closed, Andreas Blumauer, Keynote SMWCon 2014
Enterprise linked data - open or closed, Andreas Blumauer, Keynote SMWCon 2014KDZ - Zentrum für Verwaltungsforschung
 
5 years of Dataverse evolution
5 years of Dataverse evolution 5 years of Dataverse evolution
5 years of Dataverse evolution vty
 
DDDP 2019 - Brown to Green
DDDP 2019  - Brown to GreenDDDP 2019  - Brown to Green
DDDP 2019 - Brown to GreenJohn Archer
 
IRJET- Cost Effective Workflow Scheduling in Bigdata
IRJET-  	  Cost Effective Workflow Scheduling in BigdataIRJET-  	  Cost Effective Workflow Scheduling in Bigdata
IRJET- Cost Effective Workflow Scheduling in BigdataIRJET Journal
 
It Consulting & Services - Black Basil Technologies
It Consulting & Services  - Black Basil TechnologiesIt Consulting & Services  - Black Basil Technologies
It Consulting & Services - Black Basil TechnologiesBlack Basil Technologies
 
Introduction to Data Science - Week 4 - Tools and Technologies in Data Science
Introduction to Data Science - Week 4 - Tools and Technologies in Data ScienceIntroduction to Data Science - Week 4 - Tools and Technologies in Data Science
Introduction to Data Science - Week 4 - Tools and Technologies in Data ScienceFerdin Joe John Joseph PhD
 

Similar to The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Council Austin 2024 (20)

The Future of Data Science
The Future of Data ScienceThe Future of Data Science
The Future of Data Science
 
Tools and techniques for data science
Tools and techniques for data scienceTools and techniques for data science
Tools and techniques for data science
 
IRJET- A Workflow Management System for Scalable Data Mining on Clouds
IRJET- A Workflow Management System for Scalable Data Mining on CloudsIRJET- A Workflow Management System for Scalable Data Mining on Clouds
IRJET- A Workflow Management System for Scalable Data Mining on Clouds
 
Enabling Data centric Teams
Enabling Data centric TeamsEnabling Data centric Teams
Enabling Data centric Teams
 
Coding‌ ‌Software‌ ‌and‌ ‌Tools‌ ‌used‌ ‌for‌ ‌Data‌ ‌Science‌ ‌Management‌ ‌...
Coding‌ ‌Software‌ ‌and‌ ‌Tools‌ ‌used‌ ‌for‌ ‌Data‌ ‌Science‌ ‌Management‌ ‌...Coding‌ ‌Software‌ ‌and‌ ‌Tools‌ ‌used‌ ‌for‌ ‌Data‌ ‌Science‌ ‌Management‌ ‌...
Coding‌ ‌Software‌ ‌and‌ ‌Tools‌ ‌used‌ ‌for‌ ‌Data‌ ‌Science‌ ‌Management‌ ‌...
 
Coding software and tools used for data science management - Phdassistance
Coding software and tools used for data science management - PhdassistanceCoding software and tools used for data science management - Phdassistance
Coding software and tools used for data science management - Phdassistance
 
2019 DSA 105 Introduction to Data Science Week 4
2019 DSA 105 Introduction to Data Science Week 42019 DSA 105 Introduction to Data Science Week 4
2019 DSA 105 Introduction to Data Science Week 4
 
Making advanced analytics accessible to more companies
Making advanced analytics accessible to more companiesMaking advanced analytics accessible to more companies
Making advanced analytics accessible to more companies
 
MSPInspire - Azure ML and Power BI
MSPInspire - Azure ML and Power BIMSPInspire - Azure ML and Power BI
MSPInspire - Azure ML and Power BI
 
Consulting Profile_Victor_Torres_2016-VVCS
Consulting Profile_Victor_Torres_2016-VVCSConsulting Profile_Victor_Torres_2016-VVCS
Consulting Profile_Victor_Torres_2016-VVCS
 
SKOS as the focal point of linked data strategies
SKOS as the focal point of linked data strategiesSKOS as the focal point of linked data strategies
SKOS as the focal point of linked data strategies
 
Abhishek jaiswal
Abhishek jaiswalAbhishek jaiswal
Abhishek jaiswal
 
Top 10 Data analytics tools to look for in 2021
Top 10 Data analytics tools to look for in 2021Top 10 Data analytics tools to look for in 2021
Top 10 Data analytics tools to look for in 2021
 
How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...
How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...
How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...
 
Enterprise linked data - open or closed, Andreas Blumauer, Keynote SMWCon 2014
Enterprise linked data - open or closed, Andreas Blumauer, Keynote SMWCon 2014Enterprise linked data - open or closed, Andreas Blumauer, Keynote SMWCon 2014
Enterprise linked data - open or closed, Andreas Blumauer, Keynote SMWCon 2014
 
5 years of Dataverse evolution
5 years of Dataverse evolution 5 years of Dataverse evolution
5 years of Dataverse evolution
 
DDDP 2019 - Brown to Green
DDDP 2019  - Brown to GreenDDDP 2019  - Brown to Green
DDDP 2019 - Brown to Green
 
IRJET- Cost Effective Workflow Scheduling in Bigdata
IRJET-  	  Cost Effective Workflow Scheduling in BigdataIRJET-  	  Cost Effective Workflow Scheduling in Bigdata
IRJET- Cost Effective Workflow Scheduling in Bigdata
 
It Consulting & Services - Black Basil Technologies
It Consulting & Services  - Black Basil TechnologiesIt Consulting & Services  - Black Basil Technologies
It Consulting & Services - Black Basil Technologies
 
Introduction to Data Science - Week 4 - Tools and Technologies in Data Science
Introduction to Data Science - Week 4 - Tools and Technologies in Data ScienceIntroduction to Data Science - Week 4 - Tools and Technologies in Data Science
Introduction to Data Science - Week 4 - Tools and Technologies in Data Science
 

More from Wes McKinney

Solving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowSolving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowWes McKinney
 
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise NecessityApache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise NecessityWes McKinney
 
Apache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data FrameworkApache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data FrameworkWes McKinney
 
New Directions for Apache Arrow
New Directions for Apache ArrowNew Directions for Apache Arrow
New Directions for Apache ArrowWes McKinney
 
Apache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data TransportApache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data TransportWes McKinney
 
ACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data FramesACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data FramesWes McKinney
 
Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020Wes McKinney
 
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future Wes McKinney
 
Apache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics StackApache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics StackWes McKinney
 
Apache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS SessionApache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS SessionWes McKinney
 
Apache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science StackApache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science StackWes McKinney
 
Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019Wes McKinney
 
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"Wes McKinney
 
Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018Wes McKinney
 
Apache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory DataApache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory DataWes McKinney
 
Apache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory dataApache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory dataWes McKinney
 
Shared Infrastructure for Data Science
Shared Infrastructure for Data ScienceShared Infrastructure for Data Science
Shared Infrastructure for Data ScienceWes McKinney
 
Data Science Without Borders (JupyterCon 2017)
Data Science Without Borders (JupyterCon 2017)Data Science Without Borders (JupyterCon 2017)
Data Science Without Borders (JupyterCon 2017)Wes McKinney
 
Memory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine LearningMemory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine LearningWes McKinney
 
Raising the Tides: Open Source Analytics for Data Science
Raising the Tides: Open Source Analytics for Data ScienceRaising the Tides: Open Source Analytics for Data Science
Raising the Tides: Open Source Analytics for Data ScienceWes McKinney
 

More from Wes McKinney (20)

Solving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowSolving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache Arrow
 
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise NecessityApache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
 
Apache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data FrameworkApache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data Framework
 
New Directions for Apache Arrow
New Directions for Apache ArrowNew Directions for Apache Arrow
New Directions for Apache Arrow
 
Apache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data TransportApache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data Transport
 
ACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data FramesACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data Frames
 
Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020
 
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
 
Apache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics StackApache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics Stack
 
Apache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS SessionApache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS Session
 
Apache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science StackApache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science Stack
 
Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019
 
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
 
Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018
 
Apache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory DataApache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory Data
 
Apache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory dataApache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory data
 
Shared Infrastructure for Data Science
Shared Infrastructure for Data ScienceShared Infrastructure for Data Science
Shared Infrastructure for Data Science
 
Data Science Without Borders (JupyterCon 2017)
Data Science Without Borders (JupyterCon 2017)Data Science Without Borders (JupyterCon 2017)
Data Science Without Borders (JupyterCon 2017)
 
Memory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine LearningMemory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine Learning
 
Raising the Tides: Open Source Analytics for Data Science
Raising the Tides: Open Source Analytics for Data ScienceRaising the Tides: Open Source Analytics for Data Science
Raising the Tides: Open Source Analytics for Data Science
 

Recently uploaded

Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxnull - The Open Security Community
 
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your BudgetHyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your BudgetEnjoy Anytime
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAndikSusilo4
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphNeo4j
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraDeakin University
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 

Recently uploaded (20)

Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
 
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your BudgetHyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & Application
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptxVulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning era
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 

The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Council Austin 2024

  • 1. The Future Roadmap for the Composable Data Stack Wes McKinney Data Council Austin 2024
  • 2. ⌵ ⌵ Agenda ● My background ● Posit and Me ● Composable data systems: an overview ● Active growth areas in 2024 ● Opportunities and Predictions
  • 4. ⌵ ⌵ Voltron Data (2021 — ) ● Unlocking the potential of GPU accelerated analytics for large scale workloads ● Enterprise support for Apache Arrow ● Open source partnerships (Meta, Snowflake, others) ● $115M raised, 130 headcount and growing
  • 6. ⌵ ⌵ Posit PBC ● Founded 2009, originally as RStudio ● 300-person, remote-first company ● Open source software for code-first data science and technical communication ● Certified B corp, no plans to go public or be acquired ● Designing for long-term resiliency, creating a 100-year company
  • 7. ⌵ ⌵ How Posit Makes Money ● Making open source work in the enterprise ● Products for managing the data science lifecycle ○ Posit Workbench: Managed IDEs and Jupyter Notebooks with enterprise-grade security & compliance ○ Posit Connect: Publish and share interactive data apps, notebooks, and reports ○ Posit Package Manager: securely manage deploy open source and proprietary Python and R packages
  • 8. ⌵ ⌵ My History with Posit ● 2016: Hadley Wickham and I make Feather file format ● 2018: I partner with RStudio to create non-profit Ursa Labs for Arrow development ● 2020: Ursa team spins out into startup, becomes Voltron Data in 2021 ● 2022: RStudio becomes Posit, embracing polyglot future ● 2023: I rejoin Posit as a principal architect
  • 9. ⌵ ⌵ https://ursalabs.org/blog/announcing-ursalabs/ The reality is that Hadley and I think the “language wars” are stupid when the real problem we are solving is human user interface design for data analysis. … The programming languages are our medium for crafting accessible and productive tools. It has long been a frustration of mine that it isn’t easier to share code and systems between R and Python. … I found that we share a passion for the long-term vision of empowering data scientists and building a positive relationship with the open source user community. Critically, RStudio has avoided the “startup trap” and managed to build a sustainable business while still investing the vast majority of its engineering resources in open source development.
  • 10. ⌵ ⌵ https://ursalabs.org/blog/announcing-ursalabs/ The reality is that Hadley and I think the “language wars” are stupid when the real problem we are solving is human user interface design for data analysis. … The programming languages are our medium for crafting accessible and productive tools. It has long been a frustration of mine that it isn’t easier to share code and systems between R and Python. … I found that we share a passion for the long-term vision of empowering data scientists and building a positive relationship with the open source user community. Critically, RStudio has avoided the “startup trap” and managed to build a sustainable business while still investing the vast majority of its engineering resources in open source development.
  • 11. ⌵ ⌵ Creative Solutions for Funding OSS Work Sponsors
  • 13. ⌵ ⌵ What is a “Composable Data System”? ● Builds with open standards and protocols ● Designed around modularity, reuse, and interoperability with other systems that use shared interfaces ● Resists “vertical integration”, builds in a virtual cycle with relevant open source ecosystem projects
  • 14. ⌵ ⌵ “We envision that by decomposing data management systems into a more modular stack of reusable components, the development of new engines can be streamlined, while reducing maintenance costs and ultimately providing a more consistent user experience. By clearly outlining APIs and encapsulating responsibilities, data management software could more easily be adapted, for example, to leverage novel devices and accelerators, as the underlying hardware evolves. By relying on a modular stack that reuses execution engine and language frontend, data systems code could provide a more consistent experience and semantics to users, from transactional to analytic systems, from stream processing to machine learning workloads." Pedrera et al. 2023
  • 15. ⌵ ⌵ “We envision that by decomposing data management systems into a more modular stack of reusable components, the development of new engines can be streamlined, while reducing maintenance costs and ultimately providing a more consistent user experience. By clearly outlining APIs and encapsulating responsibilities, data management software could more easily be adapted, for example, to leverage novel devices and accelerators, as the underlying hardware evolves. By relying on a modular stack that reuses execution engine and language frontend, data systems code could provide a more consistent experience and semantics to users, from transactional to analytic systems, from stream processing to machine learning workloads." Pedrera et al. 2023
  • 16. ⌵ ⌵ “We envision that by decomposing data management systems into a more modular stack of reusable components, the development of new engines can be streamlined, while reducing maintenance costs and ultimately providing a more consistent user experience. By clearly outlining APIs and encapsulating responsibilities, data management software could more easily be adapted, for example, to leverage novel devices and accelerators, as the underlying hardware evolves. By relying on a modular stack that reuses execution engine and language frontend, data systems code could provide a more consistent experience and semantics to users, from transactional to analytic systems, from stream processing to machine learning workloads." Pedrera et al. 2023
  • 17. ⌵ ⌵ Why now? ● 1st-Gen Big Data / Hadoop Era: 2006 - 2013 ○ MapReduce popularizes disaggregated storage + compute ● 2nd-Gen 2013 - 2021 ○ Vendors shift from proprietary software to delivery of services ○ Open source standards emerge and are popularized ○ Rapid progress in storage, networking, computing performance ● 3rd-Gen 2021 - … ? ○ Open standards for composability become widely accepted ○ Next-gen components emerge, existing systems start retrofitting with composable pieces
  • 18. ⌵ ⌵ Why now? “...We foresee that composability is soon to cause another major disruption to how data management systems are designed. We foresee that monolithic systems will become obsolete, and give space to a new composable era for data management.” Pedrera et al. 2023
  • 19. ⌵ ⌵ NYC R Conference 2015 https://www.youtube.com/watch?v=stlxbC7uIzM&t=264s
  • 21. ⌵ ⌵ “The published work on big data systems… detail[s] their systems’ impressive scalability, [but] few directly evaluate their absolute performance against reasonable benchmarks. To what degree are these systems truly improving performance, as opposed to parallelizing overheads that they themselves introduce?” - McSherry, Isard, Murray 2015
  • 22. ⌵ ⌵ Glimpse of the Composable Landscape Engines Substrait ADBC, Flight Storage Protocols Query Interface Optimization …?
  • 23. ⌵ ⌵ Missing pieces ● Distributed computing (Spark, Dask, Ray, “Serverless”) ● ETL modeling (dbt and others) ● Workflow / Infra Orchestration (Airflow, Dagster, Flyte, …) ● Development Environments ● Metadata Catalogues
  • 24. ⌵ ⌵ 2016: A Cross-Language Fast In-Memory Data Format
  • 27. ⌵ ⌵ Databases Equipped for the Composable Era ● ADBC : Arrow DataBase Connectivity ● FlightSQL: An Arrow-native wire protocol for SQL systems
  • 30. ⌵ ⌵ Layers of the Composable Cake
  • 31. ⌵ ⌵ Modular Execution Engines ● Apache DataFusion (Rust) ● DuckDB (C++) ● Velox (C++) ● Theseus (C++ / CUDA)
  • 32. ⌵ ⌵ Connecting Execution with Frontend + Optimizer ● Intermediate Representation (IR) for Queries Substrait https://substrait.io/
  • 33. ⌵ ⌵ Modular Acceleration Projects ● Prestissimo (Velox in Presto) ● Apache DataFusion Comet (DataFusion in Spark) ● Apache Gluten (incubating) (Velox in Spark)
  • 34. ⌵ ⌵ A Multi-Engine Data Stack? ● Why be locked into one full-stack execution engine (like Spark)? ● Cost/performance/latency varies greatly across workload shapes and sizes ● You can do a lot with DuckDB, but actual big data does still exist HT to Julien Hurault ! https://juhache.substack.com/p/multi-engine-data-stack-v0
  • 36. ⌵ ⌵ Barriers to a multi-engine data stack ● SQL dialects are non-portable and feature a wide spectrum of supported features ○ sqlglot is helping with this! ● Deciding which engine-to-use is non-trivial ● Diverse orchestration and infrastructure requirements
  • 39. ⌵ ⌵ Enter the Bird ● An effort to harmonize the best of modern SQL with Pythonic fluent data frames ● Fixing a long list of shortcomings in the pandas API ● Leverage the benefits of a modern PL when creating complex analytical SQL queries ● Bringing portability (> 20 backends supported) to provide a unified Python API for a multi-engine data stack ibis-project.org
  • 41. ⌵ ⌵ Other Ibis Notes ● Created in 2015 at the same time as Arrow ● Strong focus on deep SQL support ● Facilitates dev to prod with few code changes ● Integration with many third party libraries (streamlit, altair, others) ● Helps automating the drudgery of tedious multi-step DDL workflows in databases
  • 42. ⌵ ⌵ Why this matters https://voltrondata.com/theseus
  • 43. ⌵ ⌵ A Future Multi-Engine Stack ● An execution engine tailored for different data scales ○ <= 1TB: DuckDB and Friends ○ 1 - 10TB: Spark, Dask, Ray, etc. ○ > 10TB: Hardware-accelerated Processing (e.g. Theseus) ● A portable language front end (Ibis, Malloy, PRQL, or a standard transpilable SQL) ● Arrow-native API and wire transport