SlideShare a Scribd company logo
1 of 28
Download to read offline
Apache Iceberg
Ryan Blue
2019 Big Data Orchestration Summit
Netflix’s Data Warehouse
â—Ź Smarter processing engines
â—‹ CBO, better join implementations
â—‹ Result set caching, materialized views
â—Ź Reduce manual data maintenance
â—‹ Data librarian services
â—‹ Declarative instead of imperative
5-year Challenges
â—Ź Unsafe operations are everywhere
â—‹ Writing to multiple partitions
â—‹ Renaming a column
â—Ź Interaction with object stores causes major headaches
â—‹ Eventual consistency to performance problems
○ Output committers can’t fix it
â—Ź Endless scale challenges
Problem Whack-a-mole
What is Iceberg?
Iceberg is a scalable format for
tables with a lot of best
practices built in.
A format?
We already have Parquet, Avro
and ORC . . .
A table format.
â—Ź File formats help you modify or skip data in a single file
â—Ź Table formats do the same thing for a collection of files
â—Ź To demonstrate this, consider Hive tables . . .
A table format
â—Ź Key idea: organize data in a directory tree
date=20180513/
|- hour=18/
| |- ...
|- hour=19/
| |- part-000.parquet
| |- ...
| |- part-031.parquet
|- hour=20/
| |- ...
Hive Tables
â—Ź Filter: WHERE date = '20180513' AND hour = 19
date=20180513/
|- hour=18/
| |- ...
|- hour=19/
| |- part-000.parquet
| |- ...
| |- part-031.parquet
|- hour=20/
| |- ...
Hive Tables
â—Ź Problem: too much directory listing for large tables
â—Ź Solution: use HMS to track partitions
date=20180513/hour=19 -> hdfs:/.../date=20180513/hour=19
date=20180513/hour=20 -> hdfs:/.../date=20180513/hour=20
â—Ź The file system still tracks the files in each partition . . .
Hive Metastore
â—Ź State is kept in both the metastore and in a file system
â—Ź Changes are not atomic without locking
â—Ź Requires directory listing
â—‹ O(n) listing calls, n = # matching partitions
â—‹ Eventual consistency breaks correctness
Hive Tables: Problems
â—Ź Everything supports Hive tables*
â—‹ Engines: Hive, Spark, Presto, Flink, Pig
â—‹ Tools: Hudi, NiFi, Flume, Sqoop
â—Ź Simplicity and ubiquity have made Hive tables indispensable
â—Ź The whole ecosystem uses the same at-rest data!
Hive Tables: Benefits
Iceberg
â—Ź An open spec and community for at-rest data interchange
â—‹ Maintain a clear spec for the format
â—‹ Design for multiple implementations across languages
â—‹ Support needs across projects to avoid fragmentation
Iceberg’s Goals
â—Ź Improve scale and reliability
â—‹ Work on a single node, scale to a cluster
â—‹ All changes are atomic, with serializable isolation
â—‹ Native support for cloud object stores
â—‹ Support many concurrent writers
Iceberg’s Goals
â—Ź Fix persistent usability problems
â—‹ In-place evolution for schema and layout (no side-effects)
â—‹ Hide partitioning: insulate queries from physical layout
â—‹ Support time-travel, rollback, and metadata inspection
â—‹ Configure tables, not jobs
â—Ź Tables should have no unpleasant surprises
Iceberg’s Goals
â—Ź Key idea: track all files in a table over time
â—‹ A snapshot is a complete list of files in a table
â—‹ Each write produces and commits a new snapshot
Iceberg’s Design
S1 S2 S3 ...
S1 S2 S3 ...
R W
â—Ź Readers use the current snapshot
â—Ź Writers optimistically create new snapshots, then commit
Iceberg’s Design
In reality, it’s a bit more
complicated.
â—Ź All changes are atomic
â—Ź No expensive (or inconsistent) file system operations
â—Ź Snapshots are indexed for scan planning on a single node
â—Ź CBO metrics are reliable
â—Ź Versions for incremental updates and materialized views
Iceberg Design Benefits
Iceberg at Netflix
â—Ź Production tables: tens of petabytes, millions of partitions
â—‹ Scan planning fits on a single node
â—‹ Advanced filtering enables more use cases
â—‹ Overall performance is better
â—Ź Low latency queries are faster for large tables
Scale
â—Ź Production Flink pipeline writing in 3 AWS regions
â—Ź Lift service moving data into a single region
â—Ź Merge service compacting small files
Concurrency
â—Ź Rollback is popular
â—Ź Metadata tables
â—‹ Track down the version a job read
â—‹ Find the process that wrote a bad version
Usability
â—Ź Spark vectorization for faster bulk reads
â—‹ Presto vectorization already done
â—Ź Row-level delete encodings
â—‹ MERGE INTO
â—‹ ID equality predicates
Future Work
Thank you!
Questions?
Ryan Blue
rblue@netflix.com

More Related Content

What's hot

Data Engineer's Lunch #83: Strategies for Migration to Apache Iceberg
Data Engineer's Lunch #83: Strategies for Migration to Apache IcebergData Engineer's Lunch #83: Strategies for Migration to Apache Iceberg
Data Engineer's Lunch #83: Strategies for Migration to Apache IcebergAnant Corporation
 
How We Optimize Spark SQL Jobs With parallel and sync IO
How We Optimize Spark SQL Jobs With parallel and sync IOHow We Optimize Spark SQL Jobs With parallel and sync IO
How We Optimize Spark SQL Jobs With parallel and sync IODatabricks
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiDatabricks
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudNoritaka Sekiyama
 
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in SparkSpark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in SparkBo Yang
 
Presto Summit 2018 - 09 - Netflix Iceberg
Presto Summit 2018  - 09 - Netflix IcebergPresto Summit 2018  - 09 - Netflix Iceberg
Presto Summit 2018 - 09 - Netflix Icebergkbajda
 
Spark shuffle introduction
Spark shuffle introductionSpark shuffle introduction
Spark shuffle introductioncolorant
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveDataWorks Summit
 
The Apache Spark File Format Ecosystem
The Apache Spark File Format EcosystemThe Apache Spark File Format Ecosystem
The Apache Spark File Format EcosystemDatabricks
 
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark Summit
 
Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0Cloudera, Inc.
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guideRyan Blue
 
Cosco: An Efficient Facebook-Scale Shuffle Service
Cosco: An Efficient Facebook-Scale Shuffle ServiceCosco: An Efficient Facebook-Scale Shuffle Service
Cosco: An Efficient Facebook-Scale Shuffle ServiceDatabricks
 
Performance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaPerformance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaCloudera, Inc.
 
Apache Hudi: The Path Forward
Apache Hudi: The Path ForwardApache Hudi: The Path Forward
Apache Hudi: The Path ForwardAlluxio, Inc.
 
Building large scale transactional data lake using apache hudi
Building large scale transactional data lake using apache hudiBuilding large scale transactional data lake using apache hudi
Building large scale transactional data lake using apache hudiBill Liu
 
Hudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesHudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesNishith Agarwal
 
Fine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark JobsFine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark JobsDatabricks
 
Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsAnton Kirillov
 
Distributed Databases Deconstructed: CockroachDB, TiDB and YugaByte DB
Distributed Databases Deconstructed: CockroachDB, TiDB and YugaByte DBDistributed Databases Deconstructed: CockroachDB, TiDB and YugaByte DB
Distributed Databases Deconstructed: CockroachDB, TiDB and YugaByte DBYugabyteDB
 

What's hot (20)

Data Engineer's Lunch #83: Strategies for Migration to Apache Iceberg
Data Engineer's Lunch #83: Strategies for Migration to Apache IcebergData Engineer's Lunch #83: Strategies for Migration to Apache Iceberg
Data Engineer's Lunch #83: Strategies for Migration to Apache Iceberg
 
How We Optimize Spark SQL Jobs With parallel and sync IO
How We Optimize Spark SQL Jobs With parallel and sync IOHow We Optimize Spark SQL Jobs With parallel and sync IO
How We Optimize Spark SQL Jobs With parallel and sync IO
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
 
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in SparkSpark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
 
Presto Summit 2018 - 09 - Netflix Iceberg
Presto Summit 2018  - 09 - Netflix IcebergPresto Summit 2018  - 09 - Netflix Iceberg
Presto Summit 2018 - 09 - Netflix Iceberg
 
Spark shuffle introduction
Spark shuffle introductionSpark shuffle introduction
Spark shuffle introduction
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep Dive
 
The Apache Spark File Format Ecosystem
The Apache Spark File Format EcosystemThe Apache Spark File Format Ecosystem
The Apache Spark File Format Ecosystem
 
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
 
Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide
 
Cosco: An Efficient Facebook-Scale Shuffle Service
Cosco: An Efficient Facebook-Scale Shuffle ServiceCosco: An Efficient Facebook-Scale Shuffle Service
Cosco: An Efficient Facebook-Scale Shuffle Service
 
Performance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaPerformance Optimizations in Apache Impala
Performance Optimizations in Apache Impala
 
Apache Hudi: The Path Forward
Apache Hudi: The Path ForwardApache Hudi: The Path Forward
Apache Hudi: The Path Forward
 
Building large scale transactional data lake using apache hudi
Building large scale transactional data lake using apache hudiBuilding large scale transactional data lake using apache hudi
Building large scale transactional data lake using apache hudi
 
Hudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesHudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilities
 
Fine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark JobsFine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark Jobs
 
Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & Internals
 
Distributed Databases Deconstructed: CockroachDB, TiDB and YugaByte DB
Distributed Databases Deconstructed: CockroachDB, TiDB and YugaByte DBDistributed Databases Deconstructed: CockroachDB, TiDB and YugaByte DB
Distributed Databases Deconstructed: CockroachDB, TiDB and YugaByte DB
 

Similar to Apache Iceberg - A Table Format for Hige Analytic Datasets

The evolution of Netflix's S3 data warehouse (Strata NY 2018)
The evolution of Netflix's S3 data warehouse (Strata NY 2018)The evolution of Netflix's S3 data warehouse (Strata NY 2018)
The evolution of Netflix's S3 data warehouse (Strata NY 2018)Ryan Blue
 
Batch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergBatch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergFlink Forward
 
week1slides1704202828322.pdf
week1slides1704202828322.pdfweek1slides1704202828322.pdf
week1slides1704202828322.pdfTusharAgarwal49094
 
Change data capture
Change data captureChange data capture
Change data captureRon Barabash
 
Hadoop 3 @ Hadoop Summit San Jose 2017
Hadoop 3 @ Hadoop Summit San Jose 2017Hadoop 3 @ Hadoop Summit San Jose 2017
Hadoop 3 @ Hadoop Summit San Jose 2017Junping Du
 
Apache Hadoop 3.0 Community Update
Apache Hadoop 3.0 Community UpdateApache Hadoop 3.0 Community Update
Apache Hadoop 3.0 Community UpdateDataWorks Summit
 
Fluent Bit: Log Forwarding at Scale
Fluent Bit: Log Forwarding at ScaleFluent Bit: Log Forwarding at Scale
Fluent Bit: Log Forwarding at ScaleEduardo Silva Pereira
 
A Day in the Life of a Druid Implementor and Druid's Roadmap
A Day in the Life of a Druid Implementor and Druid's RoadmapA Day in the Life of a Druid Implementor and Druid's Roadmap
A Day in the Life of a Druid Implementor and Druid's RoadmapItai Yaffe
 
Spark Meetup at Uber
Spark Meetup at UberSpark Meetup at Uber
Spark Meetup at UberDatabricks
 
Data Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFixData Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFixC4Media
 
Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2aspyker
 
Understanding Hadoop
Understanding HadoopUnderstanding Hadoop
Understanding HadoopAhmed Ossama
 
How to Develop and Operate Cloud First Data Platforms
How to Develop and Operate Cloud First Data PlatformsHow to Develop and Operate Cloud First Data Platforms
How to Develop and Operate Cloud First Data PlatformsAlluxio, Inc.
 
ApacheCon 2022_ Large scale unification of file format.pptx
ApacheCon 2022_ Large scale unification of file format.pptxApacheCon 2022_ Large scale unification of file format.pptx
ApacheCon 2022_ Large scale unification of file format.pptxXinliShang1
 
AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned Omid Vahdaty
 
Type safe, versioned, and rewindable stream processing with Apache {Avro, K...
Type safe, versioned, and rewindable stream processing  with  Apache {Avro, K...Type safe, versioned, and rewindable stream processing  with  Apache {Avro, K...
Type safe, versioned, and rewindable stream processing with Apache {Avro, K...Hisham Mardam-Bey
 
Netflix running Presto in the AWS Cloud
Netflix running Presto in the AWS CloudNetflix running Presto in the AWS Cloud
Netflix running Presto in the AWS CloudZhenxiao Luo
 
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB AtlasMongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB AtlasMongoDB
 
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | EnglishAWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | EnglishOmid Vahdaty
 

Similar to Apache Iceberg - A Table Format for Hige Analytic Datasets (20)

The evolution of Netflix's S3 data warehouse (Strata NY 2018)
The evolution of Netflix's S3 data warehouse (Strata NY 2018)The evolution of Netflix's S3 data warehouse (Strata NY 2018)
The evolution of Netflix's S3 data warehouse (Strata NY 2018)
 
Batch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergBatch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & Iceberg
 
week1slides1704202828322.pdf
week1slides1704202828322.pdfweek1slides1704202828322.pdf
week1slides1704202828322.pdf
 
Change data capture
Change data captureChange data capture
Change data capture
 
Hadoop 3 @ Hadoop Summit San Jose 2017
Hadoop 3 @ Hadoop Summit San Jose 2017Hadoop 3 @ Hadoop Summit San Jose 2017
Hadoop 3 @ Hadoop Summit San Jose 2017
 
Apache Hadoop 3.0 Community Update
Apache Hadoop 3.0 Community UpdateApache Hadoop 3.0 Community Update
Apache Hadoop 3.0 Community Update
 
Fluent Bit: Log Forwarding at Scale
Fluent Bit: Log Forwarding at ScaleFluent Bit: Log Forwarding at Scale
Fluent Bit: Log Forwarding at Scale
 
A Day in the Life of a Druid Implementor and Druid's Roadmap
A Day in the Life of a Druid Implementor and Druid's RoadmapA Day in the Life of a Druid Implementor and Druid's Roadmap
A Day in the Life of a Druid Implementor and Druid's Roadmap
 
Spark Meetup at Uber
Spark Meetup at UberSpark Meetup at Uber
Spark Meetup at Uber
 
Data Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFixData Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFix
 
Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2
 
Understanding Hadoop
Understanding HadoopUnderstanding Hadoop
Understanding Hadoop
 
How to Develop and Operate Cloud First Data Platforms
How to Develop and Operate Cloud First Data PlatformsHow to Develop and Operate Cloud First Data Platforms
How to Develop and Operate Cloud First Data Platforms
 
ApacheCon 2022_ Large scale unification of file format.pptx
ApacheCon 2022_ Large scale unification of file format.pptxApacheCon 2022_ Large scale unification of file format.pptx
ApacheCon 2022_ Large scale unification of file format.pptx
 
AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned
 
Type safe, versioned, and rewindable stream processing with Apache {Avro, K...
Type safe, versioned, and rewindable stream processing  with  Apache {Avro, K...Type safe, versioned, and rewindable stream processing  with  Apache {Avro, K...
Type safe, versioned, and rewindable stream processing with Apache {Avro, K...
 
Netflix running Presto in the AWS Cloud
Netflix running Presto in the AWS CloudNetflix running Presto in the AWS Cloud
Netflix running Presto in the AWS Cloud
 
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB AtlasMongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
 
RubiX
RubiXRubiX
RubiX
 
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | EnglishAWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
 

More from Alluxio, Inc.

Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio, Inc.
 
Optimizing Data Access for Analytics And AI with Alluxio
Optimizing Data Access for Analytics And AI with AlluxioOptimizing Data Access for Analytics And AI with Alluxio
Optimizing Data Access for Analytics And AI with AlluxioAlluxio, Inc.
 
Speed Up Presto at Uber with Alluxio Caching
Speed Up Presto at Uber with Alluxio CachingSpeed Up Presto at Uber with Alluxio Caching
Speed Up Presto at Uber with Alluxio CachingAlluxio, Inc.
 
Correctly Loading Incremental Data at Scale
Correctly Loading Incremental Data at ScaleCorrectly Loading Incremental Data at Scale
Correctly Loading Incremental Data at ScaleAlluxio, Inc.
 
Big Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/ML
Big Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/MLBig Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/ML
Big Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/MLAlluxio, Inc.
 
Alluxio Monthly Webinar | Why a Multi-Cloud Strategy Matters for Your AI Plat...
Alluxio Monthly Webinar | Why a Multi-Cloud Strategy Matters for Your AI Plat...Alluxio Monthly Webinar | Why a Multi-Cloud Strategy Matters for Your AI Plat...
Alluxio Monthly Webinar | Why a Multi-Cloud Strategy Matters for Your AI Plat...Alluxio, Inc.
 
Alluxio Monthly Webinar | Five Disruptive Trends that Every Data & AI Leader...
Alluxio Monthly Webinar | Five Disruptive Trends that Every  Data & AI Leader...Alluxio Monthly Webinar | Five Disruptive Trends that Every  Data & AI Leader...
Alluxio Monthly Webinar | Five Disruptive Trends that Every Data & AI Leader...Alluxio, Inc.
 
Data Infra Meetup | FIFO Queues are All You Need for Cache Eviction
Data Infra Meetup | FIFO Queues are All You Need for Cache EvictionData Infra Meetup | FIFO Queues are All You Need for Cache Eviction
Data Infra Meetup | FIFO Queues are All You Need for Cache EvictionAlluxio, Inc.
 
Data Infra Meetup | Accelerate Your Trino/Presto Queries - Gain the Alluxio Edge
Data Infra Meetup | Accelerate Your Trino/Presto Queries - Gain the Alluxio EdgeData Infra Meetup | Accelerate Your Trino/Presto Queries - Gain the Alluxio Edge
Data Infra Meetup | Accelerate Your Trino/Presto Queries - Gain the Alluxio EdgeAlluxio, Inc.
 
Data Infra Meetup | Accelerate Distributed PyTorch/Ray Workloads in the Cloud
Data Infra Meetup | Accelerate Distributed PyTorch/Ray Workloads in the CloudData Infra Meetup | Accelerate Distributed PyTorch/Ray Workloads in the Cloud
Data Infra Meetup | Accelerate Distributed PyTorch/Ray Workloads in the CloudAlluxio, Inc.
 
Data Infra Meetup | ByteDance's Native Parquet Reader
Data Infra Meetup | ByteDance's Native Parquet ReaderData Infra Meetup | ByteDance's Native Parquet Reader
Data Infra Meetup | ByteDance's Native Parquet ReaderAlluxio, Inc.
 
Data Infra Meetup | Uber's Data Storage Evolution
Data Infra Meetup | Uber's Data Storage EvolutionData Infra Meetup | Uber's Data Storage Evolution
Data Infra Meetup | Uber's Data Storage EvolutionAlluxio, Inc.
 
Alluxio Monthly Webinar | Why NFS/NAS on Object Storage May Not Solve Your AI...
Alluxio Monthly Webinar | Why NFS/NAS on Object Storage May Not Solve Your AI...Alluxio Monthly Webinar | Why NFS/NAS on Object Storage May Not Solve Your AI...
Alluxio Monthly Webinar | Why NFS/NAS on Object Storage May Not Solve Your AI...Alluxio, Inc.
 
AI Infra Day | Accelerate Your Model Training and Serving with Distributed Ca...
AI Infra Day | Accelerate Your Model Training and Serving with Distributed Ca...AI Infra Day | Accelerate Your Model Training and Serving with Distributed Ca...
AI Infra Day | Accelerate Your Model Training and Serving with Distributed Ca...Alluxio, Inc.
 
AI Infra Day | The AI Infra in the Generative AI Era
AI Infra Day | The AI Infra in the Generative AI EraAI Infra Day | The AI Infra in the Generative AI Era
AI Infra Day | The AI Infra in the Generative AI EraAlluxio, Inc.
 
AI Infra Day | Hands-on Lab: CV Model Training with PyTorch & Alluxio on Kube...
AI Infra Day | Hands-on Lab: CV Model Training with PyTorch & Alluxio on Kube...AI Infra Day | Hands-on Lab: CV Model Training with PyTorch & Alluxio on Kube...
AI Infra Day | Hands-on Lab: CV Model Training with PyTorch & Alluxio on Kube...Alluxio, Inc.
 
AI Infra Day | The Generative AI Market And Intel AI Strategy and Product Up...
AI Infra Day | The Generative AI Market  And Intel AI Strategy and Product Up...AI Infra Day | The Generative AI Market  And Intel AI Strategy and Product Up...
AI Infra Day | The Generative AI Market And Intel AI Strategy and Product Up...Alluxio, Inc.
 
AI Infra Day | Composable PyTorch Distributed with PT2 @ Meta
AI Infra Day | Composable PyTorch Distributed with PT2 @ MetaAI Infra Day | Composable PyTorch Distributed with PT2 @ Meta
AI Infra Day | Composable PyTorch Distributed with PT2 @ MetaAlluxio, Inc.
 
AI Infra Day | Model Lifecycle Management Quality Assurance at Uber Scale
AI Infra Day | Model Lifecycle Management Quality Assurance at Uber ScaleAI Infra Day | Model Lifecycle Management Quality Assurance at Uber Scale
AI Infra Day | Model Lifecycle Management Quality Assurance at Uber ScaleAlluxio, Inc.
 
Alluxio Monthly Webinar | Efficient Data Loading for Model Training on AWS
Alluxio Monthly Webinar | Efficient Data Loading for Model Training on AWSAlluxio Monthly Webinar | Efficient Data Loading for Model Training on AWS
Alluxio Monthly Webinar | Efficient Data Loading for Model Training on AWSAlluxio, Inc.
 

More from Alluxio, Inc. (20)

Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
 
Optimizing Data Access for Analytics And AI with Alluxio
Optimizing Data Access for Analytics And AI with AlluxioOptimizing Data Access for Analytics And AI with Alluxio
Optimizing Data Access for Analytics And AI with Alluxio
 
Speed Up Presto at Uber with Alluxio Caching
Speed Up Presto at Uber with Alluxio CachingSpeed Up Presto at Uber with Alluxio Caching
Speed Up Presto at Uber with Alluxio Caching
 
Correctly Loading Incremental Data at Scale
Correctly Loading Incremental Data at ScaleCorrectly Loading Incremental Data at Scale
Correctly Loading Incremental Data at Scale
 
Big Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/ML
Big Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/MLBig Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/ML
Big Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/ML
 
Alluxio Monthly Webinar | Why a Multi-Cloud Strategy Matters for Your AI Plat...
Alluxio Monthly Webinar | Why a Multi-Cloud Strategy Matters for Your AI Plat...Alluxio Monthly Webinar | Why a Multi-Cloud Strategy Matters for Your AI Plat...
Alluxio Monthly Webinar | Why a Multi-Cloud Strategy Matters for Your AI Plat...
 
Alluxio Monthly Webinar | Five Disruptive Trends that Every Data & AI Leader...
Alluxio Monthly Webinar | Five Disruptive Trends that Every  Data & AI Leader...Alluxio Monthly Webinar | Five Disruptive Trends that Every  Data & AI Leader...
Alluxio Monthly Webinar | Five Disruptive Trends that Every Data & AI Leader...
 
Data Infra Meetup | FIFO Queues are All You Need for Cache Eviction
Data Infra Meetup | FIFO Queues are All You Need for Cache EvictionData Infra Meetup | FIFO Queues are All You Need for Cache Eviction
Data Infra Meetup | FIFO Queues are All You Need for Cache Eviction
 
Data Infra Meetup | Accelerate Your Trino/Presto Queries - Gain the Alluxio Edge
Data Infra Meetup | Accelerate Your Trino/Presto Queries - Gain the Alluxio EdgeData Infra Meetup | Accelerate Your Trino/Presto Queries - Gain the Alluxio Edge
Data Infra Meetup | Accelerate Your Trino/Presto Queries - Gain the Alluxio Edge
 
Data Infra Meetup | Accelerate Distributed PyTorch/Ray Workloads in the Cloud
Data Infra Meetup | Accelerate Distributed PyTorch/Ray Workloads in the CloudData Infra Meetup | Accelerate Distributed PyTorch/Ray Workloads in the Cloud
Data Infra Meetup | Accelerate Distributed PyTorch/Ray Workloads in the Cloud
 
Data Infra Meetup | ByteDance's Native Parquet Reader
Data Infra Meetup | ByteDance's Native Parquet ReaderData Infra Meetup | ByteDance's Native Parquet Reader
Data Infra Meetup | ByteDance's Native Parquet Reader
 
Data Infra Meetup | Uber's Data Storage Evolution
Data Infra Meetup | Uber's Data Storage EvolutionData Infra Meetup | Uber's Data Storage Evolution
Data Infra Meetup | Uber's Data Storage Evolution
 
Alluxio Monthly Webinar | Why NFS/NAS on Object Storage May Not Solve Your AI...
Alluxio Monthly Webinar | Why NFS/NAS on Object Storage May Not Solve Your AI...Alluxio Monthly Webinar | Why NFS/NAS on Object Storage May Not Solve Your AI...
Alluxio Monthly Webinar | Why NFS/NAS on Object Storage May Not Solve Your AI...
 
AI Infra Day | Accelerate Your Model Training and Serving with Distributed Ca...
AI Infra Day | Accelerate Your Model Training and Serving with Distributed Ca...AI Infra Day | Accelerate Your Model Training and Serving with Distributed Ca...
AI Infra Day | Accelerate Your Model Training and Serving with Distributed Ca...
 
AI Infra Day | The AI Infra in the Generative AI Era
AI Infra Day | The AI Infra in the Generative AI EraAI Infra Day | The AI Infra in the Generative AI Era
AI Infra Day | The AI Infra in the Generative AI Era
 
AI Infra Day | Hands-on Lab: CV Model Training with PyTorch & Alluxio on Kube...
AI Infra Day | Hands-on Lab: CV Model Training with PyTorch & Alluxio on Kube...AI Infra Day | Hands-on Lab: CV Model Training with PyTorch & Alluxio on Kube...
AI Infra Day | Hands-on Lab: CV Model Training with PyTorch & Alluxio on Kube...
 
AI Infra Day | The Generative AI Market And Intel AI Strategy and Product Up...
AI Infra Day | The Generative AI Market  And Intel AI Strategy and Product Up...AI Infra Day | The Generative AI Market  And Intel AI Strategy and Product Up...
AI Infra Day | The Generative AI Market And Intel AI Strategy and Product Up...
 
AI Infra Day | Composable PyTorch Distributed with PT2 @ Meta
AI Infra Day | Composable PyTorch Distributed with PT2 @ MetaAI Infra Day | Composable PyTorch Distributed with PT2 @ Meta
AI Infra Day | Composable PyTorch Distributed with PT2 @ Meta
 
AI Infra Day | Model Lifecycle Management Quality Assurance at Uber Scale
AI Infra Day | Model Lifecycle Management Quality Assurance at Uber ScaleAI Infra Day | Model Lifecycle Management Quality Assurance at Uber Scale
AI Infra Day | Model Lifecycle Management Quality Assurance at Uber Scale
 
Alluxio Monthly Webinar | Efficient Data Loading for Model Training on AWS
Alluxio Monthly Webinar | Efficient Data Loading for Model Training on AWSAlluxio Monthly Webinar | Efficient Data Loading for Model Training on AWS
Alluxio Monthly Webinar | Efficient Data Loading for Model Training on AWS
 

Recently uploaded

Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
chapter--4-software-project-planning.ppt
chapter--4-software-project-planning.pptchapter--4-software-project-planning.ppt
chapter--4-software-project-planning.pptkotipi9215
 
Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)OPEN KNOWLEDGE GmbH
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...MyIntelliSource, Inc.
 
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...gurkirankumar98700
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...stazi3110
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfkalichargn70th171
 
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideBuilding Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideChristina Lin
 
DNT_Corporate presentation know about us
DNT_Corporate presentation know about usDNT_Corporate presentation know about us
DNT_Corporate presentation know about usDynamic Netsoft
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...MyIntelliSource, Inc.
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...ICS
 
Professional Resume Template for Software Developers
Professional Resume Template for Software DevelopersProfessional Resume Template for Software Developers
Professional Resume Template for Software DevelopersVinodh Ram
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxbodapatigopi8531
 
Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackCloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackVICTOR MAESTRE RAMIREZ
 
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataAdobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataBradBedford3
 
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...Christina Lin
 
What is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWhat is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWave PLM
 
What is Binary Language? Computer Number Systems
What is Binary Language?  Computer Number SystemsWhat is Binary Language?  Computer Number Systems
What is Binary Language? Computer Number SystemsJheuzeDellosa
 
Project Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanationProject Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanationkaushalgiri8080
 

Recently uploaded (20)

Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝
 
chapter--4-software-project-planning.ppt
chapter--4-software-project-planning.pptchapter--4-software-project-planning.ppt
chapter--4-software-project-planning.ppt
 
Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
 
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
 
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideBuilding Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
 
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Mukherjee Nagar 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...Call Girls In Mukherjee Nagar 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
 
DNT_Corporate presentation know about us
DNT_Corporate presentation know about usDNT_Corporate presentation know about us
DNT_Corporate presentation know about us
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
 
Professional Resume Template for Software Developers
Professional Resume Template for Software DevelopersProfessional Resume Template for Software Developers
Professional Resume Template for Software Developers
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptx
 
Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackCloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStack
 
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataAdobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
 
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
 
What is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWhat is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need It
 
What is Binary Language? Computer Number Systems
What is Binary Language?  Computer Number SystemsWhat is Binary Language?  Computer Number Systems
What is Binary Language? Computer Number Systems
 
Project Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanationProject Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanation
 

Apache Iceberg - A Table Format for Hige Analytic Datasets

  • 1. Apache Iceberg Ryan Blue 2019 Big Data Orchestration Summit
  • 3. â—Ź Smarter processing engines â—‹ CBO, better join implementations â—‹ Result set caching, materialized views â—Ź Reduce manual data maintenance â—‹ Data librarian services â—‹ Declarative instead of imperative 5-year Challenges
  • 4. â—Ź Unsafe operations are everywhere â—‹ Writing to multiple partitions â—‹ Renaming a column â—Ź Interaction with object stores causes major headaches â—‹ Eventual consistency to performance problems â—‹ Output committers can’t fix it â—Ź Endless scale challenges Problem Whack-a-mole
  • 6. Iceberg is a scalable format for tables with a lot of best practices built in.
  • 7. A format? We already have Parquet, Avro and ORC . . .
  • 9. â—Ź File formats help you modify or skip data in a single file â—Ź Table formats do the same thing for a collection of files â—Ź To demonstrate this, consider Hive tables . . . A table format
  • 10. â—Ź Key idea: organize data in a directory tree date=20180513/ |- hour=18/ | |- ... |- hour=19/ | |- part-000.parquet | |- ... | |- part-031.parquet |- hour=20/ | |- ... Hive Tables
  • 11. â—Ź Filter: WHERE date = '20180513' AND hour = 19 date=20180513/ |- hour=18/ | |- ... |- hour=19/ | |- part-000.parquet | |- ... | |- part-031.parquet |- hour=20/ | |- ... Hive Tables
  • 12. â—Ź Problem: too much directory listing for large tables â—Ź Solution: use HMS to track partitions date=20180513/hour=19 -> hdfs:/.../date=20180513/hour=19 date=20180513/hour=20 -> hdfs:/.../date=20180513/hour=20 â—Ź The file system still tracks the files in each partition . . . Hive Metastore
  • 13. â—Ź State is kept in both the metastore and in a file system â—Ź Changes are not atomic without locking â—Ź Requires directory listing â—‹ O(n) listing calls, n = # matching partitions â—‹ Eventual consistency breaks correctness Hive Tables: Problems
  • 14. â—Ź Everything supports Hive tables* â—‹ Engines: Hive, Spark, Presto, Flink, Pig â—‹ Tools: Hudi, NiFi, Flume, Sqoop â—Ź Simplicity and ubiquity have made Hive tables indispensable â—Ź The whole ecosystem uses the same at-rest data! Hive Tables: Benefits
  • 16. â—Ź An open spec and community for at-rest data interchange â—‹ Maintain a clear spec for the format â—‹ Design for multiple implementations across languages â—‹ Support needs across projects to avoid fragmentation Iceberg’s Goals
  • 17. â—Ź Improve scale and reliability â—‹ Work on a single node, scale to a cluster â—‹ All changes are atomic, with serializable isolation â—‹ Native support for cloud object stores â—‹ Support many concurrent writers Iceberg’s Goals
  • 18. â—Ź Fix persistent usability problems â—‹ In-place evolution for schema and layout (no side-effects) â—‹ Hide partitioning: insulate queries from physical layout â—‹ Support time-travel, rollback, and metadata inspection â—‹ Configure tables, not jobs â—Ź Tables should have no unpleasant surprises Iceberg’s Goals
  • 19. â—Ź Key idea: track all files in a table over time â—‹ A snapshot is a complete list of files in a table â—‹ Each write produces and commits a new snapshot Iceberg’s Design S1 S2 S3 ...
  • 20. S1 S2 S3 ... R W â—Ź Readers use the current snapshot â—Ź Writers optimistically create new snapshots, then commit Iceberg’s Design
  • 21. In reality, it’s a bit more complicated.
  • 22. â—Ź All changes are atomic â—Ź No expensive (or inconsistent) file system operations â—Ź Snapshots are indexed for scan planning on a single node â—Ź CBO metrics are reliable â—Ź Versions for incremental updates and materialized views Iceberg Design Benefits
  • 24. â—Ź Production tables: tens of petabytes, millions of partitions â—‹ Scan planning fits on a single node â—‹ Advanced filtering enables more use cases â—‹ Overall performance is better â—Ź Low latency queries are faster for large tables Scale
  • 25. â—Ź Production Flink pipeline writing in 3 AWS regions â—Ź Lift service moving data into a single region â—Ź Merge service compacting small files Concurrency
  • 26. â—Ź Rollback is popular â—Ź Metadata tables â—‹ Track down the version a job read â—‹ Find the process that wrote a bad version Usability
  • 27. â—Ź Spark vectorization for faster bulk reads â—‹ Presto vectorization already done â—Ź Row-level delete encodings â—‹ MERGE INTO â—‹ ID equality predicates Future Work