Data Orchestration Summit
www.alluxio.io/data-orchestration-summit-2019
November 7, 2019
Apache Iceberg - A Table Format for Huge Analytic Datasets
Speaker:
Ryan Blue, Netflix
For more Alluxio events: https://www.alluxio.io/events/
9. ● File formats help you modify or skip data in a single file
● Table formats do the same thing for a collection of files
● To demonstrate this, consider Hive tables . . .
A table format
10. ● Key idea: organize data in a directory tree
date=20180513/
|- hour=18/
| |- ...
|- hour=19/
| |- part-000.parquet
| |- ...
| |- part-031.parquet
|- hour=20/
| |- ...
Hive Tables
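The directory-tree layout above can be sketched with a small helper. This is illustrative only; the helper name, table root, and column names are made up, and real Hive derives these paths from the table's partition columns at write time.

```python
from pathlib import PurePosixPath

def partition_path(table_root, date, hour):
    """Build a Hive-style partition directory for a record.

    Illustrative sketch only: each partition column becomes one
    `name=value` directory level under the table root.
    """
    return PurePosixPath(table_root) / f"date={date}" / f"hour={hour:02d}"

partition_path("/warehouse/events", "20180513", 19)
# -> PurePosixPath('/warehouse/events/date=20180513/hour=19')
```

Because the partition values are encoded in the path, finding a partition's data means navigating (and listing) directories.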
12. ● Problem: too much directory listing for large tables
● Solution: use the Hive Metastore (HMS) to track partitions
date=20180513/hour=19 -> hdfs:/.../date=20180513/hour=19
date=20180513/hour=20 -> hdfs:/.../date=20180513/hour=20
● The file system still tracks the files in each partition . . .
Hive Metastore
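The split responsibility on this slide can be sketched as follows; the paths, file names, and `plan_scan` helper are made up for illustration. The metastore maps partition keys to locations, but the files inside each partition must still be discovered by listing the file system.

```python
# Hypothetical metastore state: partition key -> partition location.
metastore = {
    ("20180513", "19"): "hdfs:/warehouse/events/date=20180513/hour=19",
    ("20180513", "20"): "hdfs:/warehouse/events/date=20180513/hour=20",
}

# Stand-in for the file system: files are tracked per directory.
file_system = {
    "hdfs:/warehouse/events/date=20180513/hour=19": [
        "part-000.parquet", "part-031.parquet",
    ],
    "hdfs:/warehouse/events/date=20180513/hour=20": ["part-000.parquet"],
}

def plan_scan(partitions):
    """Collect the files to scan: one metastore lookup plus one
    directory-listing call per matching partition -- O(n) listings."""
    files = []
    for key in partitions:
        location = metastore[key]       # metastore lookup
        files += file_system[location]  # listing call per partition
    return files
```

This per-partition listing loop is exactly the O(n) cost called out on the next slide.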
13. ● State is kept in both the metastore and in a file system
● Changes are not atomic without locking
● Requires directory listing
○ O(n) listing calls, n = # matching partitions
○ Eventual consistency breaks correctness
Hive Tables: Problems
14. ● Everything supports Hive tables*
○ Engines: Hive, Spark, Presto, Flink, Pig
○ Tools: Hudi, NiFi, Flume, Sqoop
● Simplicity and ubiquity have made Hive tables indispensable
● The whole ecosystem uses the same at-rest data!
Hive Tables: Benefits
16. ● An open spec and community for at-rest data interchange
○ Maintain a clear spec for the format
○ Design for multiple implementations across languages
○ Support needs across projects to avoid fragmentation
Iceberg’s Goals
17. ● Improve scale and reliability
○ Work on a single node, scale to a cluster
○ All changes are atomic, with serializable isolation
○ Native support for cloud object stores
○ Support many concurrent writers
Iceberg’s Goals
18. ● Fix persistent usability problems
○ In-place evolution for schema and layout (no side-effects)
○ Hide partitioning: insulate queries from physical layout
○ Support time-travel, rollback, and metadata inspection
○ Configure tables, not jobs
● Tables should have no unpleasant surprises
Iceberg’s Goals
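The "hide partitioning" goal above can be illustrated with a partition transform. This is a simplified sketch in the spirit of Iceberg's `hours()` transform, not its actual implementation: the partition value is derived from a timestamp column, so queries filter on the column and never reference the physical layout.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def hours(ts):
    """Partition value derived from the data itself: hours since the
    Unix epoch (simplified sketch of an hours()-style transform).
    The engine maps a filter on the timestamp column onto these
    partition values, keeping the layout hidden from queries."""
    return (ts - EPOCH) // timedelta(hours=1)

hours(datetime(2018, 5, 13, 19, 30, tzinfo=timezone.utc))
```

Because the transform is table metadata rather than query-side convention, users cannot write queries that silently miss partitions.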
19. ● Key idea: track all files in a table over time
○ A snapshot is a complete list of files in a table
○ Each write produces and commits a new snapshot
Iceberg’s Design
S1 S2 S3 ...
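The S1, S2, S3 chain can be sketched in a few lines; the class and method names here are illustrative, not Iceberg's actual API.

```python
class Table:
    """Minimal sketch of snapshot tracking: each snapshot is a
    complete, immutable list of the table's data files."""

    def __init__(self):
        self.snapshots = []  # S1, S2, S3, ...

    def current(self):
        return self.snapshots[-1] if self.snapshots else frozenset()

    def append(self, new_files):
        # A write produces a whole new snapshot; earlier snapshots
        # stay intact, which makes time-travel and rollback cheap.
        self.snapshots.append(self.current() | frozenset(new_files))

t = Table()
t.append(["part-000.parquet"])  # commits S1
t.append(["part-001.parquet"])  # commits S2
```

Reading any snapshot requires no directory listing: the file list is the metadata.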
20. S1 S2 S3 ...
R W
● Readers use the current snapshot
● Writers optimistically create new snapshots, then commit
Iceberg’s Design
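The optimistic writer protocol above can be sketched as a compare-and-swap; `try_commit` and the table dict are hypothetical stand-ins, not Iceberg's API.

```python
class CommitConflict(Exception):
    """Raised when another writer committed first."""

def try_commit(table, read_version, new_snapshot):
    """Optimistic commit sketch: the swap succeeds only if the table
    is still at the version this writer read; otherwise the writer
    must re-apply its changes to the newer snapshot and retry."""
    if table["version"] != read_version:
        raise CommitConflict("another writer committed first; retry")
    table["version"] += 1
    table["current"] = new_snapshot

table = {"version": 1, "current": frozenset({"a.parquet"})}
try_commit(table, 1, table["current"] | {"b.parquet"})  # succeeds
```

Because losers retry instead of blocking, many writers can work concurrently without locks.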
22. ● All changes are atomic
● No expensive (or inconsistent) file system operations
● Snapshots are indexed for scan planning on a single node
● Cost-based optimizer (CBO) metrics are reliable
● Versions for incremental updates and materialized views
Iceberg Design Benefits
24. ● Production tables: tens of petabytes, millions of partitions
○ Scan planning fits on a single node
○ Advanced filtering enables more use cases
○ Overall performance is better
● Low latency queries are faster for large tables
Scale
25. ● Production Flink pipeline writing in 3 AWS regions
● Lift service moving data into a single region
● Merge service compacting small files
Concurrency
26. ● Rollback is popular
● Metadata tables
○ Track down the version a job read
○ Find the process that wrote a bad version
Usability
27. ● Spark vectorization for faster bulk reads
○ Presto vectorization already done
● Row-level delete encodings
○ MERGE INTO
○ ID equality predicates
Future Work