Reshape Data Lake (As of 2020.07)
Eric Sun @ LinkedIn
https://www.linkedin.com/in/ericsun
SF Big Analytics
Similar Presentation/Blog(s)
https://databricks.com/session_na20/a-thorough-comparison-of-delta-lake-
iceberg-and-hudi
https://databricks.com/session_eu19/end-to-end-spark-tensorflow-pytorch-
pipelines-with-databricks-delta
https://bit.ly/comparison-of-delta-iceberg-hudi-by-domisj
https://bit.ly/acid-iceberg-delta-comparison-by-wssbck
Disclaimer
The views expressed in this presentation are those of the author and do not reflect any policy or
position of the employers of the author. Audience may verify the anecdotes mentioned below.
Vocabulary & Jargon
● T+1: event/transaction time plus 1 day - typical daily-batch
T+0: realtime process which can deliver insight with minimal delay
T+0.000694: minutely-batch; T+0.041666: hourly-batch
● Delta Engine: Spark compiled in LLVM (similar to Dremio Gandiva)
● Skipping Index: Min/Max, Bloom Filter, and ValueList w/ Z-Ordering
● DML: Insert + Delete + Update + Upsert/Merge
● Time Travel: isolate & preserve multiple snapshot versions
● SCD-2: type 2 of multi-versioned data model to provide time travel
● Object/Cloud Storage: S3/IA/Glacier, ABS/Cool/Archive, GCS/NL/CL
● Streaming & Batch Unification: union historical bounded data with
continuous stream; interactively query both anytime
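The T+ fractions above are simply delays expressed as fractions of a day; a quick sanity check in plain Python (the helper name `t_plus` is illustrative, not tied to any engine):

```python
# Express batch latency as "T + fraction of a day".
def t_plus(minutes: float) -> float:
    """Delay in days for a given delay in minutes."""
    return minutes / (24 * 60)

daily    = t_plus(24 * 60)  # 1.0       -> T+1 daily-batch
hourly   = t_plus(60)       # ~0.041666 -> hourly-batch
minutely = t_plus(1)        # ~0.000694 -> minutely-batch

print(round(minutely, 6), round(hourly, 6), daily)
```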
Data Warehouse Data Lake v1 Data Lake v2
Relational DB based MPP
ETL done by IT team
ELT inside MPP
Star schema
OLAP and BI focused
SQL is the main DSL
ODBC + JDBC as ⇿ interface
<Expensive to scale …>
Limited UD*F to run R and Data
Mining inside database
HDFS + NoSQL
ETL done by Java folks
Nested schema or no schema
Hive used by non-engineers
Export data back to RDBMS
for OLAP/BI
M/R API & DSL dominated
Scalable ML became possible
<Hard to operate …>
UD*F & SerDe made easier
Cloud + HTAP/MPP + NoSQL
ETL done by data people in
Spark and Presto
Data model and schema matter
again
Streaming + Batch ⇨ unified
More expressed in SQL + Python
ML as a critical use case
<Too confused to migrate…>
Non-JVM engines emerge
Share So Much
Despite all the marketing
buzzwords and manipulation,
‘data lakehouse’, ‘data lake’,
and ‘data warehouse’ are all
there to solve the same data
integration and insight
generation problems.
The implementation will
continue to evolve as the new
hardware and software
become viable and practical.
● ACID
● Mutable (Delete, Update, Compact)
● Schema (DDL and Evolution)
● Metadata (Rich, Performant)
● Open (Format, API, Tooling, Adoption)
● Fast (Optimized for Various Patterns)
● Extensible (User-defined ***, Federation)
● Intuitive (Data-centric Operation/Language)
● Productive (Achieve more with less)
● Practical (Join, Aggregate, Cache, View)
In Common
Solution Architecture Template
Sources
Ads
BI/OLAP
Machine Learning
Deep Learning
Observability
Recommendation
A/B Test
Storage
Data Format and SerDe
Metadata Catalog and Table API
Unified Data Interface
CDC
Ingestion
T+0 or T+0.000694
T+0.0416 or T+1
...
Data Analytics in Cloud Storage
● Object Store Is Not a File System
○ There are no hierarchy semantics for rename or inheritance
○ Object is not appendable (in general)
○ Metadata is limited to a few KB
● REST is easy to program but RPC is much faster
○ Job/query planning step needs a lot of small scans (it is chatty)
○ 4MB cache block size may be inefficient for metadata operations
● Hadoop stack is tightly-coupled with HDFS notions
○ Hive and Spark (originally) were not optimized for object stores
○ Running HDFS as a cache/intermediate layer on a VM fleet can be
useful yet suboptimal (and operationally heavy)
○ Data locality still matters for SLA-sensitive batch jobs
Big Data becomes too big, even Metadata
● Computation costs keep rising for big data
○ Partitioning the files by date is not enough
○ Hot and warm data sizes are still very big (how to save $$$)
○ Analytics often scan big data files but discard 90% of records and 80%
of fields, yet the CPU, memory, network, and I/O are billed for 100%
○ Columnar formats have skipping indexes and projection pushdown, but
how can engines fetch them swiftly?
● Hive Metastore only manages directories (HIVE-9452 abandoned)
○ Commits can happen at file or file group level (instead of directory)
○ High-performance engines need better file layout and rich metadata at
field level for each segment/chunk in a file
○ Process metadata via Java ORM?
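To illustrate why field-level min/max metadata matters, here is a toy skipping index: chunks whose [min, max] range cannot contain the predicate value are pruned without ever being read. Stats of this kind live in Parquet/ORC footers and table-format manifests; the structure and names below are illustrative only.

```python
# Toy skipping index: prune file chunks by per-chunk min/max statistics.
chunks = [
    {"file": "part-0001", "min": 0,   "max": 99},
    {"file": "part-0002", "min": 100, "max": 199},
    {"file": "part-0003", "min": 200, "max": 299},
]

def chunks_to_scan(predicate_value, stats):
    """Keep only chunks whose min/max range can contain the value."""
    return [c["file"] for c in stats
            if c["min"] <= predicate_value <= c["max"]]

print(chunks_to_scan(150, chunks))  # only part-0002 is read
```

The engine still has to fetch these stats quickly, which is exactly why pushing them into a compact, query-able metadata layer (instead of per-file footers alone) pays off.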
Immutable or Mutable
● Big data is all about immutable schemaless data
○ To get useful insights and features out of the raw data, we still have to
dedupe, transform, conform, merge, aggregate, and backfill
○ Schema evolution happens frequently when merges & backfills occur
● Storage is infinite and compute is cheap
○ Why not rewrite the entire data file or directory every time?
○ If it is slow, increase the number of partitions and executors
● Streaming and Batch Unification requires a decent incremental logic
○ Store granularly with ACID isolation and clear watermarks
○ Process incrementally without partial reads or duplicates
○ Evolve reliably with enough flexibility
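The "process incrementally without partial reads or duplicates" point can be sketched as a watermark-driven read over an ACID commit log: each run consumes only commits strictly after the last saved watermark, so reruns neither re-read nor skip records. All names here are hypothetical, for illustration.

```python
# Watermark-driven incremental read over a commit log:
# consume only commits newer than the checkpoint, then advance it.
commits = [
    (1, ["a", "b"]),   # (commit_id, records)
    (2, ["c"]),
    (3, ["d", "e"]),
]

def incremental_read(commit_log, watermark):
    """Return (new_records, new_watermark); each commit is seen once."""
    new = [(cid, recs) for cid, recs in commit_log if cid > watermark]
    records = [r for _, recs in new for r in recs]
    new_wm = max((cid for cid, _ in new), default=watermark)
    return records, new_wm

batch1, wm = incremental_read(commits, 0)   # first run reads everything
batch2, wm = incremental_read(commits, wm)  # rerun: nothing new, no duplicates
```

Because commits are atomic, a reader never observes a half-written batch; the watermark alone decides what is "new".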
Are All Open Standards Equal?
● Hive 3.x
○ DML (based on ORC + Bucketing + on-the-fly Merge + Compactor)
○ Streaming Ingestion API, LLAP (daemon, caching, faster execution)
● Iceberg
○ Flexible Field Schema and Partition Layout Evolution (S3-first)
○ Hidden Partition (expression-based) and Bucket Transformation
● Delta Lake
○ Everything done by Spark + Parquet, DML (Copy-On-Write) + SCD-2
○ Fully supported in SparkSQL, PySpark and Delta Engine
● Hudi
○ Optimized UPSERT with indexing (record key, file id, partition path)
○ Merge-on-Read (low-latency write) or Copy-on-Write (HDFS-first)
Why is Iceberg so cool?
● Netflix is the most advanced AWS flagship partner
○ S3 is very scalable but a little bit over-simplified
○ Solve the critical cloud storage problems:
■ Avoid rename
■ Avoid directory hierarchy and naming convention
■ Aggregate (index) metadata into a compacted (manifest) file
● Netflix has migrated to Flink for stream processing
○ Fast ETL/analytics are needed to respond to its non-stop VOD
○ Runs one of the biggest Cassandra clusters (less mutability headache)
○ No urgent need for DML yet
● Netflix uses multiple data platforms/engines, and migrates faster than ...
○ Supports other file formats, engines, schemas, and bucketing by design
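The "avoid rename" and "aggregate metadata into a manifest" points boil down to a snapshot-pointer commit: writers add immutable data files, then publish by swapping a single pointer to a new manifest, so readers always see a complete snapshot. A conceptual sketch, heavily simplified and not Iceberg's actual on-disk layout:

```python
# Conceptual snapshot-pointer commit (Iceberg-style, simplified).
class Table:
    def __init__(self):
        self.manifests = {}  # immutable manifests: snapshot id -> data files
        self.current = None  # the only mutable piece of state
        self._next = 0

    def commit(self, added_files):
        base = self.manifests.get(self.current, [])
        sid = self._next
        self._next += 1
        self.manifests[sid] = base + added_files  # write a new manifest
        self.current = sid                        # atomic pointer swap

    def scan(self, snapshot=None):
        """Read the current snapshot, or time-travel to an older one."""
        sid = self.current if snapshot is None else snapshot
        return self.manifests.get(sid, [])

t = Table()
t.commit(["f1.parquet"])
t.commit(["f2.parquet"])
print(t.scan())   # ['f1.parquet', 'f2.parquet']
print(t.scan(0))  # time travel: ['f1.parquet']
```

No directory listing, no rename: planning reads one manifest, which is exactly what makes this layout S3-friendly.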
Why is Delta Lake so handy?
● If you love to use Spark for ETL (Streaming & Batch), Delta
Lake just makes it so much more powerful
○ The API and SQL syntax are so easy to use (especially for data folks)
○ Wide range of patterns provided by paid customers and OSS community
○ (feel locked-in?) it is well-tested, less buggy, and more usable in 3 clouds
● Databricks has full control and moves very fast
○ v0.2 (cloud storage support: June 2019)
○ v0.3 (DML: Aug 2019), v0.4(SQL syntax, Python API: Sep 2019)
○ v0.5 (DML & compaction performance, Presto integration: Dec 2019)
○ v0.6 (Schema evolution during merge, read by path: Apr 2020)
○ v0.7 (DDL for Hive Metastore, retention control, ADLSv2: Jun 2020)
Why is Hudi faster?
● Uber is a true fast-data company
○ Their marketplace, supply-demand-matching business model depends
heavily on near real-time analytics:
■ Directly upsert MySQL BIN logs into Hudi tables
■ Frequent bulk dumps of Cassandra are obviously infeasible
■ record_key is indexed (file names + bloom filters) to speed up
■ Batch favors Copy-on-Write but Streaming likes Merge-on-Read
■ Snapshot query is faster, while Incremental query has low latency
● Uber is also committed to Flink
● Uber mainly builds its own data centers and HDFS clusters
○ So Hudi is mainly optimized for on-prem HDFS with Hive convention
○ GCP and AWS support was added later
Code Snippets - Delta
spark.readStream.format("delta").load("/path/to/delta/events")
deltaTable = DeltaTable.forPath(spark, "/path/to/delta-table")
# Upsert (merge) new data
newData = spark.range(0, 20)
deltaTable.alias("oldData") \
  .merge(newData.alias("newData"), "oldData.id = newData.id") \
  .whenMatchedUpdate(set = {"id": col("newData.id")}) \
  .whenNotMatchedInsert(values = {"id": col("newData.id")}) \
  .execute()
val df = spark.read.format("delta").load("/path/to/my/table@v5238")
// ---- Spark SQL ----
SELECT * FROM events -- query table in the metastore
SELECT * FROM delta.`/delta/events` -- query table by path
SELECT count(*) FROM my_table TIMESTAMP AS OF date_sub(current_date(), 1)
SELECT count(*) FROM my_table TIMESTAMP AS OF "2020-07-28 09:30:00.000"
SELECT count(*) FROM my_table VERSION AS OF 5238
UPDATE delta.`/data/events/` SET eventType = 'click' WHERE eventType = 'clck'
Code Snippets - Hudi
val tripsSnapshotDF = spark.read.format("hudi").load(basePath + "/*/*/*/*")
// load(basePath) use "/partitionKey=partitionValue" folder structure for Spark auto partition discovery
// since partition (region/country/city) is 3 levels nested from basePath, using 4 levels "/*/*/*/*" here
tripsSnapshotDF.createOrReplaceTempView("hudi_trips_snapshot")
spark.sql("select fare, begin_lon, begin_lat, ts from hudi_trips_snapshot where fare > 20.0").show()
spark.sql("select _hoodie_commit_time, _hoodie_record_key, _hoodie_partition_path, rider, driver, fare from hudi_trips_snapshot").show()
// -------------------
val beginTime = "000" // Represents all commits > this time.
val endTime = commits(commits.length - 2) // point in time to query
// incrementally query data
val tripsPointInTimeDF = spark.read.format("hudi").
option(QUERY_TYPE_OPT_KEY, QUERY_TYPE_INCREMENTAL_OPT_VAL).
option(BEGIN_INSTANTTIME_OPT_KEY, beginTime).
option(END_INSTANTTIME_OPT_KEY, endTime).
load(basePath)
tripsPointInTimeDF.createOrReplaceTempView("hudi_trips_point_in_time")
spark.sql("select `_hoodie_commit_time`, fare, begin_lon, begin_lat, ts from hudi_trips_point_in_time where fare > 20.0").show()
Code Snippets - Iceberg
CREATE TABLE prod.db.sample_table (
id bigint,
data string,
category string,
ts timestamp)
USING iceberg
PARTITIONED BY (bucket(16, id), days(ts), category)
SELECT * FROM prod.db.sample_table.files
INSERT OVERWRITE prod.my_app.logs
SELECT uuid, first(level), first(ts), first(message)
FROM prod.my_app.logs
WHERE cast(ts as date) = '2020-07-01'
GROUP BY uuid
spark.read.format("iceberg").load("hdfs://nn:8020/path/to/table")
// time travel to October 26, 1986 at 01:21:00
spark.read.option("as-of-timestamp", "499162860000").table("prod.db.sample_table")
// time travel to snapshot with ID 10963874102873L
spark.read.option("snapshot-id", 10963874102873L).table("prod.db.sample_table")
Time Travel
● Time Travel is focused on keeping both Batch and Streaming
jobs isolated from the Concurrent Reads & Writes
● Typical Range for Time Travel is 7 ~ 30 days
● Machine Learning (Feature reGeneration) often needs to
travel to 3~24 months back
○ Need to reduce the precision/granularity of commits kept
in Data Lake (compact the logs to daily or monthly level)
■ Monthly baseline/snapshot + daily delta/changes
○ Consider a more advanced SCD-2 data model for ML
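A minimal SCD-2 (type 2) sketch in plain Python, showing how validity ranges give ML features arbitrary-range time travel without keeping every engine-level commit; column and helper names are illustrative:

```python
# Minimal SCD-2: keep every version of a record with a validity range.
HIGH = "9999-12-31"  # sentinel "end" for the currently-open version

def scd2_upsert(rows, key, new_value, as_of):
    """Close the open row for `key`, then append the new version."""
    for r in rows:
        if r["key"] == key and r["end"] == HIGH:
            r["end"] = as_of                     # close current version
    rows.append({"key": key, "value": new_value,
                 "start": as_of, "end": HIGH})   # open new version

def as_of_query(rows, key, date):
    """Time travel: return the version valid on `date` (ISO strings)."""
    for r in rows:
        if r["key"] == key and r["start"] <= date < r["end"]:
            return r["value"]

dim = []
scd2_upsert(dim, "user1", "SF", "2020-01-01")
scd2_upsert(dim, "user1", "NY", "2020-06-01")
print(as_of_query(dim, "user1", "2020-03-15"))  # SF
print(as_of_query(dim, "user1", "2020-07-01"))  # NY
```

Unlike snapshot retention (7~30 days), this model keeps history as data, so 3~24-month feature regeneration only pays for changed rows.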
What Else Should be Part of Data Lake?
● Catalog (next-generation metastore alternatives)
○ Daemon service: scalable, easy to update and query
○ Federation across data centers (across cloud and on-premises)
● Better file format and in-memory columnar format
○ Less SerDe overhead, zero-copy, directly vectorized operation on
compressed data (Artus-like). Tungsten v2 (Arrow-like)
● Performance and Data Management (for OLAP and AI)
○ New compute engines (non-JVM based) with smart caching,
pre-aggregation, and materialized views
○ Mechanism to enable Time Travel with more flexible and wider range
○ Rich DSL with code generation and pushdown capability for faster AI
training and inference
How to Choose?
What are the pain points?
Each Data Lake framework has
its own emphasis; align it with
your pain points accordingly.
● Motivations
Smoother integration with existing development
language and compute engine?
Contribute to the framework to solve new problems?
Want more control of the infrastructure? Is the
framework's open-source governance friendly?
● Restrictions
...
⧫ Delta Lake + Spark + Delta Engine +
Python support will effectively help
Databricks pull ahead in the race.
⧫ Flink community is all in for Iceberg.
⧫ GCP BigQuery, EMR, and Azure Synapse
(will) support reading from all table
formats, so you can lift-and-shift to ...
What’s next?
Data Lake can do more
Can be faster
Can be easier
Additional Readings
● Gartner Research
○ Are You Shifting Your Problems to the Cloud or Solving Them?
○ Demystifying Cloud Data Warehouse Characteristics
● Google
○ Procella + Artus (https://www.youtube.com/watch?v=QwXj7o4dLpw)
○ BigQuery + Capacitor (https://bit.ly/bigquery-capacitor)
● Uber
○ Incremental Processing on Hadoop (https://bit.ly/uber-incremental)
● Alibaba
○ AnalyticDB (https://www.vldb.org/pvldb/vol12/p2059-zhan.pdf)
○ Iceberg Sink for Flink (https://bit.ly/flink-iceberg-sink)
○ Use Iceberg in Flink 中文 (https://developer.aliyun.com/article/755329)
Data Lake implementations are still
evolving; don't hold your breath for
a single best choice. Roll up your
sleeves and build practical solutions
with 2 or 3 options combined.
Computation engine gravity/bias
will directly reshape the waterscape.
Thanks!
Presentation URL:
https://bit.ly/SFBA0728
Blog:
http://bit.ly/iceberg-delta-hudi-hive

 

Recently uploaded (20)

Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
 
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
 
Aminabad Call Girl Agent 9548273370 , Call Girls Service Lucknow
Aminabad Call Girl Agent 9548273370 , Call Girls Service LucknowAminabad Call Girl Agent 9548273370 , Call Girls Service Lucknow
Aminabad Call Girl Agent 9548273370 , Call Girls Service Lucknow
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
 
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docx
 
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
 
E-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptxE-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptx
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptx
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
 
Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts Service
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
 
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
 
Predicting Employee Churn: A Data-Driven Approach Project Presentation
Predicting Employee Churn: A Data-Driven Approach Project PresentationPredicting Employee Churn: A Data-Driven Approach Project Presentation
Predicting Employee Churn: A Data-Driven Approach Project Presentation
 
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
 

Reshape Data Lake (as of 2020.07)

  • 1. Reshape Data Lake (As of 2020.07) Eric Sun @ LinkedIn https://www.linkedin.com/in/ericsun SF Big Analytics
  • 3. Vocabulary & Jargon
    ● T+1: event/transaction time plus 1 day, the typical daily batch; T+0: realtime processing that delivers insight with minimal delay; T+0.000694: minutely batch; T+0.041666: hourly batch
    ● Delta Engine: Spark compiled in LLVM (similar to Dremio Gandiva)
    ● Skipping Index: Min/Max, Bloom Filter, and ValueList w/ Z-Ordering
    ● DML: Insert + Delete + Update + Upsert/Merge
    ● Time Travel: isolate & preserve multiple snapshot versions
    ● SCD-2: type 2 of the multi-versioned (slowly changing dimension) data model, used to provide time travel
    ● Object/Cloud Storage: S3/IA/Glacier, ABS/Cool/Archive, GCS/NL/CL
    ● Streaming & Batch Unification: union historical bounded data with a continuous stream; interactively query both anytime
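The T+x shorthand above is simply the batch interval expressed as a fraction of a day; a quick check of the arithmetic:

```python
# T+x expresses batch latency as a fraction of a day.
minutely = 1 / (24 * 60)   # one minute as a fraction of a day
hourly = 1 / 24            # one hour as a fraction of a day

print(round(minutely, 6))  # ~0.000694, the "T+0.000694" minutely batch
print(round(hourly, 6))    # ~0.041667, the "T+0.041666" hourly batch
```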
  • 4. Data Warehouse vs. Data Lake v1 vs. Data Lake v2
    Data Warehouse: relational-DB-based MPP; ETL done by the IT team; ELT inside the MPP; star schema; OLAP- and BI-focused; SQL is the main DSL; ODBC + JDBC as the ⇿ interface; <Expensive to scale …>; limited UD*F to run R and data mining inside the database
    Data Lake v1: HDFS + NoSQL; ETL done by Java folks; nested schema or no schema; Hive used by non-engineers; data exported back to RDBMS for OLAP/BI; M/R API & DSL dominated; scalable ML became possible; <Hard to operate …>; UD*F & SerDe made easier
    Data Lake v2: Cloud + HTAP/MPP + NoSQL; ETL done by data people in Spark and Presto; data model and schema matter again; Streaming + Batch ⇨ unified; more expressed in SQL + Python; ML as a critical use case; <Too confused to migrate…>; non-JVM engines emerge
  • 5. Share So Much In Common
    Despite all the marketing buzzwords and manipulation, 'data lakehouse', 'data lake', and 'data warehouse' all exist to solve the same data-integration and insight-generation problems. The implementations will continue to evolve as new hardware and software become viable and practical.
    ● ACID
    ● Mutable (Delete, Update, Compact)
    ● Schema (DDL and Evolution)
    ● Metadata (Rich, Performant)
    ● Open (Format, API, Tooling, Adoption)
    ● Fast (Optimized for Various Patterns)
    ● Extensible (User-defined ***, Federation)
    ● Intuitive (Data-centric Operation/Language)
    ● Productive (Achieve more with less)
    ● Practical (Join, Aggregate, Cache, View)
  • 6. Solution Architecture Template (diagram): Sources (Ads, ...) feed CDC Ingestion at T+0 or T+0.000694, and batch loads at T+0.0416 or T+1; Storage, Data Format and SerDe, and a Metadata Catalog and Table API together form a Unified Data Interface serving BI/OLAP, Machine Learning, Deep Learning, Observability, Recommendation, and A/B Test
  • 7. Data Analytics in Cloud Storage
    ● An Object Store is not a File System
      ○ There are no hierarchy semantics to rename or inherit
      ○ Objects are not appendable (in general)
      ○ Metadata is limited to a few KB
    ● REST is easy to program, but RPC is much faster
      ○ The job/query planning step needs a lot of small scans (it is chatty)
      ○ A 4MB cache block size may be inefficient for metadata operations
    ● The Hadoop stack is tightly coupled with HDFS notions
      ○ Hive and Spark (originally) were not optimized for object stores
      ○ Running HDFS as a cache/intermediate layer on a VM fleet can be useful yet suboptimal (and operationally heavy)
      ○ Data locality still matters for SLA-sensitive batch jobs
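To illustrate why the missing hierarchy semantics matter, here is a toy sketch (an in-memory dict, not any real SDK) of an object store as a flat key/value namespace: "renaming" a directory means copying and deleting every object under the prefix, one by one, with no atomic metadata operation.

```python
# Toy flat object store: keys merely look like paths; there are no directories.
store = {
    "warehouse/events/part-000.parquet": b"...",
    "warehouse/events/part-001.parquet": b"...",
}

def rename_prefix(store, old_prefix, new_prefix):
    """'Rename' by copy + delete: cost is O(number of objects), not O(1)."""
    for key in [k for k in store if k.startswith(old_prefix)]:
        store[new_prefix + key[len(old_prefix):]] = store.pop(key)

rename_prefix(store, "warehouse/events/", "warehouse/events_v2/")
print(sorted(store))  # every object was rewritten under the new prefix
```

This is why table formats such as Iceberg are designed to avoid renames entirely.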
  • 8. Big Data Becomes Too Big, Even the Metadata
    ● Computation costs keep rising for big data
      ○ Partitioning the files by date is not enough
      ○ Hot and warm data sizes are still very big (how to save $$$)
      ○ Analytics often scan big data files but discard 90% of the records and 80% of the fields, while the CPU, memory, network and I/O cost is billed for 100%
      ○ Columnar formats have skipping indexes and projection pushdown, but how can they be fetched swiftly?
    ● Hive Metadata only manages directories (HIVE-9452 abandoned)
      ○ Commits can happen at the file or file-group level (instead of the directory level)
      ○ High-performance engines need a better file layout and rich metadata at the field level for each segment/chunk in a file
      ○ Process metadata via a Java ORM?
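A minimal sketch of the skipping-index idea mentioned above (hypothetical per-file stats, not any specific format's footer layout): min/max metadata lets the planner prune files before reading a single data byte.

```python
# Hypothetical per-file column statistics, as a table format's manifest might carry.
files = [
    {"path": "part-000.parquet", "min_ts": 100, "max_ts": 199},
    {"path": "part-001.parquet", "min_ts": 200, "max_ts": 299},
    {"path": "part-002.parquet", "min_ts": 300, "max_ts": 399},
]

def prune(files, lo, hi):
    """Keep only files whose [min, max] range can overlap ts BETWEEN lo AND hi."""
    return [f["path"] for f in files if f["max_ts"] >= lo and f["min_ts"] <= hi]

print(prune(files, 250, 320))  # only 2 of the 3 files need to be scanned
```

Fetching these stats swiftly is exactly the problem the manifest files of Iceberg/Delta/Hudi address, instead of opening every file footer.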
  • 9. Immutable or Mutable?
    ● Big data is all about immutable, schemaless data
      ○ To get useful insights and features out of the raw data, we still have to dedupe, transform, conform, merge, aggregate, and backfill
      ○ Schema evolution happens frequently when merges & backfills occur
    ● Storage is infinite and compute is cheap
      ○ So why not rewrite the entire data file or directory every time?
      ○ If it is slow, increase the number of partitions and executors
    ● Streaming & Batch Unification requires decent incremental logic
      ○ Store granularly, with ACID isolation and clear watermarks
      ○ Process incrementally, without partial reads or duplicates
      ○ Evolve reliably, with enough flexibility
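A toy sketch of the incremental contract described above, using hypothetical record dicts: only records past the last committed watermark are processed, duplicates within the batch are collapsed by key, and the watermark advances atomically with the output.

```python
def incremental_batch(records, last_watermark):
    """Process only records newer than the committed watermark, deduped by key
    (latest event per key wins), then advance the watermark."""
    fresh = [r for r in records if r["ts"] > last_watermark]
    latest = {}
    for r in sorted(fresh, key=lambda r: r["ts"]):
        latest[r["key"]] = r          # later ts overwrites, so each key survives once
    new_watermark = max((r["ts"] for r in fresh), default=last_watermark)
    return list(latest.values()), new_watermark

records = [
    {"key": "a", "ts": 5}, {"key": "a", "ts": 7},   # duplicate key; ts=7 wins
    {"key": "b", "ts": 6}, {"key": "c", "ts": 3},   # ts=3 is before the watermark
]
out, wm = incremental_batch(records, last_watermark=4)
print(len(out), wm)  # 2 deduped records, watermark advanced to 7
```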
  • 10. Are All Open Standards Equal?
    ● Hive 3.x
      ○ DML (based on ORC + bucketing + on-the-fly merge + compactor)
      ○ Streaming Ingestion API, LLAP (daemon, caching, faster execution)
    ● Iceberg
      ○ Flexible field schema and partition layout evolution (S3-first)
      ○ Hidden partitioning (expression-based) and bucket transformation
    ● Delta Lake
      ○ Everything done by Spark + Parquet; DML (Copy-on-Write) + SCD-2
      ○ Fully supported in Spark SQL, PySpark and Delta Engine
    ● Hudi
      ○ Optimized UPSERT with indexing (record key, file id, partition path)
      ○ Merge-on-Read (low-latency write) or Copy-on-Write (HDFS-first)
  • 11. Why is Iceberg so cool?
    ● Netflix is the most advanced AWS flagship partner
      ○ S3 is very scalable but a little over-simplified
      ○ Iceberg solves the critical cloud-storage problems:
        ■ Avoid renames
        ■ Avoid directory hierarchy and naming conventions
        ■ Aggregate (index) metadata into a compacted (manifest) file
    ● Netflix has migrated to Flink for stream processing
      ○ Fast ETL/analytics are needed to respond to its non-stop VOD
      ○ w/ one of the biggest Cassandra clusters (less mutable-data headache)
      ○ No urgent need for DML yet
    ● Netflix uses multiple data platforms/engines, and migrates faster than ...
      ○ Supports other file formats, engines, schemas, and bucketing by nature
  • 12. Why is Delta Lake so handy?
    ● If you love to use Spark for ETL (Streaming & Batch), Delta Lake just makes it so much more powerful
      ○ The API and SQL syntax are so easy to use (especially for data folks)
      ○ A wide range of patterns contributed by paying customers and the OSS community
      ○ (Feel locked in?) It is well-tested, less buggy, and more usable across the 3 clouds
    ● Databricks has full control and moves very fast
      ○ v0.2 (cloud storage support: June 2019)
      ○ v0.3 (DML: Aug 2019), v0.4 (SQL syntax, Python API: Sep 2019)
      ○ v0.5 (DML & compaction performance, Presto integration: Dec 2019)
      ○ v0.6 (schema evolution during merge, read by path: Apr 2020)
      ○ v0.7 (DDL for Hive Metastore, retention control, ADLSv2: Jun 2020)
  • 13. Why is Hudi faster?
    ● Uber is a true fast-data company
      ○ Its marketplace, supply-demand-matching business model seriously depends on near-real-time analytics:
        ■ Directly upsert the MySQL binlog into a Hudi table
        ■ Frequently bulk-dumping Cassandra is obviously infeasible
        ■ record_key is indexed (file names + bloom filters) to speed up upserts
        ■ Batch favors Copy-on-Write, but Streaming likes Merge-on-Read
        ■ Snapshot queries are faster, while Incremental queries have low latency
    ● Uber is also committed to Flink
    ● Uber mainly builds its own data centers and HDFS clusters
      ○ So Hudi is mainly optimized for on-prem HDFS with Hive conventions
      ○ GCP and AWS support was added later
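A toy illustration of why a per-file record-key bloom filter speeds up upserts: most files can be ruled out without reading them. This is a simplistic two-hash filter for illustration only, not Hudi's actual index format.

```python
import hashlib

class TinyBloom:
    """Simplistic bloom filter (two hash positions); illustration only."""
    def __init__(self, bits=1024):
        self.bits = bits
        self.array = [False] * bits

    def _positions(self, key):
        digest = hashlib.sha256(key.encode()).digest()
        return [int.from_bytes(digest[i:i + 4], "big") % self.bits for i in (0, 4)]

    def add(self, key):
        for pos in self._positions(key):
            self.array[pos] = True

    def might_contain(self, key):
        # False means definitely absent; True means "maybe" (false positives possible).
        return all(self.array[pos] for pos in self._positions(key))

# One bloom filter per data file, built over that file's record keys.
file_filters = {}
for path, keys in {"f1": ["r1", "r2"], "f2": ["r3", "r4"]}.items():
    bf = TinyBloom()
    for k in keys:
        bf.add(k)
    file_filters[path] = bf

# An upsert for record_key "r3" only reads the files that might contain it.
candidates = [p for p, bf in file_filters.items() if bf.might_contain("r3")]
print(candidates)  # "f2" is always a candidate; "f1" only on a rare false positive
```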
  • 14. Code Snippets - Delta

    spark.readStream.format("delta").load("/path/to/delta/events")

    deltaTable = DeltaTable.forPath(spark, "/path/to/delta-table")
    # Upsert (merge) new data
    newData = spark.range(0, 20)
    deltaTable.alias("oldData") \
      .merge(newData.alias("newData"), "oldData.id = newData.id") \
      .whenMatchedUpdate(set = {"id": col("newData.id")}) \
      .whenNotMatchedInsert(values = {"id": col("newData.id")}) \
      .execute()

    val df = spark.read.format("delta").load("/path/to/my/table@v5238")

    -- ---- Spark SQL ----
    SELECT * FROM events                 -- query table in the metastore
    SELECT * FROM delta.`/delta/events`  -- query table by path
    SELECT count(*) FROM my_table TIMESTAMP AS OF date_sub(current_date(), 1)
    SELECT count(*) FROM my_table TIMESTAMP AS OF "2020-07-28 09:30:00.000"
    SELECT count(*) FROM my_table VERSION AS OF 5238
    UPDATE delta.`/data/events/` SET eventType = 'click' WHERE eventType = 'clck'
  • 15. Code Snippets - Hudi

    val tripsSnapshotDF = spark.read.format("hudi").load(basePath + "/*/*/*/*")
    // load(basePath) uses the "/partitionKey=partitionValue" folder structure for Spark auto partition discovery;
    // since the partition (region/country/city) is nested 3 levels below basePath, "/*/*/*/*" is used here
    tripsSnapshotDF.createOrReplaceTempView("hudi_trips_snapshot")
    spark.sql("select fare, begin_lon, begin_lat, ts from hudi_trips_snapshot where fare > 20.0").show()
    spark.sql("select _hoodie_commit_time, _hoodie_record_key, _hoodie_partition_path, rider, driver, fare from hudi_trips_snapshot").show()

    // -------------------
    val beginTime = "000"                          // represents all commits > this time
    val endTime = commits(commits.length - 2)      // point in time to query

    // incrementally query data
    val tripsPointInTimeDF = spark.read.format("hudi").
      option(QUERY_TYPE_OPT_KEY, QUERY_TYPE_INCREMENTAL_OPT_VAL).
      option(BEGIN_INSTANTTIME_OPT_KEY, beginTime).
      option(END_INSTANTTIME_OPT_KEY, endTime).
      load(basePath)
    tripsPointInTimeDF.createOrReplaceTempView("hudi_trips_point_in_time")
    spark.sql("select `_hoodie_commit_time`, fare, begin_lon, begin_lat, ts from hudi_trips_point_in_time where fare > 20.0").show()
  • 16. Code Snippets - Iceberg

    CREATE TABLE prod.db.sample_table (
      id bigint,
      data string,
      category string,
      ts timestamp)
    USING iceberg
    PARTITIONED BY (bucket(16, id), days(ts), category)

    SELECT * FROM prod.db.sample_table.files

    INSERT OVERWRITE prod.my_app.logs
    SELECT uuid, first(level), first(ts), first(message)
    FROM prod.my_app.logs
    WHERE cast(ts as date) = '2020-07-01'
    GROUP BY uuid

    spark.read.format("iceberg").load("hdfs://nn:8020/path/to/table")
    // time travel to October 26, 1986 at 01:21:00
    spark.read.option("as-of-timestamp", "499162860000").table("prod.db.sample_table")
    // time travel to snapshot with ID 10963874102873L
    spark.read.option("snapshot-id", 10963874102873L).table("prod.db.sample_table")
  • 17. Time Travel
    ● Time Travel is focused on keeping both Batch and Streaming jobs isolated amid concurrent reads & writes
    ● The typical range for Time Travel is 7 ~ 30 days
    ● Machine Learning (feature regeneration) often needs to travel 3 ~ 24 months back
      ○ Need to reduce the precision/granularity of the commits kept in the Data Lake (compact the logs to the daily or monthly level)
        ■ Monthly baseline/snapshot + daily delta/changes
      ○ Consider a more advanced SCD-2 data model for ML
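A minimal sketch of the SCD-2 idea: each key keeps multiple versioned rows with validity intervals, so an "as of" query many months back is just an interval lookup rather than a replay of commit logs. The row layout here is hypothetical.

```python
# Hypothetical SCD-2 rows: each version carries a [valid_from, valid_to) interval;
# valid_to = None marks the current version.
rows = [
    {"key": "user1", "plan": "free", "valid_from": 100, "valid_to": 200},
    {"key": "user1", "plan": "pro",  "valid_from": 200, "valid_to": None},
    {"key": "user2", "plan": "free", "valid_from": 150, "valid_to": None},
]

def as_of(rows, ts):
    """Return, per key, the attribute value that was valid at time ts."""
    return {
        r["key"]: r["plan"]
        for r in rows
        if r["valid_from"] <= ts and (r["valid_to"] is None or ts < r["valid_to"])
    }

print(as_of(rows, 180))  # user1 was still on 'free'; user2 already exists
print(as_of(rows, 250))  # user1 has since moved to 'pro'
```

Because the intervals live in the data model itself, the travel range is not bound by the table format's snapshot retention window.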
  • 18. What Else Should Be Part of the Data Lake?
    ● Catalog (next-generation metastore alternatives)
      ○ Daemon service: scalable, easy to update and query
      ○ Federation across data centers (across cloud and on-premises)
    ● Better file formats and in-memory columnar formats
      ○ Less SerDe overhead, zero-copy, directly vectorized operation on compressed data (Artus-like); Tungsten v2 (Arrow-like)
    ● Performance and Data Management (for OLAP and AI)
      ○ New compute engines (non-JVM based) with smart caching and pre-aggregation & materialized views
      ○ A mechanism to enable Time Travel over a more flexible and wider range
      ○ A rich DSL with code generation and pushdown capability for faster AI training and inference
  • 19. How to Choose?
    What are the pain points? Each Data Lake framework has its own emphasis, so match it against your pain points accordingly.
    ● Motivations: smoother integration with your existing development language and compute engine? Contributing to the framework to solve new problems? Wanting more control of the infrastructure, and is the framework's open-source governance friendly?
    ● Restrictions: ...
  • 20. ⧫ Delta Lake + Spark + Delta Engine + Python support will effectively help Databricks pull ahead in the race.
    ⧫ The Flink community is all-in for Iceberg.
    ⧫ GCP BigQuery, EMR, and Azure Synapse (will) support reading from all table formats, so you can lift-and-shift to ...
  • 21. What's next? The Data Lake can do more, can be faster, and can be easier.
  • 22. Additional Readings
    ● Gartner Research
      ○ Are You Shifting Your Problems to the Cloud or Solving Them?
      ○ Demystifying Cloud Data Warehouse Characteristics
    ● Google
      ○ Procella + Artus (https://www.youtube.com/watch?v=QwXj7o4dLpw)
      ○ BigQuery + Capacitor (https://bit.ly/bigquery-capacitor)
    ● Uber
      ○ Incremental Processing on Hadoop (https://bit.ly/uber-incremental)
    ● Alibaba
      ○ AnalyticDB (https://www.vldb.org/pvldb/vol12/p2059-zhan.pdf)
      ○ Iceberg Sink for Flink (https://bit.ly/flink-iceberg-sink)
      ○ Use Iceberg in Flink, in Chinese (https://developer.aliyun.com/article/755329)
  • 24. Data Lake implementations are still evolving; don't hold your breath for a single best choice. Roll up your sleeves and build practical solutions with 2 or 3 options combined. Compute-engine gravity/bias will directly reshape the waterscape.

Editor's Notes

  1. The views expressed in this presentation are those of the author and do not reflect any policy or position of the employers of the author.
  2. IA = Infrequent Access; NL = Near Line; CL = Code Line; https://flink.apache.org/news/2019/02/13/unified-batch-streaming-blink.html
  3. During the v1 era, there were several attempts at non-JVM engines, but none of them really thrived. GPU, C++ and LLVM are truly changing the game for Deep Learning and OLAP. HDFS is reaching its peak and starting to fade away.
  4. if all you have is a hammer, everything looks like a nail
  5. The Druid/Pinot (near real time analytics) block can be merged into the Data Lake with T+0 ingestion and processing capability. It can also be replaced by HTAP (such as TiDB) as a super ODS.
  6. AWS EFS is really an NFS/NAS solution, so it cannot replace HDFS; on S3, use EmrFileSystem instead. And s3a:// has known limitations: https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.1.5/bk_cloud-data-access/content/s3-limitations.html Azure Data Lake Storage Gen2 (abfs://) is almost capable of replacing HDFS. Google Colossus is years ahead of OSS, a true distributed file system. HIVE-14269, HIVE-14270, HIVE-20517, HADOOP-15364, HADOOP-15281. Hive ACID is not allowed if S3 is the storage layer (Hudi or others can be used as the SerDe).
  7. Snowflake uses FoundationDB to organize a lot of metadata to speed up its Query Processing. https://www.snowflake.com/blog/how-foundationdb-powers-snowflake-metadata-forward/ S3 Select was launched Apr 2018 to provide some pushdown (Sep 2018 for Parquet) (Nov 2018, output committer to avoid rename)
  8. Record-grained mutability is expensive, but how about the mini-batch level? GDPR, CCPA, IDPC and … affect offline big data as well.
  9. Iceberg is mainly optimized for Parquet, but its spec and API are open to support ORC and Avro too. The Bucket Transformation is designed to work across Hive, Spark, Presto and Flink.
  10. Clearly distinguish and handle processing_time (a.k.a. arrival_time) vs. event_time (a.k.a. payload_time or transaction_time) In short, Hudi can efficiently update/reconcile the late-arrival records to the proper partition. https://www.oreilly.com/content/ubers-case-for-incremental-processing-on-hadoop/
  11. https://databricks.com/blog/2019/02/04/introducing-delta-time-travel-for-large-scale-data-lakes.html https://docs.delta.io/0.7.0/delta-batch.html
  12. Very typical Hive style. Fine-grain control.
  13. Cool stuff
  14. Similar to Aster Data Systems https://en.wikipedia.org/wiki/Aster_Data_Systems and https://github.com/sql-machine-learning/sqlflow
  15. Similar to Aster Data Systems https://en.wikipedia.org/wiki/Aster_Data_Systems and https://github.com/sql-machine-learning/sqlflow
  16. Anecdote: Huawei was donating CarbonData to open-source Spark a few years ago, but perhaps Delta was already the chosen path, and CarbonData never made it in as a file format bundled with Spark. CarbonData is a more comprehensive columnar format that supports rich indexing and even DML operations at the SerDe level. The latest FusionInsights MRS 8.0 realizes the mutable Data Lake, with streaming & batch combined, on top of CarbonData. It would not be surprising if some of the Iceberg contributors & adopters have a similar worry about Delta Lake.
  17. Huawei CarbonData anecdote:
  18. https://www.qlik.com/us/-/media/files/resource-library/global-us/register/ebooks/eb-cloud-data-warehouse-comparison-ebook-en.pdf https://www.gartner.com/doc/reprints?id=1-1ZA6E2JU&ct=200619&st=sb (Cloud Data Warehouse: Are You Shifting Your Problems to the Cloud or Solving Them?)
  19. We should speculate about where Databricks will forge ahead next (Data Lake + ETL + ML + OLAP + DL + SaaS/Serverless + Data Management + …). What shall we learn from Snowflake's architecture and success? (A Data Lake should be fast and intuitive to use, and metadata is essential to optimizing query performance.) Anecdote: Snowflake's IPO market cap is about 10x bigger than Cloudera's, which should tell us something about how useful it is.