SF Big Analytics 2020-07-28
An anecdotal history of the Data Lake and various popular implementation frameworks: why certain tradeoffs were made to solve problems such as cloud storage, incremental processing, streaming and batch unification, mutable tables, ...
3. Vocabulary & Jargon
● T+1: event/transaction time plus 1 day - typical daily batch
● T+0: realtime processing that delivers insight with minimal delay
● T+0.000694: minutely batch; T+0.041666: hourly batch (see the quick check below)
● Delta Engine: Spark compiled in LLVM (similar to Dremio Gandiva)
● Skipping Index: Min/Max, Bloom Filter, and ValueList w/ Z-Ordering
● DML: Insert + Delete + Update + Upsert/Merge
● Time Travel: isolate & preserve multiple snapshot versions
● SCD-2: type 2 of multi-versioned data model to provide time travel
● Object/Cloud Storage: S3/IA/Glacier, ABS/Cool/Archive, GCS/NL/CL
● Streaming & Batch Unification: union historical bounded data with
continuous stream; interactively query both anytime
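A quick check of the T+x arithmetic (plain Python): the x is simply the batch interval expressed as a fraction of one day.
print(1 / (24 * 60))  # 0.000694... -> minutely batch (T+0.000694)
print(1 / 24)         # 0.041666... -> hourly batch (T+0.041666)
print(1.0)            # one full day -> daily batch (T+1)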
4. Data Warehouse vs. Data Lake v1 vs. Data Lake v2
Data Warehouse:
● Relational DB based MPP
● ETL done by IT team
● ELT inside MPP
● Star schema
● OLAP and BI focused
● SQL is the main DSL
● ODBC + JDBC as ⇿ interface
● <Expensive to scale …>
● Limited UD*F to run R and Data Mining inside database
Data Lake v1:
● HDFS + NoSQL
● ETL done by Java folks
● Nested schema or no schema
● Hive used by non-engineers
● Export data back to RDBMS for OLAP/BI
● M/R API & DSL dominated
● Scalable ML became possible
● <Hard to operate …>
● UD*F & SerDe made easier
Data Lake v2:
● Cloud + HTAP/MPP + NoSQL
● ETL done by data people in Spark and Presto
● Data model and schema matter again
● Streaming + Batch ⇨ unified
● More expressed in SQL + Python
● ML as a critical use case
● <Too confused to migrate…>
● Non-JVM engines emerge
5. Share So Much In Common
Despite all the marketing buzzwords and manipulations, ‘data lakehouse’, ‘data lake’, and ‘data warehouse’ all exist to solve the same data integration and insight generation problems. The implementations will continue to evolve as new hardware and software become viable and practical.
● ACID
● Mutable (Delete, Update, Compact)
● Schema (DDL and Evolution)
● Metadata (Rich, Performant)
● Open (Format, API, Tooling, Adoption)
● Fast (Optimized for Various Patterns)
● Extensible (User-defined ***, Federation)
● Intuitive (Data-centric Operation/Language)
● Productive (Achieve more with less)
● Practical (Join, Aggregate, Cache, View)
6. Solution Architecture Template
[Architecture diagram] Sources → CDC / Ingestion (T+0 or T+0.000694; T+0.0416 or T+1; ...) → Storage → Data Format and SerDe → Metadata Catalog and Table API → Unified Data Interface → consumers: Ads, BI/OLAP, Machine Learning, Deep Learning, Observability, Recommendation, A/B Test
7. Data Analytics in Cloud Storage
● Object Store is not a File System
○ There are no hierarchy semantics to rename or inherit (see the sketch below)
○ Objects are not appendable (in general)
○ Metadata is limited to a few KB
● REST is easy to program but RPC is much faster
○ Job/query planning step needs a lot of small scans (it is chatty)
○ 4MB cache block size may be inefficient for metadata operations
● Hadoop stack is tightly-coupled with HDFS notions
○ Hive and Spark (originally) were not optimized for object stores
○ Running HDFS as a cache/intermediate layer on a VM fleet can be useful yet suboptimal (and operationally heavy)
○ Data locality still matters for SLA-sensitive batch jobs
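A minimal sketch of the rename problem (assuming boto3; the bucket and keys are hypothetical): with no atomic rename, a "rename" on an object store is a full server-side copy plus a delete, repeated once per object under a prefix.
import boto3

s3 = boto3.client("s3")
bucket = "my-data-lake"  # hypothetical bucket

def rename_object(src_key: str, dst_key: str) -> None:
    # CopyObject rewrites the whole object; a directory "rename" repeats this
    # per file, i.e. O(files) instead of HDFS's O(1) metadata update
    s3.copy_object(Bucket=bucket,
                   CopySource={"Bucket": bucket, "Key": src_key},
                   Key=dst_key)
    s3.delete_object(Bucket=bucket, Key=src_key)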
8. Big Data becomes too big, even Metadata
● Computation costs keep rising for big data
○ Partitioning the files by date is not enough
○ Hot and warm data sizes are still very big (how to save $$$)
○ Analytics often scan big data files but discard 90% of records and 80% of fields, while the CPU, memory, network and I/O cost is billed for 100%
○ Columnar formats have skipping indexes and projection pushdown, but how do we fetch them swiftly? (see the sketch below)
● Hive Metadata only manages directories (HIVE-9452 abandoned)
○ Commits can happen at file or file-group level (instead of directory)
○ High-performance engines need better file layout and rich metadata at field level for each segment/chunk in a file
○ Process metadata via Java ORM?
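A minimal PySpark sketch of the skipping/pushdown point above (path and columns are hypothetical): projecting a few fields and pushing the filter down lets the Parquet reader skip row groups via min/max statistics instead of paying for 100% of the scan.
df = (spark.read.parquet("/data/events")           # hypothetical dataset
        .select("user_id", "event_time")           # projection pushdown: 2 of many fields
        .where("event_time >= '2020-07-01'"))      # predicate pushdown -> min/max skipping
df.explain()  # look for PushedFilters / ReadSchema in the scan node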
9. Immutable or Mutable
● Big data is all about immutable schemaless data
○ To get useful insights and features out of the raw data, we still have to dedupe, transform, conform, merge, aggregate, and backfill
○ Schema evolution happens frequently when merges & backfills occur
● Storage is infinite and compute is cheap
○ Why not rewrite the entire data file or directory all the time?
○ If it is slow, increase the number of partitions and executors
● Streaming and Batch Unification requires decent incremental logic (see the sketch after this list)
○ Store granularly with ACID isolation and clear watermarks
○ Process incrementally without partial reads or duplicates
○ Evolve reliably with enough flexibility
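A minimal Structured Streaming sketch of that incremental logic (hypothetical event_id/event_time schema): an event-time watermark bounds late data so deduplication and aggregation can progress without partial reads or duplicates.
from pyspark.sql.functions import window

events = spark.readStream.format("delta").load("/path/to/delta/events")
deduped = (events
    .withWatermark("event_time", "1 hour")         # tolerate up to 1 hour of late arrivals
    .dropDuplicates(["event_id", "event_time"]))   # dedupe within the watermark window
counts = deduped.groupBy(window("event_time", "10 minutes")).count()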
10. Are All Open Standards Equal?
● Hive 3.x
○ DML (based on ORC + Bucketing + on-the-fly Merge + Compactor)
○ Streaming Ingestion API, LLAP (daemon, caching, faster execution)
● Iceberg
○ Flexible Field Schema and Partition Layout Evolution (S3-first)
○ Hidden Partition (expression-based) and Bucket Transformation
● Delta Lake
○ Everything done by Spark + Parquet, DML (Copy-On-Write) + SCD-2
○ Fully supported in SparkSQL, PySpark and Delta Engine
● Hudi
○ Optimized UPSERT with indexing (record key, file id, partition path)
○ Merge-on-Read (low-latency write) or Copy-on-Write (HDFS-first)
11. Why is Iceberg so cool?
● Netflix is the most advanced AWS flagship partner
○ S3 is very scalable but a little bit over-simplified
○ Solve the critical cloud storage problems:
■ Avoid rename
■ Avoid directory hierarchy and naming convention
■ Aggregate (index) metadata into a compacted (manifest) file
● Netflix has migrated to Flink for stream processing
○ Fast ETL/analytics are needed to respond to its non-stop VOD
○ Runs one of the biggest Cassandra clusters (less mutable-data headache)
○ No urgent need for DML yet
● Netflix uses multiple data platforms/engines, and migrates faster than ...
○ Supports other file formats, engines, schemas, and bucketing by nature (see the metadata sketch below)
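A small sketch of Iceberg's queryable metadata (reusing the sample table from slide 16): the aggregated manifest metadata is exposed as regular tables, which is what keeps query planning cheap on S3.
# each snapshot is a commit; manifests hold the aggregated file-level metadata
spark.read.format("iceberg").load("prod.db.sample_table.snapshots").show()
spark.read.format("iceberg").load("prod.db.sample_table.manifests").show()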
12. Why is Delta Lake so handy?
● If you love to use Spark for ETL (Streaming & Batch), Delta Lake just makes it so much more powerful
○ The API and SQL syntax are so easy to use (especially for data folks)
○ Wide range of patterns provided by paid customers and OSS community
○ (feel locked-in?) it is well-tested, less buggy, and more usable in 3 clouds
● Databricks has full control and moves very fast
○ v0.2 (cloud storage support: June 2019)
○ v0.3 (DML: Aug 2019), v0.4 (SQL syntax, Python API: Sep 2019)
○ v0.5 (DML & compaction performance, Presto integration: Dec 2019)
○ v0.6 (Schema evolution during merge, read by path: Apr 2020)
○ v0.7 (DDL for Hive Metastore, retention control, ADLSv2: Jun 2020)
13. Why is Hudi faster?
● Uber is a true fast-data company
○ Its marketplace, supply-demand-matching business model depends heavily on near-real-time analytics:
■ Directly upsert the MySQL binlog into Hudi tables
■ Frequent bulk dumps of Cassandra are obviously infeasible
■ record_key is indexed (file names + bloom filters) to speed up lookups
■ Batch favors Copy-on-Write but Streaming likes Merge-on-Read
■ Snapshot query is faster, while Incremental query has low latency
● Uber is also committed to Flink
● Uber mainly builds its own data centers and HDFS clusters
○ So Hudi is mainly optimized for on-prem HDFS with Hive conventions
○ GCP and AWS support was added later (see the write sketch below)
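A minimal PySpark write sketch of the upsert path described above (option keys as in the Hudi quickstart; the dataframe df and its fields are hypothetical):
hudi_options = {
    "hoodie.table.name": "trips",
    "hoodie.datasource.write.recordkey.field": "uuid",            # the indexed record_key
    "hoodie.datasource.write.partitionpath.field": "partitionpath",
    "hoodie.datasource.write.precombine.field": "ts",             # latest ts wins on upsert
    "hoodie.datasource.write.operation": "upsert",
}
df.write.format("hudi").options(**hudi_options).mode("append").save(basePath)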
14. Code Snippets - Delta
# ---- PySpark ----
from delta.tables import DeltaTable
from pyspark.sql.functions import col

spark.readStream.format("delta").load("/path/to/delta/events")
deltaTable = DeltaTable.forPath(spark, "/path/to/delta-table")
# Upsert (merge) new data
newData = spark.range(0, 20)
(deltaTable.alias("oldData")
  .merge(
    newData.alias("newData"),
    "oldData.id = newData.id")
  .whenMatchedUpdate(set = { "id": col("newData.id") })
  .whenNotMatchedInsert(values = { "id": col("newData.id") })
  .execute())
// ---- Scala: read a specific table version ----
val df = spark.read.format("delta").load("/path/to/my/table@v5238")
// ---- Spark SQL ----
SELECT * FROM events -- query table in the metastore
SELECT * FROM delta.`/delta/events` -- query table by path
SELECT count(*) FROM my_table TIMESTAMP AS OF date_sub(current_date(), 1)
SELECT count(*) FROM my_table TIMESTAMP AS OF "2020-07-28 09:30:00.000"
SELECT count(*) FROM my_table VERSION AS OF 5238
UPDATE delta.`/data/events/` SET eventType = 'click' WHERE eventType = 'clck'
15. Code Snippets - Hudi
import org.apache.hudi.DataSourceReadOptions._
import spark.implicits._

val tripsSnapshotDF = spark.read.format("hudi").load(basePath + "/*/*/*/*")
// load(basePath) uses the "/partitionKey=partitionValue" folder structure for Spark auto partition discovery;
// since the partition (region/country/city) is nested 3 levels below basePath, use 4 levels of "/*/*/*/*" here
tripsSnapshotDF.createOrReplaceTempView("hudi_trips_snapshot")
spark.sql("select fare, begin_lon, begin_lat, ts from hudi_trips_snapshot where fare > 20.0").show()
spark.sql("select _hoodie_commit_time, _hoodie_record_key, _hoodie_partition_path, rider, driver, fare from hudi_trips_snapshot").show()
// -------------------
// collect recent commit times from the snapshot view (used to pick endTime below)
val commits = spark.sql("select distinct(_hoodie_commit_time) as commitTime from hudi_trips_snapshot order by commitTime").map(k => k.getString(0)).take(50)
val beginTime = "000" // represents all commits > this time
val endTime = commits(commits.length - 2) // point in time to query
// incrementally query data
val tripsPointInTimeDF = spark.read.format("hudi").
  option(QUERY_TYPE_OPT_KEY, QUERY_TYPE_INCREMENTAL_OPT_VAL).
  option(BEGIN_INSTANTTIME_OPT_KEY, beginTime).
  option(END_INSTANTTIME_OPT_KEY, endTime).
  load(basePath)
tripsPointInTimeDF.createOrReplaceTempView("hudi_trips_point_in_time")
spark.sql("select `_hoodie_commit_time`, fare, begin_lon, begin_lat, ts from hudi_trips_point_in_time where fare > 20.0").show()
16. Code Snippets - Iceberg
CREATE TABLE prod.db.sample_table (
    id bigint,
    data string,
    category string,
    ts timestamp)
USING iceberg
PARTITIONED BY (bucket(16, id), days(ts), category)

-- inspect the table's data files via a metadata table
SELECT * FROM prod.db.sample_table.files

-- rewrite one day's partition, keeping one record per uuid
INSERT OVERWRITE prod.my_app.logs
SELECT uuid, first(level), first(ts), first(message)
FROM prod.my_app.logs
WHERE cast(ts as date) = '2020-07-01'
GROUP BY uuid
spark.read.format("iceberg").load("hdfs://nn:8020/path/to/table")
// time travel to October 26, 1986 at 01:21:00
spark.read.option("as-of-timestamp", "499162860000").table("prod.db.sample_table")
// time travel to snapshot with ID 10963874102873L
spark.read.option("snapshot-id", 10963874102873L).table("prod.db.sample_table")
17. Time Travel
● Time Travel is focused on keeping both Batch and Streaming jobs isolated from Concurrent Reads & Writes
● The typical range for Time Travel is 7 ~ 30 days
● Machine Learning (feature re-generation) often needs to travel 3~24 months back
○ Need to reduce the precision/granularity of commits kept
in Data Lake (compact the logs to daily or monthly level)
■ Monthly baseline/snapshot + daily delta/changes
○ Consider a more advanced SCD-2 data model for ML (see the sketch below)
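A minimal SCD-2 query sketch (hypothetical user_features table with effective_from/effective_to columns): multi-versioned rows can reconstruct the state as of any date, far beyond the 7~30 day commit retention.
as_of = "2019-01-15"  # e.g. regenerate features from 18 months back
features = spark.sql(f"""
    SELECT * FROM user_features
    WHERE effective_from <= '{as_of}'
      AND (effective_to IS NULL OR effective_to > '{as_of}')
""")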
18. What Else Should be Part of Data Lake?
● Catalog (next-generation metastore alternatives)
○ Daemon service: scalable, easy to update and query
○ Federation across data centers (across cloud and on-premises)
● Better file formats and in-memory columnar formats (see the sketch below)
○ Less SerDe overhead, zero-copy, directly vectorized operations on compressed data (Artus-like); Tungsten v2 (Arrow-like)
● Performance and Data Management (for OLAP and AI)
○ New compute engines (non-JVM based) with smart caching and pre-aggregation & materialized views
○ Mechanisms to enable Time Travel with more flexibility and a wider range
○ Rich DSL with code generation and pushdown capability for faster AI
training and inference
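A small sketch of the Arrow-like zero-copy idea (assuming pyarrow; the file and columns are hypothetical): column chunks come back as vectorized record batches that an engine can operate on without per-row SerDe.
import pyarrow.parquet as pq

table = pq.read_table("/data/events/part-00000.parquet",
                      columns=["user_id", "fare"])   # columnar projection, no row decode
for batch in table.to_batches():                     # Arrow record batches (zero-copy views)
    print(batch.num_rows)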
19. How to Choose?
What are the pain points? Each Data Lake framework has its own emphasis; find the alignment with your pain points accordingly.
● Motivations
Smoother integration with your existing development language and compute engine?
Contribute to the framework to solve new problems?
Want more control of the infrastructure? Is the framework’s open-source governance friendly?
● Restrictions
...
20. ⧫ Delta Lake + Spark + Delta Engine +
Python support will effectively help
Databricks pull ahead in the race.
⧫ Flink community is all in for Iceberg.
⧫ GCP BigQuery, EMR, and Azure Synapse
(will) support reading from all table
formats, so you can lift-and-shift to ...
22. Additional Readings
● Gartner Research
○ Are You Shifting Your Problems to the Cloud or Solving Them?
○ Demystifying Cloud Data Warehouse Characteristics
● Google
○ Procella + Artus (https://www.youtube.com/watch?v=QwXj7o4dLpw)
○ BigQuery + Capacitor (https://bit.ly/bigquery-capacitor)
● Uber
○ Incremental Processing on Hadoop (https://bit.ly/uber-incremental)
● Alibaba
○ AnalyticDB (https://www.vldb.org/pvldb/vol12/p2059-zhan.pdf)
○ Iceberg Sink for Flink (https://bit.ly/flink-iceberg-sink)
○ Use Iceberg in Flink 中文 (https://developer.aliyun.com/article/755329)
24. Data Lake implementations are still evolving; don't hold your breath for a single best choice. Roll up your sleeves and build practical solutions with 2 or 3 options combined.
Computation engine gravity/bias
will directly reshape the waterscape.
The views expressed in this presentation are those of the author and do not reflect any policy or position of the employers of the author.
IA = Infrequent Access; NL = Nearline; CL = Coldline
https://flink.apache.org/news/2019/02/13/unified-batch-streaming-blink.html
During the v1 era, there were several attempts at non-JVM engines, but none of them really thrived. GPU, C++ and LLVM are changing the game for Deep Learning and OLAP. HDFS has reached its peak and is starting to fade away.
if all you have is a hammer, everything looks like a nail
The Druid/Pinot (near real time analytics) block can be merged into the Data Lake with T+0 ingestion and processing capability. It can also be replaced by HTAP (such as TiDB) as a super ODS.
AWS EFS is really an NFS/NAS solution, so it can't even replace HDFS on S3. Use EmrFileSystem instead. And s3a:// has known limitations (https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.1.5/bk_cloud-data-access/content/s3-limitations.html). Azure Data Lake Storage Gen2 (abfs://) is almost capable of replacing HDFS. Google Colossus is years ahead of OSS, a true distributed file system.
HIVE-14269, HIVE-14270, HIVE-20517, HADOOP-15364, HADOOP-15281
Hive ACID is not allowed if S3 is the storage layer (Hudi or others can be used as SerDe)
Snowflake uses FoundationDB to organize a lot of metadata to speed up its Query Processing (https://www.snowflake.com/blog/how-foundationdb-powers-snowflake-metadata-forward/). S3 Select was launched Apr 2018 to provide some pushdown (Sep 2018 for Parquet; Nov 2018, output committer to avoid rename).
Record-grained mutation is expensive, but how about mini-batch level?
GDPR, CCPA, IDPC and … affect offline big data as well.
Iceberg is mainly optimized for Parquet, but its spec and API are open to support ORC and Avro too.
The Bucket Transformation is designed to work across Hive, Spark, Presto and Flink.
Clearly distinguish and handle processing_time (a.k.a. arrival_time) vs. event_time (a.k.a. payload_time or transaction_time)
In short, Hudi can efficiently update/reconcile late-arriving records into the proper partition.
https://www.oreilly.com/content/ubers-case-for-incremental-processing-on-hadoop/
Similar to Aster Data Systems https://en.wikipedia.org/wiki/Aster_Data_Systems and https://github.com/sql-machine-learning/sqlflow
Anecdote: Huawei tried to donate CarbonData into open-source Spark a few years ago, but perhaps Delta was already the way to go; CarbonData never made it into a file format bundled with Spark.
CarbonData is a more comprehensive columnar format that supports rich indexing and even DML operations at the SerDe level. The latest FusionInsight MRS 8.0 is realizing the mutable Data Lake with streaming & batch combined on top of CarbonData. It will not be surprising if some of the Iceberg contributors & adopters have similar worries about Delta Lake.
https://www.qlik.com/us/-/media/files/resource-library/global-us/register/ebooks/eb-cloud-data-warehouse-comparison-ebook-en.pdf
https://www.gartner.com/doc/reprints?id=1-1ZA6E2JU&ct=200619&st=sb (Cloud Data Warehouse: Are You Shifting Your Problems to the Cloud or Solving Them?)
We need to speculate: where is Databricks forging ahead next? (Data Lake + ETL + ML + OLAP + DL + SaaS/Serverless + Data Management + …)
What shall we learn from Snowflake's architecture and success? (A Data Lake should be fast and intuitive to use; metadata is very important for optimizing query performance.)
Anecdote: Snowflake’s IPO market cap is about 10x bigger than Cloudera, that should tell something about how useful it is.