SF Big Analytics 2020-07-28
An anecdotal history of the Data Lake and various popular implementation frameworks: why certain tradeoffs were made to solve problems such as cloud storage, incremental processing, streaming and batch unification, mutable tables, ...
3. Vocabulary & Jargon
● T+1: event/transaction time plus 1 day - typical daily batch
● T+0: realtime processing that delivers insight with minimal delay
● T+0.000694: minutely batch; T+0.041666: hourly batch (see the quick check below)
● Delta Engine: Spark compiled in LLVM (similar to Dremio Gandiva)
● Skipping Index: Min/Max, Bloom Filter, and ValueList w/ Z-Ordering
● DML: Insert + Delete + Update + Upsert/Merge
● Time Travel: isolate & preserve multiple snapshot versions
● SCD-2: type 2 of multi-versioned data model to provide time travel
● Object/Cloud Storage: S3/IA/Glacier, ABS/Cool/Archive, GCS/NL/CL
● Streaming & Batch Unification: union historical bounded data with
continuous stream; interactively query both anytime
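A quick check of the T+x arithmetic (plain Python): the x is simply the batch interval expressed as a fraction of one day.
print(1 / (24 * 60))  # 0.000694... -> minutely batch (T+0.000694)
print(1 / 24)         # 0.041666... -> hourly batch (T+0.041666)
print(1.0)            # one full day -> daily batch (T+1)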
4. Data Warehouse vs. Data Lake v1 vs. Data Lake v2
Data Warehouse:
● Relational DB based MPP
● ETL done by IT team
● ELT inside MPP
● Star schema
● OLAP and BI focused
● SQL is the main DSL
● ODBC + JDBC as ⇿ interface
● <Expensive to scale …>
● Limited UD*F to run R and Data Mining inside database
Data Lake v1:
● HDFS + NoSQL
● ETL done by Java folks
● Nested schema or no schema
● Hive used by non-engineers
● Export data back to RDBMS for OLAP/BI
● M/R API & DSL dominated
● Scalable ML became possible
● <Hard to operate …>
● UD*F & SerDe made easier
Data Lake v2:
● Cloud + HTAP/MPP + NoSQL
● ETL done by data people in Spark and Presto
● Data model and schema matter again
● Streaming + Batch ⇨ unified
● More expressed in SQL + Python
● ML as a critical use case
● <Too confused to migrate…>
● Non-JVM engines emerge
5. Share So Much In Common
Despite all the marketing buzzwords and manipulations, ‘data lakehouse’, ‘data lake’, and ‘data warehouse’ all exist to solve the same data integration and insight generation problems. The implementations will continue to evolve as new hardware and software become viable and practical.
● ACID
● Mutable (Delete, Update, Compact)
● Schema (DDL and Evolution)
● Metadata (Rich, Performant)
● Open (Format, API, Tooling, Adoption)
● Fast (Optimized for Various Patterns)
● Extensible (User-defined ***, Federation)
● Intuitive (Data-centric Operation/Language)
● Productive (Achieve more with less)
● Practical (Join, Aggregate, Cache, View)
6. Solution Architecture Template
[Architecture diagram] Sources → CDC / Ingestion (T+0 or T+0.000694; T+0.0416 or T+1; ...) → Storage → Data Format and SerDe → Metadata Catalog and Table API → Unified Data Interface → consumers: Ads, BI/OLAP, Machine Learning, Deep Learning, Observability, Recommendation, A/B Test
7. Data Analytics in Cloud Storage
● Object Store is not a File System
○ There are no hierarchy semantics to rename or inherit (see the sketch below)
○ Objects are not appendable (in general)
○ Metadata is limited to a few KB
● REST is easy to program but RPC is much faster
○ Job/query planning step needs a lot of small scans (it is chatty)
○ 4MB cache block size may be inefficient for metadata operations
● Hadoop stack is tightly-coupled with HDFS notions
○ Hive and Spark (originally) were not optimized for object stores
○ Running HDFS as a cache/intermediate layer on a VM fleet can be useful yet suboptimal (and operationally heavy)
○ Data locality still matters for SLA-sensitive batch jobs
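A minimal sketch of the rename problem (assuming boto3; the bucket and keys are hypothetical): with no atomic rename, a "rename" on an object store is a full server-side copy plus a delete, repeated once per object under a prefix.
import boto3

s3 = boto3.client("s3")
bucket = "my-data-lake"  # hypothetical bucket

def rename_object(src_key: str, dst_key: str) -> None:
    # CopyObject rewrites the whole object; a directory "rename" repeats this
    # per file, i.e. O(files) instead of HDFS's O(1) metadata update
    s3.copy_object(Bucket=bucket,
                   CopySource={"Bucket": bucket, "Key": src_key},
                   Key=dst_key)
    s3.delete_object(Bucket=bucket, Key=src_key)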
8. Big Data becomes too big, even Metadata
● Computation costs keep rising for big data
○ Partitioning the files by date is not enough
○ Hot and warm data sizes are still very big (how to save $$$)
○ Analytics often scan big data files but discard 90% of records and 80% of fields, while the CPU, memory, network and I/O cost is billed for 100%
○ Columnar formats have skipping indexes and projection pushdown, but how do we fetch them swiftly? (see the sketch below)
● Hive Metadata only manages directories (HIVE-9452 abandoned)
○ Commits can happen at file or file-group level (instead of directory)
○ High-performance engines need better file layout and rich metadata at field level for each segment/chunk in a file
○ Process metadata via Java ORM?
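A minimal PySpark sketch of the skipping/pushdown point above (path and columns are hypothetical): projecting a few fields and pushing the filter down lets the Parquet reader skip row groups via min/max statistics instead of paying for 100% of the scan.
df = (spark.read.parquet("/data/events")           # hypothetical dataset
        .select("user_id", "event_time")           # projection pushdown: 2 of many fields
        .where("event_time >= '2020-07-01'"))      # predicate pushdown -> min/max skipping
df.explain()  # look for PushedFilters / ReadSchema in the scan node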
9. Immutable or Mutable
● Big data is all about immutable schemaless data
○ To get useful insights and features out of the raw data, we still have to dedupe, transform, conform, merge, aggregate, and backfill
○ Schema evolution happens frequently when merges & backfills occur
● Storage is infinite and compute is cheap
○ Why not rewrite the entire data file or directory all the time?
○ If it is slow, increase the number of partitions and executors
● Streaming and Batch Unification requires decent incremental logic (see the sketch after this list)
○ Store granularly with ACID isolation and clear watermarks
○ Process incrementally without partial reads or duplicates
○ Evolve reliably with enough flexibility
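A minimal Structured Streaming sketch of that incremental logic (hypothetical event_id/event_time schema): an event-time watermark bounds late data so deduplication and aggregation can progress without partial reads or duplicates.
from pyspark.sql.functions import window

events = spark.readStream.format("delta").load("/path/to/delta/events")
deduped = (events
    .withWatermark("event_time", "1 hour")         # tolerate up to 1 hour of late arrivals
    .dropDuplicates(["event_id", "event_time"]))   # dedupe within the watermark window
counts = deduped.groupBy(window("event_time", "10 minutes")).count()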
10. Are All Open Standards Equal?
● Hive 3.x
○ DML (based on ORC + Bucketing + on-the-fly Merge + Compactor)
○ Streaming Ingestion API, LLAP (daemon, caching, faster execution)
● Iceberg
○ Flexible Field Schema and Partition Layout Evolution (S3-first)
○ Hidden Partition (expression-based) and Bucket Transformation
● Delta Lake
○ Everything done by Spark + Parquet, DML (Copy-On-Write) + SCD-2
○ Fully supported in SparkSQL, PySpark and Delta Engine
● Hudi
○ Optimized UPSERT with indexing (record key, file id, partition path)
○ Merge-on-Read (low-latency write) or Copy-on-Write (HDFS-first)
11. Why is Iceberg so cool?
● Netflix is the most advanced AWS flagship partner
○ S3 is very scalable but a little bit over-simplified
○ Solve the critical cloud storage problems:
■ Avoid rename
■ Avoid directory hierarchy and naming convention
■ Aggregate (index) metadata into a compacted (manifest) file
● Netflix has migrated to Flink for stream processing
○ Fast ETL/analytics are needed to respond to its non-stop VOD
○ Runs one of the biggest Cassandra clusters (less mutable-data headache)
○ No urgent need for DML yet
● Netflix uses multiple data platforms/engines, and migrates faster than ...
○ Supports other file formats, engines, schemas, and bucketing by nature (see the metadata sketch below)
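A small sketch of Iceberg's queryable metadata (reusing the sample table from slide 16): the aggregated manifest metadata is exposed as regular tables, which is what keeps query planning cheap on S3.
# each snapshot is a commit; manifests hold the aggregated file-level metadata
spark.read.format("iceberg").load("prod.db.sample_table.snapshots").show()
spark.read.format("iceberg").load("prod.db.sample_table.manifests").show()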
12. Why is Delta Lake so handy?
● If you love to use Spark for ETL (Streaming & Batch), Delta Lake just makes it so much more powerful
○ The API and SQL syntax are so easy to use (especially for data folks)
○ Wide range of patterns provided by paid customers and OSS community
○ (feel locked-in?) it is well-tested, less buggy, and more usable in 3 clouds
● Databricks has full control and moves very fast
○ v0.2 (cloud storage support: June 2019)
○ v0.3 (DML: Aug 2019), v0.4 (SQL syntax, Python API: Sep 2019)
○ v0.5 (DML & compaction performance, Presto integration: Dec 2019)
○ v0.6 (Schema evolution during merge, read by path: Apr 2020)
○ v0.7 (DDL for Hive Metastore, retention control, ADLSv2: Jun 2020)
13. Why is Hudi faster?
● Uber is a true fast-data company
○ Its marketplace, supply-demand-matching business model depends heavily on near-real-time analytics:
■ Directly upsert the MySQL binlog into Hudi tables
■ Frequent bulk dumps of Cassandra are obviously infeasible
■ record_key is indexed (file names + bloom filters) to speed up lookups
■ Batch favors Copy-on-Write but Streaming likes Merge-on-Read
■ Snapshot query is faster, while Incremental query has low latency
● Uber is also committed to Flink
● Uber mainly builds its own data centers and HDFS clusters
○ So Hudi is mainly optimized for on-prem HDFS with Hive conventions
○ GCP and AWS support was added later (see the write sketch below)
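A minimal PySpark write sketch of the upsert path described above (option keys as in the Hudi quickstart; the dataframe df and its fields are hypothetical):
hudi_options = {
    "hoodie.table.name": "trips",
    "hoodie.datasource.write.recordkey.field": "uuid",            # the indexed record_key
    "hoodie.datasource.write.partitionpath.field": "partitionpath",
    "hoodie.datasource.write.precombine.field": "ts",             # latest ts wins on upsert
    "hoodie.datasource.write.operation": "upsert",
}
df.write.format("hudi").options(**hudi_options).mode("append").save(basePath)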
14. Code Snippets - Delta
# ---- PySpark ----
from delta.tables import DeltaTable
from pyspark.sql.functions import col

spark.readStream.format("delta").load("/path/to/delta/events")
deltaTable = DeltaTable.forPath(spark, "/path/to/delta-table")
# Upsert (merge) new data
newData = spark.range(0, 20)
(deltaTable.alias("oldData")
  .merge(
    newData.alias("newData"),
    "oldData.id = newData.id")
  .whenMatchedUpdate(set = { "id": col("newData.id") })
  .whenNotMatchedInsert(values = { "id": col("newData.id") })
  .execute())
// ---- Scala: read a specific table version ----
val df = spark.read.format("delta").load("/path/to/my/table@v5238")
// ---- Spark SQL ----
SELECT * FROM events -- query table in the metastore
SELECT * FROM delta.`/delta/events` -- query table by path
SELECT count(*) FROM my_table TIMESTAMP AS OF date_sub(current_date(), 1)
SELECT count(*) FROM my_table TIMESTAMP AS OF "2020-07-28 09:30:00.000"
SELECT count(*) FROM my_table VERSION AS OF 5238
UPDATE delta.`/data/events/` SET eventType = 'click' WHERE eventType = 'clck'
15. Code Snippets - Hudi
import org.apache.hudi.DataSourceReadOptions._
import spark.implicits._

val tripsSnapshotDF = spark.read.format("hudi").load(basePath + "/*/*/*/*")
// load(basePath) uses the "/partitionKey=partitionValue" folder structure for Spark auto partition discovery;
// since the partition (region/country/city) is nested 3 levels below basePath, use 4 levels of "/*/*/*/*" here
tripsSnapshotDF.createOrReplaceTempView("hudi_trips_snapshot")
spark.sql("select fare, begin_lon, begin_lat, ts from hudi_trips_snapshot where fare > 20.0").show()
spark.sql("select _hoodie_commit_time, _hoodie_record_key, _hoodie_partition_path, rider, driver, fare from hudi_trips_snapshot").show()
// -------------------
// collect recent commit times from the snapshot view (used to pick endTime below)
val commits = spark.sql("select distinct(_hoodie_commit_time) as commitTime from hudi_trips_snapshot order by commitTime").map(k => k.getString(0)).take(50)
val beginTime = "000" // represents all commits > this time
val endTime = commits(commits.length - 2) // point in time to query
// incrementally query data
val tripsPointInTimeDF = spark.read.format("hudi").
  option(QUERY_TYPE_OPT_KEY, QUERY_TYPE_INCREMENTAL_OPT_VAL).
  option(BEGIN_INSTANTTIME_OPT_KEY, beginTime).
  option(END_INSTANTTIME_OPT_KEY, endTime).
  load(basePath)
tripsPointInTimeDF.createOrReplaceTempView("hudi_trips_point_in_time")
spark.sql("select `_hoodie_commit_time`, fare, begin_lon, begin_lat, ts from hudi_trips_point_in_time where fare > 20.0").show()
16. Code Snippets - Iceberg
CREATE TABLE prod.db.sample_table (
    id bigint,
    data string,
    category string,
    ts timestamp)
USING iceberg
PARTITIONED BY (bucket(16, id), days(ts), category)

-- inspect the table's data files via a metadata table
SELECT * FROM prod.db.sample_table.files

-- rewrite one day's partition, keeping one record per uuid
INSERT OVERWRITE prod.my_app.logs
SELECT uuid, first(level), first(ts), first(message)
FROM prod.my_app.logs
WHERE cast(ts as date) = '2020-07-01'
GROUP BY uuid
spark.read.format("iceberg").load("hdfs://nn:8020/path/to/table")
// time travel to October 26, 1986 at 01:21:00
spark.read.option("as-of-timestamp", "499162860000").table("prod.db.sample_table")
// time travel to snapshot with ID 10963874102873L
spark.read.option("snapshot-id", 10963874102873L).table("prod.db.sample_table")
17. Time Travel
● Time Travel is focused on keeping both Batch and Streaming jobs isolated from Concurrent Reads & Writes
● The typical range for Time Travel is 7 ~ 30 days
● Machine Learning (feature re-generation) often needs to travel 3~24 months back
○ Need to reduce the precision/granularity of commits kept
in Data Lake (compact the logs to daily or monthly level)
■ Monthly baseline/snapshot + daily delta/changes
○ Consider a more advanced SCD-2 data model for ML (see the sketch below)
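A minimal SCD-2 query sketch (hypothetical user_features table with effective_from/effective_to columns): multi-versioned rows can reconstruct the state as of any date, far beyond the 7~30 day commit retention.
as_of = "2019-01-15"  # e.g. regenerate features from 18 months back
features = spark.sql(f"""
    SELECT * FROM user_features
    WHERE effective_from <= '{as_of}'
      AND (effective_to IS NULL OR effective_to > '{as_of}')
""")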
18. What Else Should be Part of Data Lake?
● Catalog (next-generation metastore alternatives)
○ Daemon service: scalable, easy to update and query
○ Federation across data centers (across cloud and on-premises)
● Better file formats and in-memory columnar formats (see the sketch below)
○ Less SerDe overhead, zero-copy, directly vectorized operations on compressed data (Artus-like); Tungsten v2 (Arrow-like)
● Performance and Data Management (for OLAP and AI)
○ New compute engines (non-JVM based) with smart caching and pre-aggregation & materialized views
○ Mechanisms to enable Time Travel with more flexibility and a wider range
○ Rich DSL with code generation and pushdown capability for faster AI
training and inference
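A small sketch of the Arrow-like zero-copy idea (assuming pyarrow; the file and columns are hypothetical): column chunks come back as vectorized record batches that an engine can operate on without per-row SerDe.
import pyarrow.parquet as pq

table = pq.read_table("/data/events/part-00000.parquet",
                      columns=["user_id", "fare"])   # columnar projection, no row decode
for batch in table.to_batches():                     # Arrow record batches (zero-copy views)
    print(batch.num_rows)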
19. How to Choose?
What are the pain points? Each Data Lake framework has its own emphasis; find the alignment with your pain points accordingly.
● Motivations
Smoother integration with your existing development language and compute engine?
Contribute to the framework to solve new problems?
Want more control of the infrastructure? Is the framework’s open-source governance friendly?
● Restrictions
...
20. ⧫ Delta Lake + Spark + Delta Engine +
Python support will effectively help
Databricks pull ahead in the race.
⧫ Flink community is all in for Iceberg.
⧫ GCP BigQuery, EMR, and Azure Synapse
(will) support reading from all table
formats, so you can lift-and-shift to ...
22. Additional Readings
● Gartner Research
○ Are You Shifting Your Problems to the Cloud or Solving Them?
○ Demystifying Cloud Data Warehouse Characteristics
● Google
○ Procella + Artus (https://www.youtube.com/watch?v=QwXj7o4dLpw)
○ BigQuery + Capacitor (https://bit.ly/bigquery-capacitor)
● Uber
○ Incremental Processing on Hadoop (https://bit.ly/uber-incremental)
● Alibaba
○ AnalyticDB (https://www.vldb.org/pvldb/vol12/p2059-zhan.pdf)
○ Iceberg Sink for Flink (https://bit.ly/flink-iceberg-sink)
○ Use Iceberg in Flink 中文 (https://developer.aliyun.com/article/755329)
24. Data Lake implementations are still evolving; don't hold your breath for a single best choice. Roll up your sleeves and build practical solutions with 2 or 3 options combined.
Computation engine gravity/bias
will directly reshape the waterscape.
The views expressed in this presentation are those of the author and do not reflect any policy or position of the employers of the author.
IA = Infrequent Access; NL = Nearline; CL = Coldline
https://flink.apache.org/news/2019/02/13/unified-batch-streaming-blink.html
During the v1 era, there were several attempts at non-JVM engines, but none of them really thrived. GPU, C++ and LLVM are changing the game for Deep Learning and OLAP. HDFS has reached its peak and is starting to fade away.
if all you have is a hammer, everything looks like a nail
The Druid/Pinot (near real time analytics) block can be merged into the Data Lake with T+0 ingestion and processing capability. It can also be replaced by HTAP (such as TiDB) as a super ODS.
AWS EFS is really an NFS/NAS solution, so it can't even replace HDFS on S3. Use EmrFileSystem instead. And s3a:// has known limitations (https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.1.5/bk_cloud-data-access/content/s3-limitations.html). Azure Data Lake Storage Gen2 (abfs://) is almost capable of replacing HDFS. Google Colossus is years ahead of OSS, a true distributed file system.
HIVE-14269, HIVE-14270, HIVE-20517, HADOOP-15364, HADOOP-15281
Hive ACID is not allowed if S3 is the storage layer (Hudi or others can be used as SerDe)
Snowflake uses FoundationDB to organize a lot of metadata to speed up its Query Processing (https://www.snowflake.com/blog/how-foundationdb-powers-snowflake-metadata-forward/). S3 Select was launched Apr 2018 to provide some pushdown (Sep 2018 for Parquet; Nov 2018, output committer to avoid rename).
Record-grained mutation is expensive, but how about mini-batch level?
GDPR, CCPA, IDPC and … affect offline big data as well.
Iceberg is mainly optimized for Parquet, but its spec and API are open to support ORC and Avro too.
The Bucket Transformation is designed to work across Hive, Spark, Presto and Flink.
Clearly distinguish and handle processing_time (a.k.a. arrival_time) vs. event_time (a.k.a. payload_time or transaction_time)
In short, Hudi can efficiently update/reconcile late-arriving records into the proper partition.
https://www.oreilly.com/content/ubers-case-for-incremental-processing-on-hadoop/
Similar to Aster Data Systems https://en.wikipedia.org/wiki/Aster_Data_Systems and https://github.com/sql-machine-learning/sqlflow
Anecdote: Huawei tried to donate CarbonData into open-source Spark a few years ago, but perhaps Delta was already the way to go; CarbonData never made it into a file format bundled with Spark.
CarbonData is a more comprehensive columnar format that supports rich indexing and even DML operations at the SerDe level. The latest FusionInsight MRS 8.0 is realizing the mutable Data Lake with streaming & batch combined on top of CarbonData. It will not be surprising if some of the Iceberg contributors & adopters have similar worries about Delta Lake.
https://www.qlik.com/us/-/media/files/resource-library/global-us/register/ebooks/eb-cloud-data-warehouse-comparison-ebook-en.pdf
https://www.gartner.com/doc/reprints?id=1-1ZA6E2JU&ct=200619&st=sb (Cloud Data Warehouse: Are You Shifting Your Problems to the Cloud or Solving Them?)
We need to speculate: where is Databricks forging ahead next? (Data Lake + ETL + ML + OLAP + DL + SaaS/Serverless + Data Management + …)
What shall we learn from Snowflake's architecture and success? (A Data Lake should be fast and intuitive to use; metadata is very important for optimizing query performance.)
Anecdote: Snowflake’s IPO market cap is about 10x bigger than Cloudera, that should tell something about how useful it is.