2. Who supports the Iceberg spec?
• Popular vendors implement the Iceberg spec; it is better to verify what they
support and how
• For example, Athena has a VACUUM operation for cleanup
• Some vendors provide automation tools to handle cleanup
• This PPT will focus on Spark and Iceberg
3. Why do we need to care about it?
• We have a bucket in S3, partitioned by day and account_id, with Parquet
files; each file has millions of rows.
• I even do hourly --> daily --> weekly aggregation
• Parquet is a columnar format (binary, reads less data, compression, optimized
for big-data queries)
4. Partitioning ("Hive partitions") is a good start
• Reduces full table scans (S3 bucket directory listing) – search in a specific path
• Atomic updates (overwrite the whole partition)
• Does not work well with delta changes (can't update a single file in a partition)
• Reading from a partition during an overwrite – returns unstable data
• Orphan files – if you have them and are not aware of them – return unstable data
• Might require adding a custom column (day/hour derived from ts) for partitioning
5. But I want more power
• I want to be able to query the data faster (in addition to partitioning)
• I want ACID operations – if a user reads from S3 during a write, they get stable data ("versions")
• I want to update or delete one row without overwriting the partition ("delta")
• I don't want to create new day/hour columns (for partitioning) from the event timestamp
• I want to be able to change the schema (add a column, rename a column) without breaking
anything
• I have a new requirement: repartition by month, but I don't want to change the old data
structure.
11. What is Iceberg? A table format
• It is metadata about the Parquet files – the Parquet data is no longer a black
box for the engines.
• Metadata is used to allow new capabilities
16. Read flow
1. From the catalog, get the current snapshot version
2. From the snapshot, get the list of metadata files
3. Read the metadata files and filter by partition (including hidden partitions)
and statistics
4. Read only the relevant files instead of listing files in partitions
• Let's say I want to find all events where the user is "Jon", and I only partition by day
and account_id, and I added a bucket by country
• Without Iceberg it's a full table scan (instead of the whole S3 bucket I can reduce to a specific
day/account/country, but that is all)
• With Iceberg I can also reduce the number of files that will be read (explained later how; see the sketch below)
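A minimal sketch of such a query (the table local.db.events, its columns, and the literal values are hypothetical; the point is that partition pruning handles day/account/country while per-file statistics prune files for the name predicate):
scala> spark.sql("""
  SELECT * FROM local.db.events
  WHERE day = '2023-06-13' AND account_id = 42 AND country = 'US' AND name = 'Jon'
""").show()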
20. Example – create catalog "local"
# use --packages or add the Iceberg runtime jar to the Spark jars
# the same configuration works for spark-sql
./spark-shell --packages org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.2.0 \
  --conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \
  --conf spark.sql.catalog.spark_catalog.type=hive \
  --conf spark.sql.catalog.local=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.local.type=hadoop \
  --conf spark.sql.catalog.local.warehouse=$PWD/warehouse \
  --conf spark.sql.defaultCatalog=local \
  --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
21. Create new table
scala> spark.sql("""
CREATE TABLE local.db.demoTbl2 (id int, name string, ts timestamp)
USING iceberg
PARTITIONED BY (bucket(16, id), days(ts))
""")
• This example is missing a write order ("WRITE ORDERED BY", shown later)
29. Snapshot – Time travel – versions (GDPR…)
scala> spark.table("local.db.demoTbl2.snapshots").show();
scala> spark.sql("select * from local.db.demoTbl2 VERSION AS OF 5581909029793327227").show();
+---+----+--------------------+
| id|name| ts|
+---+----+--------------------+
| 1| A|2019-06-13 13:22:...|
| 2| B|2023-06-13 13:22:...|
+---+----+--------------------+
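Time travel by timestamp is also possible; a minimal sketch (the timestamp literal is just an example):
scala> spark.sql("select * from local.db.demoTbl2 TIMESTAMP AS OF '2023-06-13 14:00:00'").show();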
30. Statistics
Statistics help Iceberg filter files, since it saves per-file min/max values for each column...
If the data is sorted, the statistics give better results (sketch below).
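The per-file min/max values can be inspected through the table's "files" metadata table; a minimal sketch:
scala> spark.table("local.db.demoTbl2.files").select("file_path", "record_count", "lower_bounds", "upper_bounds").show(false)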
32. Sort: Global vs Local
A global sort requires repartitioning the data, so the entire dataset will be shuffled.
It's not about sorting the data already in S3; it's about how new data is inserted into partitions.
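A minimal DataFrame sketch of the difference (df is a placeholder DataFrame with name and ts columns):
// global sort: range-partitions by the sort keys, so the whole dataset is shuffled
scala> val globalSorted = df.orderBy("name", "ts")
// local sort: each task sorts only its own partition, no shuffle
scala> val localSorted = df.sortWithinPartitions("name", "ts")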
33. Order – how data is written by Spark (shuffle)
scala> spark.sql("ALTER TABLE local.db.demoTbl2 WRITE ORDERED BY name,ts")
https://iceberg.apache.org/docs/latest/spark-ddl/#alter-table--write-ordered-by
https://iceberg.apache.org/docs/latest/spark-writes/#writing-to-partitioned-tables
40. Maintenance - Compaction by rewrite (per partition)
• It is recommended to run compaction before cleaning expired snapshots when
using MOR (merge-on-read); see the sketch below
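A minimal sketch of that order on the demo table (the older_than value is just an example):
// 1. compact data files first (the sort variant is on the next slide)
scala> spark.sql("CALL local.system.rewrite_data_files(table => 'db.demoTbl2')")
// 2. then expire old snapshots
scala> spark.sql("CALL local.system.expire_snapshots(table => 'db.demoTbl2', older_than => TIMESTAMP '2023-06-01 00:00:00')")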
41. Compaction with Sort – rewrite per partition
scala> spark.sql("CALL local.system.rewrite_data_files(table => 'db.demoTbl2', strategy => 'sort', sort_order => 'zorder(name,ts)')")
rewrite_data_files – combines small files into larger files to reduce metadata overhead and runtime file-open cost.
rewrite_manifests – rewrites manifest files to optimize scan planning.
There are options that control the minimum number of files to use during a rewrite.
https://iceberg.apache.org/docs/latest/spark-procedures/
https://iceberg.apache.org/javadoc/1.2.0/org/apache/iceberg/actions/BinPackStrategy.html#field.summary
Note: each partition is sorted, but the table as a whole is not.
42. Order vs rewrite
• Using ORDERED BY – new files are added in the required order
• Using rewrite with the sort strategy to handle old data (and aggregation
for hourly / daily / weekly …) – compacts partitions + the data per
partition will be sorted
• Advantage of rewrite with sort and statistics on the name column:
• a search for all users where name = 'Jon' will scan a small number of
Parquet files per partition*, since Iceberg can filter out files that contain no names starting with
"J"
43. Reduce small files in the first place
• write.distribution-mode (sketch below)
• https://www.youtube.com/watch?v=4bOCDP-rhuM
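For example, the mode can be set as a table property; a minimal sketch ('hash' is one of none/hash/range):
scala> spark.sql("ALTER TABLE local.db.demoTbl2 SET TBLPROPERTIES ('write.distribution-mode' = 'hash')")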
44. Spark job execution – recap
A single computation unit performed on a single data partition is called a task. It is computed on a single core of the worker node.
A wide transformation needs data shuffling between nodes, which means data must be exchanged between nodes over the network (shuffle).
A wide transformation marks the end of a stage, and the next stage starts (see the sketch below).
https://www.analyticsvidhya.com/blog/2022/09/all-about-spark-jobs-stages-and-tasks/
https://www.linkedin.com/pulse/demystifying-spark-jobs-stages-data-shuffling-shahzad-aslam/
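A minimal sketch of a wide transformation that ends a stage (df is a placeholder DataFrame with an account_id column):
// groupBy is a wide transformation: rows with the same key must be shuffled
// to the same task, so a new stage starts after it
scala> val counts = df.groupBy("account_id").count()
scala> counts.show()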
45. Spark write data modes
• Shuffle the data by partition id (good if the partitions are evenly distributed)
• Range partitioning (good to mitigate data skew)
• Nothing (good for few partitions, otherwise may lead to a small-files problem)
https://medium.com/@ghoshsiddharth25/partitioning-vs-bucketing-in-apache-spark-a37b342082e4
46. • The impact of setting a table with "ORDERED BY" --> sets write.distribution-mode: range
• The impact of setting a table with "LOCALLY ORDERED BY" --> sets write.distribution-mode: none
• https://stackoverflow.com/questions/74951477/avoid-shuffling-when-inserting-into-sorted-iceberg-table
• df = spark.createDataFrame([(i, i*4) for i in range(100000)], ["a", "b"]).coalesce(1).sortWithinPartitions("a", "b")
df.writeTo("datalakelocal.ixanezis.table").append()
• If you prefer to avoid a shuffle when inserting 'sorted' data – set the order within each task, not across
tasks, using LOCALLY ORDERED BY (sketch below)
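A minimal sketch of setting a local write order on the demo table:
scala> spark.sql("ALTER TABLE local.db.demoTbl2 WRITE LOCALLY ORDERED BY name, ts")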
49. Main features
• Changes at the file level, not the partition level
• Atomic read/write (read stable data)
• Faster planning and execution
• Hidden partitioning
• Schema evolution
• Time travel
• Compaction and cleanup of old data
53. If our queries only need to filter by day and account,
if we don't have updates on the data,
if we don't need versions of the data,
…
we can continue without Iceberg.
But if we need them, Iceberg can provide these capabilities.