2. Who supports the Iceberg spec?
• Popular vendors implement the Iceberg spec; it is better to verify what they
support and how
• For example, Athena has a VACUUM operation for cleanup
• Some vendors provide automation tools to handle cleanup
• This PPT will focus on Spark and Iceberg
3. Why do we need to care about it?
• We have a bucket in S3, partitioned by day and account_id, with Parquet
files; each file has millions of rows.
• I even do hourly --> daily --> weekly aggregation
• Parquet is a columnar format (binary, reads less data, compression, optimized
for big-data queries)
4. Partitioning ("Hive partitions") is a good start
• Reduces full table scans (S3 bucket directory listing) – search in a specific path
• Atomic updates (overwrite the whole partition)
• Does not work well with delta changes (can't update a single file in a partition)
• Reading from a partition during an overwrite – returns unstable data
• Orphan files – if you have them and are not aware of them – return unstable data
• Might require adding a custom column (day/hour derived from ts) for partitioning
5. But I want more power
• I want to be able to query the data faster (in addition to partitioning)
• I want ACID operations – if a user reads from S3 during a write, they get stable data ("versions")
• I want to update or delete one row without overwriting the partition ("delta")
• I don't want to create new day/hour columns (for partitioning) from the event timestamp
• I want to be able to change the schema (add a column, rename a column) without breaking
anything
• I have a new requirement: repartition by month, but I don't want to change the old data
structure.
11. What is Iceberg? A table format
• It is metadata about the Parquet files – the Parquet data is no longer a black
box for the engines.
• Metadata is used to allow new capabilities
16. Read flow
1. From the catalog, get the current snapshot version
2. From the snapshot, get the list of metadata files
3. Read the metadata files and filter by partition (including hidden partitions)
and statistics
4. Read only the relevant files instead of listing files in partitions
• Let's say I want to find all events where the user is "Jon", and I only partition by day
and account_id, and I added a bucket by country
• Without Iceberg it's a full table scan (instead of the whole S3 bucket I can reduce to a specific
day/account/country, but that is all)
• With Iceberg I can also reduce the number of files that will be read (explained later how; see the sketch below)
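A minimal sketch of such a query (the table local.db.events, its columns, and the literal values are hypothetical; the point is that partition pruning handles day/account/country while per-file statistics prune files for the name predicate):
scala> spark.sql("""
  SELECT * FROM local.db.events
  WHERE day = '2023-06-13' AND account_id = 42 AND country = 'US' AND name = 'Jon'
""").show()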
20. Example – create catalog "local"
# use --packages or add the Iceberg runtime jar to the Spark jars
# the same configuration works for spark-sql
./spark-shell --packages org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.2.0 \
  --conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \
  --conf spark.sql.catalog.spark_catalog.type=hive \
  --conf spark.sql.catalog.local=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.local.type=hadoop \
  --conf spark.sql.catalog.local.warehouse=$PWD/warehouse \
  --conf spark.sql.defaultCatalog=local \
  --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
21. Create new table
scala> spark.sql("""
CREATE TABLE local.db.demoTbl2 (id int, name string, ts timestamp)
USING iceberg
PARTITIONED BY (bucket(16, id), days(ts))
""")
• This example is missing a write order ("WRITE ORDERED BY", shown later)
29. Snapshot – Time travel – versions (GDPR…)
scala> spark.table("local.db.demoTbl2.snapshots").show();
scala> spark.sql("select * from local.db.demoTbl2 VERSION AS OF 5581909029793327227").show();
+---+----+--------------------+
| id|name| ts|
+---+----+--------------------+
| 1| A|2019-06-13 13:22:...|
| 2| B|2023-06-13 13:22:...|
+---+----+--------------------+
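Time travel by timestamp is also possible; a minimal sketch (the timestamp literal is just an example):
scala> spark.sql("select * from local.db.demoTbl2 TIMESTAMP AS OF '2023-06-13 14:00:00'").show();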
30. Statistics
Statistics help Iceberg filter files, since it saves per-file min/max values for each column...
If the data is sorted, the statistics give better results (sketch below).
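The per-file min/max values can be inspected through the table's "files" metadata table; a minimal sketch:
scala> spark.table("local.db.demoTbl2.files").select("file_path", "record_count", "lower_bounds", "upper_bounds").show(false)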
32. Sort: Global vs Local
A global sort requires repartitioning the data, so the entire dataset will be shuffled.
It's not about sorting the data already in S3; it's about how new data is inserted into partitions.
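A minimal DataFrame sketch of the difference (df is a placeholder DataFrame with name and ts columns):
// global sort: range-partitions by the sort keys, so the whole dataset is shuffled
scala> val globalSorted = df.orderBy("name", "ts")
// local sort: each task sorts only its own partition, no shuffle
scala> val localSorted = df.sortWithinPartitions("name", "ts")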
33. Order – how data is written by Spark (shuffle)
scala> spark.sql("ALTER TABLE local.db.demoTbl2 WRITE ORDERED BY name,ts")
https://iceberg.apache.org/docs/latest/spark-ddl/#alter-table--write-ordered-by
https://iceberg.apache.org/docs/latest/spark-writes/#writing-to-partitioned-tables
40. Maintenance - Compaction by rewrite (per partition)
• It is recommended to run compaction before cleaning expired snapshots when
using MOR (merge-on-read); see the sketch below
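A minimal sketch of that order on the demo table (the older_than value is just an example):
// 1. compact data files first (the sort variant is on the next slide)
scala> spark.sql("CALL local.system.rewrite_data_files(table => 'db.demoTbl2')")
// 2. then expire old snapshots
scala> spark.sql("CALL local.system.expire_snapshots(table => 'db.demoTbl2', older_than => TIMESTAMP '2023-06-01 00:00:00')")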
41. Compaction with Sort – rewrite per partition
scala> spark.sql("CALL local.system.rewrite_data_files(table => 'db.demoTbl2', strategy => 'sort', sort_order => 'zorder(name,ts)')")
rewrite_data_files – combines small files into larger files to reduce metadata overhead and runtime file-open cost.
rewrite_manifests – rewrites manifest files to optimize scan planning.
There are options that control the minimum number of files to use during a rewrite.
https://iceberg.apache.org/docs/latest/spark-procedures/
https://iceberg.apache.org/javadoc/1.2.0/org/apache/iceberg/actions/BinPackStrategy.html#field.summary
Note: each partition is sorted, but the table as a whole is not.
42. Order vs rewrite
• Using ORDERED BY – new files are added in the required order
• Using rewrite with the sort strategy to handle old data (and aggregation
for hourly / daily / weekly …) – compacts partitions + the data per
partition will be sorted
• Advantage of rewrite with sort and statistics on the name column:
• a search for all users where name = 'Jon' will scan a small number of
Parquet files per partition*, since Iceberg can filter out files that contain no names starting with
"J"
43. Reduce small files in the first place
• write.distribution-mode (sketch below)
• https://www.youtube.com/watch?v=4bOCDP-rhuM
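For example, the mode can be set as a table property; a minimal sketch ('hash' is one of none/hash/range):
scala> spark.sql("ALTER TABLE local.db.demoTbl2 SET TBLPROPERTIES ('write.distribution-mode' = 'hash')")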
44. Spark job execution – recap
A single computation unit performed on a single data partition is called a task. It is computed on a single core of the worker node.
A wide transformation needs data shuffling between nodes, which means data must be exchanged between nodes over the network (shuffle).
A wide transformation marks the end of a stage, and the next stage starts (see the sketch below).
https://www.analyticsvidhya.com/blog/2022/09/all-about-spark-jobs-stages-and-tasks/
https://www.linkedin.com/pulse/demystifying-spark-jobs-stages-data-shuffling-shahzad-aslam/
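A minimal sketch of a wide transformation that ends a stage (df is a placeholder DataFrame with an account_id column):
// groupBy is a wide transformation: rows with the same key must be shuffled
// to the same task, so a new stage starts after it
scala> val counts = df.groupBy("account_id").count()
scala> counts.show()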
45. Spark write data modes
• Shuffle the data by partition id (good if the partitions are evenly distributed)
• Range partitioning (good to mitigate data skew)
• Nothing (good for few partitions, otherwise may lead to a small-files problem)
https://medium.com/@ghoshsiddharth25/partitioning-vs-bucketing-in-apache-spark-a37b342082e4
46. • The impact of setting a table with "ORDERED BY" --> sets write.distribution-mode: range
• The impact of setting a table with "LOCALLY ORDERED BY" --> sets write.distribution-mode: none
• https://stackoverflow.com/questions/74951477/avoid-shuffling-when-inserting-into-sorted-iceberg-table
• df = spark.createDataFrame([(i, i*4) for i in range(100000)], ["a", "b"]).coalesce(1).sortWithinPartitions("a", "b")
df.writeTo("datalakelocal.ixanezis.table").append()
• If you prefer to avoid a shuffle when inserting 'sorted' data – set the order within each task, not across
tasks, using LOCALLY ORDERED BY (sketch below)
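A minimal sketch of setting a local write order on the demo table:
scala> spark.sql("ALTER TABLE local.db.demoTbl2 WRITE LOCALLY ORDERED BY name, ts")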
49. Main features
• Changes at the file level, not the partition level
• Atomic read/write (read stable data)
• Faster planning and execution
• Hidden partitioning
• Schema evolution
• Time travel
• Compaction and cleanup of old data
53. If our queries only need to filter by day and account,
if we don't have updates on the data,
if we don't need versions of the data,
…
we can continue without Iceberg.
But if we need them, Iceberg can provide these capabilities.