Iceberg – introduction
Big Data 2.0
Dori Waldman
Who supports the Iceberg spec?
• Popular vendors implement the Iceberg spec; it is better to verify what and how each of them supports it
• For example, Athena has a VACUUM operation for cleanup
• Some vendors provide automation tools to handle cleanup
• This presentation focuses on Spark and Iceberg
Why do we need to care about it?
• We have a bucket in S3, partitioned by day and account_id, with Parquet files; each file has millions of rows.
• I even do hourly --> daily --> weekly aggregation
• Parquet is a columnar format (binary, reads less data, compression, optimized for big-data queries)
Partition ("hive partition") is a good
start
• Reduce full table scan (s3 bucket- directory list) - search in specific path
• Atomic updates (overwrite all partition)
• Not works well with delta changes (can't update one file in partition)
• Reading from partition during overwrite - return nonstable data
• Orphan files – if you have and you are not aware - return nonstable data
• Might require to add custom column (Day/Hour from ts) for partition
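A minimal sketch of how such a Hive-partitioned layout is typically written with Spark; the df DataFrame with day/account_id columns and the bucket path are assumptions, not from the deck:

// assumes df has day and account_id columns; s3://my-bucket/events/ is a hypothetical path
// with the default partitionOverwriteMode ("static"), mode("overwrite") replaces all partitions
// under the path, not only those present in df - one reason delta changes are painful here
scala> df.write.partitionBy("day", "account_id").mode("overwrite").parquet("s3://my-bucket/events/")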
But I want more power
• I want to be able to query the data faster (in addition to partitioning)
• I want ACID operations - if a user reads from S3 during a write, they will get stable data ("versions")
• I want to update or delete one row without overwriting the whole partition ("delta")
• I don't want to create new day/hour columns (for partitioning) from the event timestamp
• I want to be able to change the schema (add a column, rename a column) without breaking anything
• I have a new requirement: repartition by month, but I don't want to change the old data's structure.
We can add Bucket - "hash for pre-shuffle"
Why bucketing improves join queries ... (see the sketch below)
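A minimal sketch of bucketing in Spark (table and column names are hypothetical): two tables bucketed the same way on the join key can be joined without a shuffle, since matching keys are guaranteed to land in the same bucket.

// spark-shell auto-imports spark.implicits._, so $"..." works
scala> val df = spark.range(0, 1000000).withColumn("accountId", $"id" % 100)
// writes 16 buckets per partition; joining two tables bucketed with the same
// column and bucket count skips the shuffle ("pre-shuffle")
scala> df.write.bucketBy(16, "accountId").sortBy("accountId").saveAsTable("events_bucketed")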
Bucket
https://towardsdatascience.com/best-practices-for-bucketing-in-spark-sql-ea9f23f7dd53
Bucket with partition
Iceberg – can partition by bucket
Bucket vs partition
What is Iceberg? A table format
• It is metadata about the Parquet files – the Parquet data is no longer a black box for the engines.
• Metadata is used to enable new capabilities
File architecture
Read flow
1. From the catalog, get the current snapshot version
2. From the snapshot, get the list of metadata files
3. Read the metadata files and filter according to partitions (including hidden partitions) and statistics
4. Read only the relevant files instead of listing files in partitions
• Let's say I want to find all events where the user is "Jon", I only partition by day and account_id, and I added a bucket by country
• Without Iceberg it is a full table scan (instead of the whole S3 bucket I can reduce to a specific day/account/country, but that is all)
• With Iceberg I can also reduce the number of files that will be read (explained later how)
Example – create catalog "local"
// use --packages or add the Iceberg jar to the Spark jars directory
// the same flags work for spark-sql
./spark-shell --packages org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.2.0 \
--conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \
--conf spark.sql.catalog.spark_catalog.type=hive \
--conf spark.sql.catalog.local=org.apache.iceberg.spark.SparkCatalog \
--conf spark.sql.catalog.local.type=hadoop \
--conf spark.sql.catalog.local.warehouse=$PWD/warehouse \
--conf spark.sql.defaultCatalog=local \
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
Create new table
scala> spark.sql("""
CREATE TABLE local.db.demoTbl2 (id int, name string, ts timestamp)
USING iceberg
PARTITIONED BY (bucket(16, id), days(ts))
""")
• This example is missing a write order ("WRITE ORDERED BY", covered later)
Insert data
scala> spark.sql("""
insert into local.db.demoTbl2 (id, name, ts)
values
(1, "A", cast(date_format('2019-06-13 13:22:30.521000000', 'yyyy-MM-dd HH:mm:ss.SSS') as timestamp)),
(2, "B", cast(date_format('2023-06-13 13:22:30.521000000', 'yyyy-MM-dd HH:mm:ss.SSS') as timestamp))
""");
scala> val df = spark.table("local.db.demoTbl2");
scala> df.show();
+---+----+--------------------+
| id|name| ts|
+---+----+--------------------+
| 1| A|2019-06-13 13:22:...|
| 2| B|2023-06-13 13:22:...|
+---+----+--------------------+
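With the table partitioned by days(ts), a filter on ts alone is enough for partition pruning - no extra day column is needed (the "hidden partition" feature; a sketch on the demo table above):

// Iceberg maps the ts predicate to the days(ts) partition values and skips the 2019 partition
scala> spark.sql("select * from local.db.demoTbl2 where ts >= to_timestamp('2023-01-01')").show();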
Other options to insert data – “merge into”
https://iceberg.apache.org/docs/latest/spark-writes/#merge-into
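A minimal MERGE INTO sketch against the demo table (the inline source subquery is an assumption for illustration):

scala> spark.sql("""
MERGE INTO local.db.demoTbl2 t
USING (select 2 as id, 'C' as name, current_timestamp() as ts) u
ON t.id = u.id
WHEN MATCHED THEN UPDATE SET t.name = u.name
WHEN NOT MATCHED THEN INSERT *
""");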
Update / delete
scala> spark.sql("update local.db.demoTbl2 set name='D' where id=2");
scala> val df = spark.table("local.db.demoTbl2");
scala> df.show();
+---+----+--------------------+
| id|name| ts|
+---+----+--------------------+
| 2| D|2023-06-13 13:22:...|
| 1| A|2019-06-13 13:22:...|
+---+----+--------------------+
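Delete works the same way as update (a sketch on the demo table):

scala> spark.sql("delete from local.db.demoTbl2 where id=1");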
COW vs MOR
Work is done during the write in order to have fast reads.
MOR delete files have 2 flavors:
• By position (position deletes)
• By equality (equality deletes)
COW
• With COW, a new Parquet file is created with the full new data, not just the diff of the changes
Old parquet file
New parquet file with all the data, not only the diff
MOR
./spark-shell --packages org.apache.spark:spark-avro_2.12:2.4.4
New parquet file with the changes
New avro file which marks what needs to be removed (using the position approach)
New parquet file with the diff only
Old parquet file
Diff metadata
Settings
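A sketch of the table properties that switch a table from the default COW to MOR per operation (property names are from the Iceberg configuration docs):

scala> spark.sql("""
ALTER TABLE local.db.demoTbl2 SET TBLPROPERTIES (
  'write.delete.mode'='merge-on-read',
  'write.update.mode'='merge-on-read',
  'write.merge.mode'='merge-on-read'
)""");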
Snapshot – time travel – versions (GDPR…)
scala> spark.table("local.db.demoTbl2.snapshots").show();
scala> spark.sql("select * from local.db.demoTbl2 VERSION AS OF 5581909029793327227").show();
+---+----+--------------------+
| id|name| ts|
+---+----+--------------------+
| 1| A|2019-06-13 13:22:...|
| 2| B|2023-06-13 13:22:...|
+---+----+--------------------+
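Time travel also works by timestamp, and a table can be rolled back to an old snapshot (a sketch; the snapshot id is the one taken from the snapshots listing above):

scala> spark.sql("select * from local.db.demoTbl2 TIMESTAMP AS OF '2023-06-13 14:00:00'").show();
scala> spark.sql("CALL local.system.rollback_to_snapshot('db.demoTbl2', 5581909029793327227)");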
Statistics
Statistics help Iceberg filter files, as it saves per-file min/max values per column...
If the data is sorted, it will provide better results (see the sketch below)
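The per-file statistics can be inspected through the files metadata table (a sketch on the demo table):

// lower_bounds / upper_bounds hold the per-column min/max values Iceberg uses to skip files
scala> spark.sql("select file_path, record_count, lower_bounds, upper_bounds from local.db.demoTbl2.files").show(false);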
Sort advantage
Sort: global vs local
A global sort requires repartitioning the data, so the entire dataset will be shuffled
It's not about sorting data already in S3 - it's about how new data is inserted into the partitions
Order – how data is written by spark (shuffle)
scala> spark.sql("ALTER TABLE local.db.demoTbl2 WRITE ORDERED BY name,ts")
https://iceberg.apache.org/docs/latest/spark-ddl/#alter-table--write-ordered-by
https://iceberg.apache.org/docs/latest/spark-writes/#writing-to-partitioned-tables
Order
https://iceberg.apache.org/docs/latest/spark-writes/#writing-to-partitioned-tables
Order by in query result
Maintenance
• Each change creates a new metadata.json file, snapshot, and manifest
• It is recommended to merge metadata files and clean up old files (see the procedures sketched below)
Json files
Schedule jobs
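A sketch of the maintenance procedures typically scheduled for this (the timestamp and retention values are arbitrary examples):

scala> spark.sql("CALL local.system.expire_snapshots(table => 'db.demoTbl2', older_than => TIMESTAMP '2023-06-30 00:00:00', retain_last => 5)");
scala> spark.sql("CALL local.system.remove_orphan_files(table => 'db.demoTbl2')");
scala> spark.sql("CALL local.system.rewrite_manifests('db.demoTbl2')");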
Maintenance - Compaction by rewrite (per partition)
• It is recommended to run compaction before expiring old snapshots when using MOR
Compaction with sort – rewrite per partition
scala> spark.sql("CALL local.system.rewrite_data_files(table => 'db.demoTbl2', strategy => 'sort', sort_order => 'zorder(name,ts)')")
rewrite_data_files - combines small files into larger files to reduce metadata overhead and runtime file-open cost.
rewrite_manifests - rewrites manifest files to optimize scan planning.
There are options which control the minimum number of files to use during a rewrite (see the sketch below)
https://iceberg.apache.org/docs/latest/spark-procedures/
https://iceberg.apache.org/javadoc/1.2.0/org/apache/iceberg/actions/BinPackStrategy.html#field.summary
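A sketch of a binpack rewrite using those options (the thresholds are arbitrary examples):

scala> spark.sql("""
CALL local.system.rewrite_data_files(
  table => 'db.demoTbl2',
  strategy => 'binpack',
  options => map('min-input-files', '5', 'target-file-size-bytes', '536870912')
)""");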
Each partition is sorted; the table as a whole is not sorted
Order vs rewrite
• Using order by – new files are added in the required order
• Using rewrite with the sort strategy to handle old data (plus aggregation for hourly / daily / weekly …) – compacts the partitions, and the data per partition will be sorted
• Advantage of rewrite with sort plus statistics on the name column:
• searching for all users where name = 'Jon' --> Iceberg will scan a small number of parquet files per partition, as it can filter out files whose bounds contain no names starting with "J"
Reduce small files in the first place
• write.distribution-mode (see the sketch below)
• https://www.youtube.com/watch?v=4bOCDP-rhuM
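A sketch of setting the distribution mode explicitly as a table property (valid values: none / hash / range):

scala> spark.sql("ALTER TABLE local.db.demoTbl2 SET TBLPROPERTIES ('write.distribution-mode'='hash')");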
Spark job execution – recap
A single computation unit performed on a single data partition is called a task. It is computed on a single core of a worker node.
A wide transformation needs data shuffling, which means data must be exchanged between nodes over the network (shuffle).
A wide transformation marks the end of a stage, and the next stage starts.
https://www.analyticsvidhya.com/blog/2022/09/all-about-spark-jobs-stages-and-tasks/
https://www.linkedin.com/pulse/demystifying-spark-jobs-stages-data-shuffling-shahzad-aslam/
• hash – shuffle the data by partition id (good if the partitions are evenly distributed)
• range – range partitioning (good to mitigate data skew)
• none – nothing (good for a few partitions, otherwise may lead to the small-files problem)
Spark write data modes
https://medium.com/@ghoshsiddharth25/partitioning-vs-bucketing-in-apache-spark-a37b342082e4
• Setting the table with "ORDERED BY" --> sets write.distribution-mode: range
• Setting the table with "LOCALLY ORDERED BY" --> sets write.distribution-mode: none
• https://stackoverflow.com/questions/74951477/avoid-shuffling-when-inserting-into-sorted-iceberg-table
• df = spark.createDataFrame([(i, i*4) for i in range(100000)], ["a", "b"]).coalesce(1).sortWithinPartitions("a", "b")
df.writeTo("datalakelocal.ixanezis.table").append()
• If you prefer to avoid a shuffle while inserting 'sorted' data - set the order within each task, not across tasks: use LOCALLY ORDERED BY (see the sketch below)
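A sketch of the corresponding DDL on the demo table:

scala> spark.sql("ALTER TABLE local.db.demoTbl2 WRITE LOCALLY ORDERED BY name, ts");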
Spark write data modes
Migration options
Migration + change data
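A sketch of the migration procedures Iceberg provides for this (source/target table names are hypothetical):

// create an Iceberg table that shadows an existing Hive/Parquet table, leaving the source untouched
scala> spark.sql("CALL local.system.snapshot('spark_catalog.db.src_tbl', 'local.db.dest_tbl')");
// replace the source table in place with an Iceberg table
scala> spark.sql("CALL local.system.migrate('spark_catalog.db.src_tbl')");
// register existing parquet files into an existing Iceberg table
scala> spark.sql("CALL local.system.add_files(table => 'db.dest_tbl', source_table => '`parquet`.`s3://my-bucket/path`')");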
Main features
• Changes at the file level, not the partition level
• Atomic reads/writes (read stable data)
• Faster planning and execution
• Hidden partitioning
• Schema evolution
• Time travel
• Compaction and cleanup of old data
Alternative
Big data 2.0 definition
• When you need to handle a massive amount of data
• When you need to handle a massive amount of metadata
Resources
• https://www.dremio.com/blog/apache-iceberg-101-your-guide-to-learning-apache-iceberg-concepts-and-practices/
• https://www.youtube.com/watch?v=4bOCDP-rhuM
• https://www.youtube.com/watch?v=CyhdGnqIf9o
• https://iceberg.apache.org/docs/latest/maintenance/
• https://www.dremio.com/blog/compaction-in-apache-iceberg-fine-tuning-your-iceberg-tables-data-files/
• https://iceberg.apache.org/docs/latest/evolution/
• https://iceberg.apache.org/docs/latest/configuration/
• https://iceberg.apache.org/docs/latest/spark-writes/#Spark
• https://www.dremio.com/blog/how-z-ordering-in-apache-iceberg-helps-improve-performance/
• https://towardsdatascience.com/boost-your-cloud-data-applications-with-duckdb-and-iceberg-api-67677666fbd3
• https://medium.com/snowflake/understanding-iceberg-table-metadata-b1209fbcc7c3
• https://medium.com/snowflake/how-apache-iceberg-enables-acid-compliance-for-data-lakes-9069ae783b60
• https://www.dremio.com/resources/tutorials/getting-started-with-apache-iceberg-using-aws-glue-and-dremio/
• https://www.matano.dev/blog/2022/11/04/automated-iceberg-table-maintenance
• https://www.youtube.com/watch?v=ofRoRJuirFg
• https://docs.aws.amazon.com/athena/latest/ug/querying-iceberg-data-optimization.html
• https://www.dremio.com/resources/webinars/why-and-how-netflix-created-and-migrated-to-a-new-table-format-iceberg/
• https://medium.com/getindata-blog/apache-spark-with-apache-iceberg-a-way-to-boost-your-data-pipeline-performance-and-safety-6f87364962a1
• https://towardsdatascience.com/best-practices-for-bucketing-in-spark-sql-ea9f23f7dd53
• https://developer.hpe.com/blog/tips-and-best-practices-to-take-advantage-of-spark-2x/
• https://senthilnayagan.com/apache-spark/2022/spark-bucketing-and-partitions
• https://medium.com/@ghoshsiddharth25/partitioning-vs-bucketing-in-apache-spark-a37b342082e4
• https://medium.com/nerd-for-tech/apache-spark-bucketing-and-partitioning-8feab85d5136
• https://towardsdatascience.com/about-sort-in-spark-3-x-f3699cc31008
• https://medium.com/data-arena/merging-different-schemas-in-apache-spark-2a9caca2c5ce
• https://www.linkedin.com/pulse/demystifying-spark-jobs-stages-data-shuffling-shahzad-aslam/
If our queries only need to filter by day and account,
if we don't have updates to the data,
if we don't need versions of the data,
….
we can continue without Iceberg.
But if we do need these, Iceberg can provide these capabilities.