SlideShare a Scribd company logo
1 of 25
Download to read offline
Zesty journey to adopt
Apache Iceberg
Eran Levy
@levyeran
Eran Levy
Data & Platform Group Lead @Zesty
https://levyeran.medium.com/
@levyeran
Introduction
@levyeran
How can you utilize Iceberg on AWS with no Spark
expertise in your team and going serverless all-in?
WIFM
@levyeran
Build Fast.
Stay Cost Efficient.
300%
Customer
Growth
YoY
$120M
Raised
Since
2020
406%
Employee
Growth
YoY
The Goal - Database Explorer
@levyeran
Medallion Architecture
@levyeran
Why did we choose
Apache Iceberg?
@levyeran
- Open data table format widely adopted and integrates well
with AWS ecosystem (Glue catalog, Athena, etc).
- Table evolution - Mainly schema and partitioning layout
(particularly hidden partitioning).
- Integrating well with many processing engines - supports our
long term strategy in leveraging the right technologies to their
needs.
While there are many cool
things in Iceberg,
There are some challenges…
The main challenge is:
Maintenance
@levyeran
Architecture
@levyeran
Table Configuration
@levyeran
- Iceberg v2 table, created with AWS Glue catalog and
Athena engine version 3 (preferably a dedicated
WorkGroup).
- Parquet with ZSTD compression - this is the data format we
adopted across our data lake.
- Snapshot age - 2 days (default is 5 days).
Athena allows predefined key-value TBLPROPERTIES only.
Glue catalog - Metadata tracking
@levyeran
Table Maintenance
Main maintenance operations for optimizing Iceberg table in
Athena:
1. VACUUM
2. OPTIMIZE
Table Maintenance
@levyeran
We are updating our Iceberg table frequently (every minute,
5GBs, insert/update, 50 columns, 500M records)…
So we wanted to VACUUM but were hitting the Athena query
limits:
Table Maintenance
@levyeran
Increasing the limits didn’t help much because we were hitting another :
ICEBERG_VACUUM_MORE_RUNS_NEEDED: Removed 1000 files in this round of vacuum, but
there are more files remaining. Please run another VACUUM command to process the
remaining files
You can try overcome it by running AWS Step Functions in a loop like this suggested solution
Missing several runs and you will face another challenge as increasing Athena query limits
won’t help you much this time…
Table Maintenance
@levyeran
Same for OPTIMIZE but were hitting the partitions limitation:
Glue Spark ETL Jobs
@levyeran
In order to solve it for the long run, we decided to utilize the
Iceberg Spark procedures in order to perform our maintenance
jobs:
- Glue 3.0 and later supports Iceberg integration out of the
box
- Ad-hoc & built-in scheduler
- Integrated with CI/CD pipeline using AWS SDKs
Nice AWS blog and an AWS Glue Developer Guide are available
Glue Spark Maintenance Jobs
@levyeran
Basically the most important steps to perform are:
- Register the Iceberg connector for AWS Glue (Not required
for Glue 4.0)
- Create ETL Job or a Jupyter Notebook
- Provide the necessary configuration to the Spark
job/notebook such as: –datalake-formats and –conf
NOTE: these actions automatically inject the Iceberg Spark SQL
extension
@levyeran
Full example is available here: https://github.com/eran-levy/iceberg-journey-session-examples
Glue Spark Maintenance Jobs
Glue Spark Maintenance Jobs
@levyeran
Main maintenance procedures:
● Expire_snapshots
● Rewrite_data_files
● Remove_orphan_files
● Rewrite_manifests
QuickSight for Apache Iceberg metadata analysis
@levyeran
Not
Optimized
Optimized!
Snapshots after optimization
@levyeran
Snapshots after expiration procedure
@levyeran
QuickSight for Apache Iceberg data files analysis
@levyeran
Summary
● Apache Iceberg is well adopted in the industry and
specifically in the AWS ecosystem.
● It's not persist & forget -> take Iceberg maintenance into
consideration while choosing your architecture.
● Keep monitoring -> your partitioning strategy might
change, file size, query latencies, etc. as there are many
moving parts that can impact your performance.
Next Steps
● Choosing our data lakehouse platform
● Maintenance is an issue as we scale to additional use
cases with larger data volume - we might need a
managed service to assist us here

More Related Content

Similar to Zesty journey to adopt Apache Iceberg

Using Delta Lake to Transform a Legacy Apache Spark to Support Complex Update...
Using Delta Lake to Transform a Legacy Apache Spark to Support Complex Update...Using Delta Lake to Transform a Legacy Apache Spark to Support Complex Update...
Using Delta Lake to Transform a Legacy Apache Spark to Support Complex Update...Databricks
 
Delivering Insights from 20M+ Smart Homes with 500M+ Devices
Delivering Insights from 20M+ Smart Homes with 500M+ DevicesDelivering Insights from 20M+ Smart Homes with 500M+ Devices
Delivering Insights from 20M+ Smart Homes with 500M+ DevicesDatabricks
 
Talend Summer '17 Release: New Features and Tech Overview
Talend Summer '17 Release: New Features and Tech OverviewTalend Summer '17 Release: New Features and Tech Overview
Talend Summer '17 Release: New Features and Tech OverviewTalend
 
Healthcare Claim Reimbursement using Apache Spark
Healthcare Claim Reimbursement using Apache SparkHealthcare Claim Reimbursement using Apache Spark
Healthcare Claim Reimbursement using Apache SparkDatabricks
 
Migrate a successful transactional database to azure
Migrate a successful transactional database to azureMigrate a successful transactional database to azure
Migrate a successful transactional database to azureIke Ellis
 
Azure Nights August2017
Azure Nights August2017Azure Nights August2017
Azure Nights August2017Michael Frank
 
Unity Connect - Getting SQL Spinning with SharePoint - Best Practices for the...
Unity Connect - Getting SQL Spinning with SharePoint - Best Practices for the...Unity Connect - Getting SQL Spinning with SharePoint - Best Practices for the...
Unity Connect - Getting SQL Spinning with SharePoint - Best Practices for the...Knut Relbe-Moe [MVP, MCT]
 
Oracle on AWS partner webinar series
Oracle on AWS partner webinar series Oracle on AWS partner webinar series
Oracle on AWS partner webinar series Tom Laszewski
 
Elk ruminating on logs
Elk ruminating on logsElk ruminating on logs
Elk ruminating on logsMathew Beane
 
How to collect Google Analytics events to your own data warehouse and do it o...
How to collect Google Analytics events to your own data warehouse and do it o...How to collect Google Analytics events to your own data warehouse and do it o...
How to collect Google Analytics events to your own data warehouse and do it o...Alex Levashov
 
J1 T1 3 - Azure Data Lake store & analytics 101 - Kenneth M. Nielsen
J1 T1 3 - Azure Data Lake store & analytics 101 - Kenneth M. NielsenJ1 T1 3 - Azure Data Lake store & analytics 101 - Kenneth M. Nielsen
J1 T1 3 - Azure Data Lake store & analytics 101 - Kenneth M. NielsenMS Cloud Summit
 
ELK Ruminating on Logs (Zendcon 2016)
ELK Ruminating on Logs (Zendcon 2016)ELK Ruminating on Logs (Zendcon 2016)
ELK Ruminating on Logs (Zendcon 2016)Mathew Beane
 
Modernizing your database with SQL Server 2019
Modernizing your database with SQL Server 2019Modernizing your database with SQL Server 2019
Modernizing your database with SQL Server 2019Antonios Chatzipavlis
 
01 demystifying mysq-lfororacledbaanddeveloperv1
01 demystifying mysq-lfororacledbaanddeveloperv101 demystifying mysq-lfororacledbaanddeveloperv1
01 demystifying mysq-lfororacledbaanddeveloperv1Ivan Ma
 
Global Data Replication with Galera for Ansell Guardian®
Global Data Replication with Galera for Ansell Guardian®Global Data Replication with Galera for Ansell Guardian®
Global Data Replication with Galera for Ansell Guardian®MariaDB plc
 
Azure + DataStax Enterprise Powers Office 365 Per User Store
Azure + DataStax Enterprise Powers Office 365 Per User StoreAzure + DataStax Enterprise Powers Office 365 Per User Store
Azure + DataStax Enterprise Powers Office 365 Per User StoreDataStax Academy
 
Peteris Arajs - Where is my data
Peteris Arajs - Where is my dataPeteris Arajs - Where is my data
Peteris Arajs - Where is my dataAndrejs Vorobjovs
 
Serverless Data Lake on AWS
Serverless Data Lake on AWSServerless Data Lake on AWS
Serverless Data Lake on AWSThanh Nguyen
 

Similar to Zesty journey to adopt Apache Iceberg (20)

Using Delta Lake to Transform a Legacy Apache Spark to Support Complex Update...
Using Delta Lake to Transform a Legacy Apache Spark to Support Complex Update...Using Delta Lake to Transform a Legacy Apache Spark to Support Complex Update...
Using Delta Lake to Transform a Legacy Apache Spark to Support Complex Update...
 
Delivering Insights from 20M+ Smart Homes with 500M+ Devices
Delivering Insights from 20M+ Smart Homes with 500M+ DevicesDelivering Insights from 20M+ Smart Homes with 500M+ Devices
Delivering Insights from 20M+ Smart Homes with 500M+ Devices
 
Talend Summer '17 Release: New Features and Tech Overview
Talend Summer '17 Release: New Features and Tech OverviewTalend Summer '17 Release: New Features and Tech Overview
Talend Summer '17 Release: New Features and Tech Overview
 
Healthcare Claim Reimbursement using Apache Spark
Healthcare Claim Reimbursement using Apache SparkHealthcare Claim Reimbursement using Apache Spark
Healthcare Claim Reimbursement using Apache Spark
 
Evolve18 | Ameeth Palla | Optimizing Your Assets Implementation
Evolve18 | Ameeth Palla | Optimizing Your Assets ImplementationEvolve18 | Ameeth Palla | Optimizing Your Assets Implementation
Evolve18 | Ameeth Palla | Optimizing Your Assets Implementation
 
Migrate a successful transactional database to azure
Migrate a successful transactional database to azureMigrate a successful transactional database to azure
Migrate a successful transactional database to azure
 
Azure Nights August2017
Azure Nights August2017Azure Nights August2017
Azure Nights August2017
 
Unity Connect - Getting SQL Spinning with SharePoint - Best Practices for the...
Unity Connect - Getting SQL Spinning with SharePoint - Best Practices for the...Unity Connect - Getting SQL Spinning with SharePoint - Best Practices for the...
Unity Connect - Getting SQL Spinning with SharePoint - Best Practices for the...
 
Oracle on AWS partner webinar series
Oracle on AWS partner webinar series Oracle on AWS partner webinar series
Oracle on AWS partner webinar series
 
Elk ruminating on logs
Elk ruminating on logsElk ruminating on logs
Elk ruminating on logs
 
Introduction to Azure Data Lake
Introduction to Azure Data LakeIntroduction to Azure Data Lake
Introduction to Azure Data Lake
 
How to collect Google Analytics events to your own data warehouse and do it o...
How to collect Google Analytics events to your own data warehouse and do it o...How to collect Google Analytics events to your own data warehouse and do it o...
How to collect Google Analytics events to your own data warehouse and do it o...
 
J1 T1 3 - Azure Data Lake store & analytics 101 - Kenneth M. Nielsen
J1 T1 3 - Azure Data Lake store & analytics 101 - Kenneth M. NielsenJ1 T1 3 - Azure Data Lake store & analytics 101 - Kenneth M. Nielsen
J1 T1 3 - Azure Data Lake store & analytics 101 - Kenneth M. Nielsen
 
ELK Ruminating on Logs (Zendcon 2016)
ELK Ruminating on Logs (Zendcon 2016)ELK Ruminating on Logs (Zendcon 2016)
ELK Ruminating on Logs (Zendcon 2016)
 
Modernizing your database with SQL Server 2019
Modernizing your database with SQL Server 2019Modernizing your database with SQL Server 2019
Modernizing your database with SQL Server 2019
 
01 demystifying mysq-lfororacledbaanddeveloperv1
01 demystifying mysq-lfororacledbaanddeveloperv101 demystifying mysq-lfororacledbaanddeveloperv1
01 demystifying mysq-lfororacledbaanddeveloperv1
 
Global Data Replication with Galera for Ansell Guardian®
Global Data Replication with Galera for Ansell Guardian®Global Data Replication with Galera for Ansell Guardian®
Global Data Replication with Galera for Ansell Guardian®
 
Azure + DataStax Enterprise Powers Office 365 Per User Store
Azure + DataStax Enterprise Powers Office 365 Per User StoreAzure + DataStax Enterprise Powers Office 365 Per User Store
Azure + DataStax Enterprise Powers Office 365 Per User Store
 
Peteris Arajs - Where is my data
Peteris Arajs - Where is my dataPeteris Arajs - Where is my data
Peteris Arajs - Where is my data
 
Serverless Data Lake on AWS
Serverless Data Lake on AWSServerless Data Lake on AWS
Serverless Data Lake on AWS
 

Recently uploaded

代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改atducpo
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiSuhani Kapoor
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Callshivangimorya083
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Serviceranjana rawat
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...Suhani Kapoor
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptxthyngster
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...Florian Roscheck
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptxAnupama Kate
 
Digi Khata Problem along complete plan.pptx
Digi Khata Problem along complete plan.pptxDigi Khata Problem along complete plan.pptx
Digi Khata Problem along complete plan.pptxTanveerAhmed817946
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Delhi Call girls
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz1
 
Unveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystUnveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystSamantha Rae Coolbeth
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Sapana Sha
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...Suhani Kapoor
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsappssapnasaifi408
 
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...shivangimorya083
 

Recently uploaded (20)

代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 
E-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptxE-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptx
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
Digi Khata Problem along complete plan.pptx
Digi Khata Problem along complete plan.pptxDigi Khata Problem along complete plan.pptx
Digi Khata Problem along complete plan.pptx
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
Unveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystUnveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data Analyst
 
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
 
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
 
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...
 

Zesty journey to adopt Apache Iceberg

  • 1. Zesty journey to adopt Apache Iceberg Eran Levy @levyeran
  • 2. Eran Levy Data & Platform Group Lead @Zesty https://levyeran.medium.com/ @levyeran Introduction @levyeran
  • 3. How can you utilize Iceberg on AWS with no Spark expertise in your team and going serverless all-in? WIFM @levyeran
  • 6. The Goal - Database Explorer @levyeran
  • 8. Why did we choose Apache Iceberg? @levyeran - Open data table format widely adopted and integrates well with AWS ecosystem (Glue catalog, Athena, etc). - Table evolution - Mainly schema and partitioning layout (particularly hidden partitioning). - Integrating well with many processing engines - supports our long term strategy in leveraging the right technologies to their needs.
  • 9. While there are many cool things in Iceberg, There are some challenges… The main challenge is: Maintenance @levyeran
  • 11. Table Configuration @levyeran - Iceberg v2 table, created with AWS Glue catalog and Athena engine version 3 (preferably a dedicated WorkGroup). - Parquet with ZSTD compression - this is the data format we adopted across our data lake. - Snapshot age - 2 days (default is 5 days). Athena allows predefined key-value TBLPROPERTIES only. Glue catalog - Metadata tracking
  • 12. @levyeran Table Maintenance Main maintenance operations for optimizing Iceberg table in Athena: 1. VACUUM 2. OPTIMIZE
  • 13. Table Maintenance @levyeran We are updating our Iceberg table frequently (every minute, 5GBs, insert/update, 50 columns, 500M records)… So we wanted to VACUUM but were hitting the Athena query limits:
  • 14. Table Maintenance @levyeran Increasing the limits didn’t help much because we were hitting another : ICEBERG_VACUUM_MORE_RUNS_NEEDED: Removed 1000 files in this round of vacuum, but there are more files remaining. Please run another VACUUM command to process the remaining files You can try overcome it by running AWS Step Functions in a loop like this suggested solution Missing several runs and you will face another challenge as increasing Athena query limits won’t help you much this time…
  • 15. Table Maintenance @levyeran Same for OPTIMIZE but were hitting the partitions limitation:
  • 16. Glue Spark ETL Jobs @levyeran In order to solve it for the long run, we decided to utilize the Iceberg Spark procedures in order to perform our maintenance jobs: - Glue 3.0 and later supports Iceberg integration out of the box - Ad-hoc & built-in scheduler - Integrated with CI/CD pipeline using AWS SDKs Nice AWS blog and an AWS Glue Developer Guide are available
  • 17. Glue Spark Maintenance Jobs @levyeran Basically the most important steps to perform are: - Register the Iceberg connector for AWS Glue (Not required for Glue 4.0) - Create ETL Job or a Jupyter Notebook - Provide the necessary configuration to the Spark job/notebook such as: –datalake-formats and –conf NOTE: these actions automatically inject the Iceberg Spark SQL extension
  • 18. @levyeran Full example is available here: https://github.com/eran-levy/iceberg-journey-session-examples Glue Spark Maintenance Jobs
  • 19. Glue Spark Maintenance Jobs @levyeran Main maintenance procedures: ● Expire_snapshots ● Rewrite_data_files ● Remove_orphan_files ● Rewrite_manifests
  • 20. QuickSight for Apache Iceberg metadata analysis @levyeran Not Optimized Optimized!
  • 22. Snapshots after expiration procedure @levyeran
  • 23. QuickSight for Apache Iceberg data files analysis @levyeran
  • 24. Summary ● Apache Iceberg is well adopted in the industry and specifically in the AWS ecosystem. ● It's not persist & forget -> take Iceberg maintenance into consideration while choosing your architecture. ● Keep monitoring -> your partitioning strategy might change, file size, query latencies, etc. as there are many moving parts that can impact your performance.
  • 25. Next Steps ● Choosing our data lakehouse platform ● Maintenance is an issue as we scale to additional use cases with larger data volume - we might need a managed service to assist us here