Tuning Flink Clusters for
stability and efficiency
Divye Kapoor, Pinterest
Flink Forward 2023 ©
Starting with the end in mind…
By the end of this talk, you’ll know how
we tuned our Flink clusters to reduce
per-job costs by 50-90% (~75%
typical) and how we were able to
absorb ~40% more workload at no
extra cost for Pinterest.
25%
Cost of a job after we
were done with our work
Hi! I’m Divye Kapoor, I’m the TL for the
Stream Processing Platform at Pinterest and
I’m here presenting the work that our teams at
Pinterest have done over the past 2 years on
stability and efficiency.
Credits
Teja T. - SRE & EM
CGroups, Cluster HW, Rollouts &
optimizations at several levels of
the stack.
Thank you
Credits: Leadership & Partners
30+ teams
200+ partners
4 orgs
Actual spend vs Budgeted spend for the company ($ terms).
Let’s just say that a fair bit of money was printed for Pinterest…
Our clusters run on YARN (today); everything that follows is in that context.
So what’s challenging about running
and tuning a multi-tenant Flink cluster?
● Job sizes: 2000+ cores on a job
vs jobs with < 10 cores.
● Job tiering: small jobs that can’t
fail and other jobs that can.
● Multitenant efficiency: resource
use that isn’t wasteful.
● Multitenant priority: in an incident,
keep the right jobs working.
● Noisy neighbors
● Data skew
CGroups
● CGroups was our must-have for
everything that follows. Teja led the
charge.
1. We upgraded YARN and then
configured it to support soft CGroup
limits. (The limits only kick in if the
host is running out of capacity)
2. We verified that if a host is at capacity,
the resources are fairly shared.
3. We started running the cluster hotter
(no CPU starvation!).
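To make the soft-limit behaviour above concrete, here is a minimal Python sketch of weighted fair sharing on a single host. It is an illustrative model, not YARN or kernel code: the function names, container names and core counts are all assumptions. Under a soft limit a container may use idle CPU beyond its share, and the shares are only enforced once total demand exceeds the host's capacity.

```python
def share_cpu_soft(demands, weights, capacity):
    """Weighted fair sharing: weights only matter once the host is full."""
    alloc = {}
    active = dict(demands)          # containers still waiting for an allocation
    remaining = float(capacity)
    while active:
        total_weight = sum(weights[c] for c in active)
        fair = {c: remaining * weights[c] / total_weight for c in active}
        # Containers that want no more than their fair share get all they asked for.
        satisfied = [c for c in active if active[c] <= fair[c]]
        if not satisfied:
            # Host is contended: everyone left is capped at its weighted share.
            alloc.update(fair)
            break
        for c in satisfied:
            alloc[c] = float(active.pop(c))
            remaining -= alloc[c]
    return alloc


def share_cpu_hard(demands, limits):
    """Hard limits: each container is capped even when the host has idle cores."""
    return {c: float(min(demands[c], limits[c])) for c in demands}


# Hypothetical 16-core host; both containers requested 4 cores.
requested = {"job_a": 4, "job_b": 4}

# job_b bursts to 12 cores (for example, catching up after a deploy).
burst = {"job_a": 2, "job_b": 12}
print(share_cpu_soft(burst, requested, capacity=16))  # job_b keeps all 12 cores
print(share_cpu_hard(burst, requested))               # job_b is throttled to 4

# Both containers want the whole host: the soft limit splits it by weight.
print(share_cpu_soft({"job_a": 16, "job_b": 16}, requested, capacity=16))
```

In this model the burst is served in full while the host has spare cores, and the weights only decide the split in the last, fully contended case.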
CGroups
● Hard limits don’t work well for Flink jobs.
● Most Flink jobs want to burst on CPU on
deploys and this setup allows for the catch
up to take place without throttling.
● Hard limits can trigger OOMs,
backpressure and other stability issues.
Generally, it’s not clear if the job will come
back after a restart.
Lesson 1: Always configure
your YARN or K8s cluster to
avoid hard limits / throttles.
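As one possible reading of Lesson 1 on YARN, the sketch below renders a yarn-site.xml fragment that enables CGroups without strict CPU caps. The property names come from the Apache Hadoop NodeManager CGroups documentation. The talk does not say which settings Pinterest used, so treat the selection and the values as assumptions.

```python
import xml.etree.ElementTree as ET

# Assumed settings: CGroups-based CPU isolation with strict-resource-usage off,
# so containers may use idle CPU beyond their allocation.
SOFT_LIMIT_PROPERTIES = {
    "yarn.nodemanager.container-executor.class":
        "org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor",
    "yarn.nodemanager.linux-container-executor.resources-handler.class":
        "org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler",
    "yarn.nodemanager.linux-container-executor.cgroups.strict-resource-usage":
        "false",
}


def render_yarn_site(properties):
    """Render a dict of property name -> value as a yarn-site.xml fragment."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


print(render_yarn_site(SOFT_LIMIT_PROPERTIES))
```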
Container Placement: Stability & Cost Opt.
● No Hot Nodes please!
● Container Placement is critical to
keeping a stable cluster running.
● We want all applications to be well
behaved and work well with our job
schedulers.
● Bad container scheduling = host
running out of capacity at peak.
Container Placement: Option 1
CPU Aware: Schedule on hosts where CPU utilization is < 50th percentile
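A minimal sketch of the CPU-aware filter, assuming "< 50th percentile" means hosts whose current CPU utilization is below the cluster median. The talk does not describe the scheduler implementation, so the function name, host names and utilization figures here are hypothetical.

```python
from statistics import median


def eligible_hosts(cpu_utilization):
    """Return hosts whose CPU utilization is below the cluster median."""
    cutoff = median(cpu_utilization.values())
    return sorted(h for h, u in cpu_utilization.items() if u < cutoff)


# Hypothetical snapshot of per-host CPU utilization (fraction of capacity).
utilization = {
    "host-1": 0.82,
    "host-2": 0.35,
    "host-3": 0.55,
    "host-4": 0.20,
    "host-5": 0.61,
}
print(eligible_hosts(utilization))  # ['host-2', 'host-4']
```

New containers would then only be placed on the returned hosts, which steers load away from nodes that are already hot.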
Container Placement: Option 2
Config: yarn.nodemanager.resource.cpu-vcores = 75% of cores on host
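The arithmetic behind this setting can be sketched as follows. Rounding down and the example host sizes are assumptions; the slide only states the 75% figure.

```python
import math


def advertised_vcores(physical_cores, fraction=0.75):
    """vcores to set in yarn.nodemanager.resource.cpu-vcores for one host."""
    return math.floor(physical_cores * fraction)


# Hypothetical host sizes: the remaining 25% stays as headroom for bursts.
for cores in (16, 64, 96):
    print(cores, "->", advertised_vcores(cores))
```

Because YARN only schedules against the advertised vcores, the scheduler considers a host full while a quarter of its physical cores is still unallocated.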
No traffic-peak stability issues
seen after the container placement
strategy was implemented.
Stability is a prerequisite for optimization
Job Optimization
● Source of significant wins - task placements & vertical sharding.
● Required a full round of re-optimization of our job configurations.
● Mass migrations & rollouts - we got good at it.
● 70%+ reduction in cross-host network traffic for jobs.
● Jobs became 50-90%+ cheaper to run.
● Serialization & Traffic overhead drops.
● Magic: Removing slot sharing groups (SSGs), aligning parallelism across
operators, forcing “ColocationConstraints” and optimizing Flink 1.11 task placements.
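To illustrate why aligned parallelism and co-located subtasks cut cross-host traffic, here is a small Python model. It is a simplification and not Flink code: it counts the fraction of data channels between two operators that cross a host boundary, assuming every channel carries equal traffic. The function name, host names and placements are hypothetical.

```python
def cross_host_fraction(upstream_hosts, downstream_hosts, pattern):
    """Fraction of channels between two operators that cross hosts.

    upstream_hosts[i] / downstream_hosts[j] give the host of each subtask.
    pattern is "forward" (subtask i -> subtask i) or "all_to_all".
    """
    if pattern == "forward":
        if len(upstream_hosts) != len(downstream_hosts):
            raise ValueError("forward connections need equal parallelism")
        channels = list(zip(upstream_hosts, downstream_hosts))
    elif pattern == "all_to_all":
        channels = [(u, d) for u in upstream_hosts for d in downstream_hosts]
    else:
        raise ValueError("unknown pattern: " + pattern)
    crossing = sum(1 for u, d in channels if u != d)
    return crossing / len(channels)


upstream = ["h0", "h0", "h1", "h1"]

# Mismatched parallelism (4 -> 2) forces a redistribution across subtasks.
print(cross_host_fraction(upstream, ["h0", "h1"], "all_to_all"))           # 0.5

# Aligned parallelism with co-located subtasks keeps every channel local.
print(cross_host_fraction(upstream, ["h0", "h0", "h1", "h1"], "forward"))  # 0.0
```

Channels that stay on one host avoid the network hop, which is consistent with the drop in serialization and traffic overhead reported above.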
Job Optimization
Before: CPU utilization showing skewed load. This is wasteful because the lightly loaded
Task Managers are asking for the same resources as the heavily loaded ones.
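The waste from skew can be estimated with a short calculation. This sketch assumes every Task Manager is sized for the most heavily loaded one; the utilization numbers are hypothetical and not taken from the talk's charts.

```python
def wasted_fraction(task_manager_utilization):
    """Share of provisioned CPU left idle when all TMs are sized for the hottest."""
    provisioned = max(task_manager_utilization) * len(task_manager_utilization)
    used = sum(task_manager_utilization)
    return 1 - used / provisioned


skewed = [0.9, 0.3, 0.2, 0.2]    # one hot Task Manager, three lightly loaded
balanced = [0.4, 0.4, 0.4, 0.4]  # same total work, spread evenly

print(round(wasted_fraction(skewed), 2))    # 0.56
print(round(wasted_fraction(balanced), 2))  # 0.0
```

In this example, spreading the same total load evenly would let each Task Manager request less than half of what the hottest one needed.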
Hardware optimization: i3 to i4i
~40% reduction in CPU utilization per job.
Our last wins:
Input Data optimization:
Only read the data the job needs from Kafka. Where
appropriate, we split the Kafka topics.
Autotuning: We built an in-house autotuner so that
we don’t need to keep re-tuning our jobs for CPU
utilization.
These will be covered separately in other talks in the future.
Recap (shown on the slide in four columns, Stage 1 to Stage 4):
● CGroups
● Soft Limits
● Run clusters hotter
● Container Placement strategy
● Job re-tuning
● Job optimization
● Job re-tuning
● Hardware upgrades
● Input Data optimization
● Job autotuning
Our total wins were ~fairly large.
The end result is a nice clean up of
the costs on the streaming stack.
Job costs on Flink were a discussion
point. After optimizations, these
concerns have melted away.
75%
Job cost reduction through
improved placement of Tasks
on Task Managers.
40%
Job cost reduction through
hardware upgrades.
20%
Cluster cost reduction through
CGroups and the ability to run
the clusters hotter.
The percentages don’t sum to 100 because the baselines are different.
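A short worked example of why reductions measured against different baselines cannot simply be added. It assumes, purely for illustration, that two of the reductions apply one after the other to the same job cost; the talk does not state how the figures combine.

```python
def combined_reduction(*reductions):
    """Overall reduction when each step applies to what the previous step left."""
    remaining = 1.0
    for r in reductions:
        remaining *= 1.0 - r
    return 1.0 - remaining


# 75% off, then 40% off what is left: 85% overall, not 75% + 40% = 115%.
print(round(combined_reduction(0.75, 0.40), 2))  # 0.85
```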
Actual spend vs Budgeted spend for the company ($ terms), annotated with: CGroups, Job Optimizations, Hw upgrade, Data opt.
Thank you
http://divye.me - to connect on LinkedIn
