SlideShare a Scribd company logo
1 of 53
Download to read offline
Presentation title
ANDREY FALKO - APRIL 2019
Bulletproof Apache
Kafka®
with Fault
Tree Analysis
Agenda
The Challenge: Quantify Kafka Reliability
Introduction to Fault Tree Analysis
Kafka Fault Trees
Availability
Data Durability
Conclusion
The Challenge: Quantify Kafka Reliability
Kafka is a “reliability tool”
Move data without lossiness
High stakes usage
What are we trying to do?
Observability Data
Event Streaming
DB
Change Data Capture
DB
DB
DB DB
Why Quantify?
Determine probability of success
Find opportunities to trim cost
Defining SLOs
Need to define Service Level Objectives
Availability
Durability
Latency
Quantifying SLOs
Availability
What is the probability that writes or reads fail?
How long do we tolerate downtime?
Quantifying SLOs
Durability
What is the probability that we’ll lose data?
How much will we lose?
Quantifying SLOs
Latency
How long are transactions allowed to take?
Introduction to Fault Tree Analysis
What is Fault Tree Analysis?
Deductive Failure Analysis
Invented in 1962 for Minuteman I ICBM Launch Control System
Industry wide adoption
Aerospace
Military
Petrochemical
Et al.
Fault Tree Analysis: Event Symbols
Basic Intermediate
Transfer
Fault Tree Analysis: Gate Symbols
ANDOR
Fault Tree Analysis: OR Example
Fault Tree Analysis: OR Example
4% probability of failure
Fault Tree Analysis: OR Example
4% probability of failure
annualized
Fault Tree Analysis: OR Example
p(A “or” B) =
p(A) + p(B) - p(A) * p(B)
Almost always small
Fault Tree Analysis: OR Example
12% chance of failure
annualized
Fault Tree Analysis: AND Example
Fault Tree Analysis: AND Example
p(A and B) = p(A) * p(B)
Fault Tree Analysis: AND Example
99.9936% chance of
success annualized
Fault Tree Analysis: AND Example
99.9936% chance of success
annualized if we don’t
remediate
Fault Tree Analysis: AND Example
99.999999996% chance of
success if we remediate within
3 days
Fault Tree Analysis: AND Example
99.99999% chance of
success if we remediate
within 3 days
Fault Tree Analysis: AND Example
p(A and B) =
p(A) * p(B) =
(1 - e-p(A)*t
) * (1 - e-p(B)*t
)
Where t = time to remediate
If p(A) and p(B) < .1, approximate to
p(A)*p(B)*t2
Kafka Fault Trees
Can we write or read to a Kafka cluster?
Service Level Objective (SLO):
99.99% success rate per year
Availability
4% 2% 1% 1%
?
Availability
Availability
4% 2% 1% 1%.8% 2% 1% 1%
4.8% failure chance 8% failure chance
12.8% failure chance
Availability - Two brokers, single ZK
Availability - Collapse Host Faults
.8%
2% 1% 1%
4%
98.4% success chance
Availability - Multiple ZKs
99.4% success chance
Availability - Three Brokers
99.95% success chance
Availability - Summary
Success Probability Cost Per Nine*
Standalone 87.2% n/a
Two brokers, ISR=1, One ZK 98.36% 2
Two brokers, ISR=1, Three ZKs 99.36% 2
Two brokers, ISR=1, Five ZKs 99.36% 3
Three brokers, ISR=1, Three ZKs 99.95% 1.5
Three brokers, ISR=2, Three ZKs 99.36% 2.25
* Cost is computed in “disk units” / “number of nines”:
Kafka Broker Rotational Disk = .5
Zookeeper SSD Disk = 1
Lower is better
Availability - Broker SSD
Success Probability Cost Per Nine*
Standalone 90.4% 1
Two brokers, ISR=1, One ZK 99.08% 1.5
Two brokers, ISR=1, Three ZKs 99.77% 2.5
Two brokers, ISR=1, Five ZKs 99.77% 3.5
Three brokers, ISR=1, Three ZKs 99.99% 1.5
Three brokers, ISR=2, Three ZKs 99.77% 3
* SSD Disk = 1
Availability - Broker EBS
Success Probability Cost Per Nine*
Standalone 91.6% 1.5
Two brokers, ISR=1, One ZK 99.29% 2.25
Two brokers, ISR=1, Three ZKs 99.82% 3.75
Two brokers, ISR=1, Five ZKs 99.82% 5.25
Three brokers, ISR=1, Three ZKs 99.99% 4.5
Three brokers, ISR=2, Three ZKs 99.82% 2.25
* EBS disk units:
EBS SSD Disk = 1.5
Assumption that EBS fails at .2%
What are the chances of losing data?
Service Level Objective (SLO):
99.99999% durability per year
Durability
We lose data when all hosts with replicas go down
Assumptions:
6TB per broker (2TB per disk w/ RAID)
70MB/s replication rate
~24 hours to replicate full broker
We replace bad hosts almost immediately
Durability
Durability - Two brokers - One 6TB Disk
1 - e-p(A)*t
* 1 -
e-p(B)*t
99.999999% durability
Durability - Add Raid0
99.9999% durability
Durability - Three brokers
99.99999999% durability
Assumption:
48TB per broker
70MB/s replication rate
~8 days to replicate full broker
We replace bad hosts almost immediately
Durability
Durability - Three brokers 24 disks
99.999% completeness
24 disks
Durability - Summary
Data completeness Cost Per Nine*
Standalone 99.99% .125
Two brokers 99.999999% .5
Two brokers RAID0 99.9999% .22
Three brokers RAID0 99.99999999% .15
Three brokers RAID0 - 48TB 99.999% 1
* Cost is computed in “disk units” / “number of nines”:
Single non-raid disk = .5
Raid0 = .167
Zookeeper SSD Disk = 1
Lower is better
Latency
FTA focused on failures
Latency is not an inherent failure
Experiment with worst-case scenarios
Conclusion
Fault Tree Models: github.com/afalko/fta-kafka
OSS tool to draw and compute models: github.com/rakhimov/scram
How Not to Go Boom: Lessons for SREs from Oil Refineries by Emil Stolarsky
Fault Tree Analysis - A History by Clifton A. Ericson II
Fault Tree Handbook with Aerospace Applications by Dr. Michael Stamatelatos and Mr. José Caraballo
Failure Trends in a Large Disk Drive Population by Eduardo Pinheiro, Wolf-Dietrich Weber and Luiz Andre Barroso
Solving Data Loss in Massive Storage Systems by Jason Resch
Failures at Scale and How to Ride Through Them by James Hamilton
Tools and References
FTA can be applied to:
Kafka Availability and Durability SLOs
Find cost savings
Uncover decisions that reduce reliability
Takeaways
Kafka on Kubernetes analysis
Improve scram-pra
Better FTA inputs via Distributed Tracing
Further Work
github.com/afalko/fta-kafka
Andrey Falko <afalko@lyft.com>
Thank You
Thank you!

More Related Content

What's hot

From Three Nines to Five Nines - A Kafka Journey
From Three Nines to Five Nines - A Kafka JourneyFrom Three Nines to Five Nines - A Kafka Journey
From Three Nines to Five Nines - A Kafka JourneyAllen (Xiaozhong) Wang
 
Distributed and Fault Tolerant Realtime Computation with Apache Storm, Apache...
Distributed and Fault Tolerant Realtime Computation with Apache Storm, Apache...Distributed and Fault Tolerant Realtime Computation with Apache Storm, Apache...
Distributed and Fault Tolerant Realtime Computation with Apache Storm, Apache...Folio3 Software
 
Ingesting Healthcare Data, Micah Whitacre
Ingesting Healthcare Data, Micah WhitacreIngesting Healthcare Data, Micah Whitacre
Ingesting Healthcare Data, Micah Whitacreconfluent
 
(BDT318) How Netflix Handles Up To 8 Million Events Per Second
(BDT318) How Netflix Handles Up To 8 Million Events Per Second(BDT318) How Netflix Handles Up To 8 Million Events Per Second
(BDT318) How Netflix Handles Up To 8 Million Events Per SecondAmazon Web Services
 
ApacheCon2019 Talk: Kafka, Cassandra and Kubernetes at Scale – Real-time Ano...
ApacheCon2019 Talk: Kafka, Cassandra and Kubernetesat Scale – Real-time Ano...ApacheCon2019 Talk: Kafka, Cassandra and Kubernetesat Scale – Real-time Ano...
ApacheCon2019 Talk: Kafka, Cassandra and Kubernetes at Scale – Real-time Ano...Paul Brebner
 
Experience with Kafka & Storm
Experience with Kafka & StormExperience with Kafka & Storm
Experience with Kafka & StormOtto Mok
 
Jitney, Kafka at Airbnb
Jitney, Kafka at AirbnbJitney, Kafka at Airbnb
Jitney, Kafka at Airbnbalexismidon
 
Apache Samza: Reliable Stream Processing Atop Apache Kafka and Hadoop YARN
Apache Samza: Reliable Stream Processing Atop Apache Kafka and Hadoop YARNApache Samza: Reliable Stream Processing Atop Apache Kafka and Hadoop YARN
Apache Samza: Reliable Stream Processing Atop Apache Kafka and Hadoop YARNblueboxtraveler
 
Building Large-Scale Stream Infrastructures Across Multiple Data Centers with...
Building Large-Scale Stream Infrastructures Across Multiple Data Centers with...Building Large-Scale Stream Infrastructures Across Multiple Data Centers with...
Building Large-Scale Stream Infrastructures Across Multiple Data Centers with...confluent
 
Maximizing Audience Engagement in Media Delivery (MED303) | AWS re:Invent 2013
Maximizing Audience Engagement in Media Delivery (MED303) | AWS re:Invent 2013Maximizing Audience Engagement in Media Delivery (MED303) | AWS re:Invent 2013
Maximizing Audience Engagement in Media Delivery (MED303) | AWS re:Invent 2013Amazon Web Services
 
101 ways to configure kafka - badly (Kafka Summit)
101 ways to configure kafka - badly (Kafka Summit)101 ways to configure kafka - badly (Kafka Summit)
101 ways to configure kafka - badly (Kafka Summit)Henning Spjelkavik
 
Siphon - Near Real Time Databus Using Kafka, Eric Boyd, Nitin Kumar
Siphon - Near Real Time Databus Using Kafka, Eric Boyd, Nitin KumarSiphon - Near Real Time Databus Using Kafka, Eric Boyd, Nitin Kumar
Siphon - Near Real Time Databus Using Kafka, Eric Boyd, Nitin Kumarconfluent
 
Introducing KSML: Kafka Streams for low code environments | Jeroen van Dissel...
Introducing KSML: Kafka Streams for low code environments | Jeroen van Dissel...Introducing KSML: Kafka Streams for low code environments | Jeroen van Dissel...
Introducing KSML: Kafka Streams for low code environments | Jeroen van Dissel...HostedbyConfluent
 
Real time analytics with Kafka and SparkStreaming
Real time analytics with Kafka and SparkStreamingReal time analytics with Kafka and SparkStreaming
Real time analytics with Kafka and SparkStreamingAshish Singh
 
netflix-real-time-data-strata-talk
netflix-real-time-data-strata-talknetflix-real-time-data-strata-talk
netflix-real-time-data-strata-talkDanny Yuan
 
The Netflix Way to deal with Big Data Problems
The Netflix Way to deal with Big Data ProblemsThe Netflix Way to deal with Big Data Problems
The Netflix Way to deal with Big Data ProblemsMonal Daxini
 
Deploying Confluent Platform for Production
Deploying Confluent Platform for ProductionDeploying Confluent Platform for Production
Deploying Confluent Platform for Productionconfluent
 
Fundamentals of Stream Processing with Apache Beam, Tyler Akidau, Frances Perry
Fundamentals of Stream Processing with Apache Beam, Tyler Akidau, Frances Perry Fundamentals of Stream Processing with Apache Beam, Tyler Akidau, Frances Perry
Fundamentals of Stream Processing with Apache Beam, Tyler Akidau, Frances Perry confluent
 
Should you read Kafka as a stream or in batch? Should you even care? | Ido Na...
Should you read Kafka as a stream or in batch? Should you even care? | Ido Na...Should you read Kafka as a stream or in batch? Should you even care? | Ido Na...
Should you read Kafka as a stream or in batch? Should you even care? | Ido Na...HostedbyConfluent
 

What's hot (20)

From Three Nines to Five Nines - A Kafka Journey
From Three Nines to Five Nines - A Kafka JourneyFrom Three Nines to Five Nines - A Kafka Journey
From Three Nines to Five Nines - A Kafka Journey
 
Distributed and Fault Tolerant Realtime Computation with Apache Storm, Apache...
Distributed and Fault Tolerant Realtime Computation with Apache Storm, Apache...Distributed and Fault Tolerant Realtime Computation with Apache Storm, Apache...
Distributed and Fault Tolerant Realtime Computation with Apache Storm, Apache...
 
Data Pipeline with Kafka
Data Pipeline with KafkaData Pipeline with Kafka
Data Pipeline with Kafka
 
Ingesting Healthcare Data, Micah Whitacre
Ingesting Healthcare Data, Micah WhitacreIngesting Healthcare Data, Micah Whitacre
Ingesting Healthcare Data, Micah Whitacre
 
(BDT318) How Netflix Handles Up To 8 Million Events Per Second
(BDT318) How Netflix Handles Up To 8 Million Events Per Second(BDT318) How Netflix Handles Up To 8 Million Events Per Second
(BDT318) How Netflix Handles Up To 8 Million Events Per Second
 
ApacheCon2019 Talk: Kafka, Cassandra and Kubernetes at Scale – Real-time Ano...
ApacheCon2019 Talk: Kafka, Cassandra and Kubernetesat Scale – Real-time Ano...ApacheCon2019 Talk: Kafka, Cassandra and Kubernetesat Scale – Real-time Ano...
ApacheCon2019 Talk: Kafka, Cassandra and Kubernetes at Scale – Real-time Ano...
 
Experience with Kafka & Storm
Experience with Kafka & StormExperience with Kafka & Storm
Experience with Kafka & Storm
 
Jitney, Kafka at Airbnb
Jitney, Kafka at AirbnbJitney, Kafka at Airbnb
Jitney, Kafka at Airbnb
 
Apache Samza: Reliable Stream Processing Atop Apache Kafka and Hadoop YARN
Apache Samza: Reliable Stream Processing Atop Apache Kafka and Hadoop YARNApache Samza: Reliable Stream Processing Atop Apache Kafka and Hadoop YARN
Apache Samza: Reliable Stream Processing Atop Apache Kafka and Hadoop YARN
 
Building Large-Scale Stream Infrastructures Across Multiple Data Centers with...
Building Large-Scale Stream Infrastructures Across Multiple Data Centers with...Building Large-Scale Stream Infrastructures Across Multiple Data Centers with...
Building Large-Scale Stream Infrastructures Across Multiple Data Centers with...
 
Maximizing Audience Engagement in Media Delivery (MED303) | AWS re:Invent 2013
Maximizing Audience Engagement in Media Delivery (MED303) | AWS re:Invent 2013Maximizing Audience Engagement in Media Delivery (MED303) | AWS re:Invent 2013
Maximizing Audience Engagement in Media Delivery (MED303) | AWS re:Invent 2013
 
101 ways to configure kafka - badly (Kafka Summit)
101 ways to configure kafka - badly (Kafka Summit)101 ways to configure kafka - badly (Kafka Summit)
101 ways to configure kafka - badly (Kafka Summit)
 
Siphon - Near Real Time Databus Using Kafka, Eric Boyd, Nitin Kumar
Siphon - Near Real Time Databus Using Kafka, Eric Boyd, Nitin KumarSiphon - Near Real Time Databus Using Kafka, Eric Boyd, Nitin Kumar
Siphon - Near Real Time Databus Using Kafka, Eric Boyd, Nitin Kumar
 
Introducing KSML: Kafka Streams for low code environments | Jeroen van Dissel...
Introducing KSML: Kafka Streams for low code environments | Jeroen van Dissel...Introducing KSML: Kafka Streams for low code environments | Jeroen van Dissel...
Introducing KSML: Kafka Streams for low code environments | Jeroen van Dissel...
 
Real time analytics with Kafka and SparkStreaming
Real time analytics with Kafka and SparkStreamingReal time analytics with Kafka and SparkStreaming
Real time analytics with Kafka and SparkStreaming
 
netflix-real-time-data-strata-talk
netflix-real-time-data-strata-talknetflix-real-time-data-strata-talk
netflix-real-time-data-strata-talk
 
The Netflix Way to deal with Big Data Problems
The Netflix Way to deal with Big Data ProblemsThe Netflix Way to deal with Big Data Problems
The Netflix Way to deal with Big Data Problems
 
Deploying Confluent Platform for Production
Deploying Confluent Platform for ProductionDeploying Confluent Platform for Production
Deploying Confluent Platform for Production
 
Fundamentals of Stream Processing with Apache Beam, Tyler Akidau, Frances Perry
Fundamentals of Stream Processing with Apache Beam, Tyler Akidau, Frances Perry Fundamentals of Stream Processing with Apache Beam, Tyler Akidau, Frances Perry
Fundamentals of Stream Processing with Apache Beam, Tyler Akidau, Frances Perry
 
Should you read Kafka as a stream or in batch? Should you even care? | Ido Na...
Should you read Kafka as a stream or in batch? Should you even care? | Ido Na...Should you read Kafka as a stream or in batch? Should you even care? | Ido Na...
Should you read Kafka as a stream or in batch? Should you even care? | Ido Na...
 

Similar to Bulletproof Kafka with Fault Tree Analysis (Andrey Falko, Lyft) Kafka Summit NYC 2019

Технологии работы с дисковыми хранилищами и файловыми системами Windows Serve...
Технологии работы с дисковыми хранилищами и файловыми системами Windows Serve...Технологии работы с дисковыми хранилищами и файловыми системами Windows Serve...
Технологии работы с дисковыми хранилищами и файловыми системами Windows Serve...Виталий Стародубцев
 
Webinar: Does it Still Make Sense to do Big Data with Small Nodes?
Webinar: Does it Still Make Sense to do Big Data with Small Nodes?Webinar: Does it Still Make Sense to do Big Data with Small Nodes?
Webinar: Does it Still Make Sense to do Big Data with Small Nodes?Julia Angell
 
Does it still make sense to do Big Data with Small Nodes?
Does it still make sense to do Big Data with Small Nodes? Does it still make sense to do Big Data with Small Nodes?
Does it still make sense to do Big Data with Small Nodes? ScyllaDB
 
Sharing is Caring: Toward Creating Self-tuning Multi-tenant Kafka (Anna Povzn...
Sharing is Caring: Toward Creating Self-tuning Multi-tenant Kafka (Anna Povzn...Sharing is Caring: Toward Creating Self-tuning Multi-tenant Kafka (Anna Povzn...
Sharing is Caring: Toward Creating Self-tuning Multi-tenant Kafka (Anna Povzn...HostedbyConfluent
 
Redis Reliability, Performance & Innovation
Redis Reliability, Performance & InnovationRedis Reliability, Performance & Innovation
Redis Reliability, Performance & InnovationRedis Labs
 
Save Money by Uncovering Kafka’s Hidden Cloud Costs
Save Money by Uncovering Kafka’s Hidden Cloud CostsSave Money by Uncovering Kafka’s Hidden Cloud Costs
Save Money by Uncovering Kafka’s Hidden Cloud CostsHostedbyConfluent
 
Percona XtraDB 集群文档
Percona XtraDB 集群文档Percona XtraDB 集群文档
Percona XtraDB 集群文档YUCHENG HU
 
Breda Development Meetup 2016-06-08 - High Availability
Breda Development Meetup 2016-06-08 - High AvailabilityBreda Development Meetup 2016-06-08 - High Availability
Breda Development Meetup 2016-06-08 - High AvailabilityBas Peters
 
Scaling with MongoDB
Scaling with MongoDBScaling with MongoDB
Scaling with MongoDBMongoDB
 
Troubleshooting common oslo.messaging and RabbitMQ issues
Troubleshooting common oslo.messaging and RabbitMQ issuesTroubleshooting common oslo.messaging and RabbitMQ issues
Troubleshooting common oslo.messaging and RabbitMQ issuesMichael Klishin
 
Java zone 2015 How to make life with kafka easier.
Java zone 2015 How to make life with kafka easier.Java zone 2015 How to make life with kafka easier.
Java zone 2015 How to make life with kafka easier.Krzysztof Debski
 
Deep Dive on the Amazon Aurora MySQL-compatible Edition - DAT301 - re:Invent ...
Deep Dive on the Amazon Aurora MySQL-compatible Edition - DAT301 - re:Invent ...Deep Dive on the Amazon Aurora MySQL-compatible Edition - DAT301 - re:Invent ...
Deep Dive on the Amazon Aurora MySQL-compatible Edition - DAT301 - re:Invent ...Amazon Web Services
 
Apache Kafka – (Pattern and) Anti-Pattern
Apache Kafka – (Pattern and) Anti-PatternApache Kafka – (Pattern and) Anti-Pattern
Apache Kafka – (Pattern and) Anti-Patternconfluent
 
AQM performance for VOIP
AQM performance for VOIPAQM performance for VOIP
AQM performance for VOIPMakkawy khair
 
Open HFT libraries in @Java
Open HFT libraries in @JavaOpen HFT libraries in @Java
Open HFT libraries in @JavaPeter Lawrey
 
Day 2 General Session Presentations RedisConf
Day 2 General Session Presentations RedisConfDay 2 General Session Presentations RedisConf
Day 2 General Session Presentations RedisConfRedis Labs
 
How it's made: C++ compilers (GCC)
How it's made: C++ compilers (GCC)How it's made: C++ compilers (GCC)
How it's made: C++ compilers (GCC)Sławomir Zborowski
 
The Details That Matter: Kafka in Production, at Scale with Or Arnon and Elad...
The Details That Matter: Kafka in Production, at Scale with Or Arnon and Elad...The Details That Matter: Kafka in Production, at Scale with Or Arnon and Elad...
The Details That Matter: Kafka in Production, at Scale with Or Arnon and Elad...HostedbyConfluent
 
Tales from the four-comma club: Managing Kafka as a service at Salesforce | L...
Tales from the four-comma club: Managing Kafka as a service at Salesforce | L...Tales from the four-comma club: Managing Kafka as a service at Salesforce | L...
Tales from the four-comma club: Managing Kafka as a service at Salesforce | L...HostedbyConfluent
 
Ginsbourg.com - Performance and Load Test Report Template LTR 1.2
Ginsbourg.com - Performance and Load Test Report Template LTR 1.2Ginsbourg.com - Performance and Load Test Report Template LTR 1.2
Ginsbourg.com - Performance and Load Test Report Template LTR 1.2Shay Ginsbourg
 

Similar to Bulletproof Kafka with Fault Tree Analysis (Andrey Falko, Lyft) Kafka Summit NYC 2019 (20)

Технологии работы с дисковыми хранилищами и файловыми системами Windows Serve...
Технологии работы с дисковыми хранилищами и файловыми системами Windows Serve...Технологии работы с дисковыми хранилищами и файловыми системами Windows Serve...
Технологии работы с дисковыми хранилищами и файловыми системами Windows Serve...
 
Webinar: Does it Still Make Sense to do Big Data with Small Nodes?
Webinar: Does it Still Make Sense to do Big Data with Small Nodes?Webinar: Does it Still Make Sense to do Big Data with Small Nodes?
Webinar: Does it Still Make Sense to do Big Data with Small Nodes?
 
Does it still make sense to do Big Data with Small Nodes?
Does it still make sense to do Big Data with Small Nodes? Does it still make sense to do Big Data with Small Nodes?
Does it still make sense to do Big Data with Small Nodes?
 
Sharing is Caring: Toward Creating Self-tuning Multi-tenant Kafka (Anna Povzn...
Sharing is Caring: Toward Creating Self-tuning Multi-tenant Kafka (Anna Povzn...Sharing is Caring: Toward Creating Self-tuning Multi-tenant Kafka (Anna Povzn...
Sharing is Caring: Toward Creating Self-tuning Multi-tenant Kafka (Anna Povzn...
 
Redis Reliability, Performance & Innovation
Redis Reliability, Performance & InnovationRedis Reliability, Performance & Innovation
Redis Reliability, Performance & Innovation
 
Save Money by Uncovering Kafka’s Hidden Cloud Costs
Save Money by Uncovering Kafka’s Hidden Cloud CostsSave Money by Uncovering Kafka’s Hidden Cloud Costs
Save Money by Uncovering Kafka’s Hidden Cloud Costs
 
Percona XtraDB 集群文档
Percona XtraDB 集群文档Percona XtraDB 集群文档
Percona XtraDB 集群文档
 
Breda Development Meetup 2016-06-08 - High Availability
Breda Development Meetup 2016-06-08 - High AvailabilityBreda Development Meetup 2016-06-08 - High Availability
Breda Development Meetup 2016-06-08 - High Availability
 
Scaling with MongoDB
Scaling with MongoDBScaling with MongoDB
Scaling with MongoDB
 
Troubleshooting common oslo.messaging and RabbitMQ issues
Troubleshooting common oslo.messaging and RabbitMQ issuesTroubleshooting common oslo.messaging and RabbitMQ issues
Troubleshooting common oslo.messaging and RabbitMQ issues
 
Java zone 2015 How to make life with kafka easier.
Java zone 2015 How to make life with kafka easier.Java zone 2015 How to make life with kafka easier.
Java zone 2015 How to make life with kafka easier.
 
Deep Dive on the Amazon Aurora MySQL-compatible Edition - DAT301 - re:Invent ...
Deep Dive on the Amazon Aurora MySQL-compatible Edition - DAT301 - re:Invent ...Deep Dive on the Amazon Aurora MySQL-compatible Edition - DAT301 - re:Invent ...
Deep Dive on the Amazon Aurora MySQL-compatible Edition - DAT301 - re:Invent ...
 
Apache Kafka – (Pattern and) Anti-Pattern
Apache Kafka – (Pattern and) Anti-PatternApache Kafka – (Pattern and) Anti-Pattern
Apache Kafka – (Pattern and) Anti-Pattern
 
AQM performance for VOIP
AQM performance for VOIPAQM performance for VOIP
AQM performance for VOIP
 
Open HFT libraries in @Java
Open HFT libraries in @JavaOpen HFT libraries in @Java
Open HFT libraries in @Java
 
Day 2 General Session Presentations RedisConf
Day 2 General Session Presentations RedisConfDay 2 General Session Presentations RedisConf
Day 2 General Session Presentations RedisConf
 
How it's made: C++ compilers (GCC)
How it's made: C++ compilers (GCC)How it's made: C++ compilers (GCC)
How it's made: C++ compilers (GCC)
 
The Details That Matter: Kafka in Production, at Scale with Or Arnon and Elad...
The Details That Matter: Kafka in Production, at Scale with Or Arnon and Elad...The Details That Matter: Kafka in Production, at Scale with Or Arnon and Elad...
The Details That Matter: Kafka in Production, at Scale with Or Arnon and Elad...
 
Tales from the four-comma club: Managing Kafka as a service at Salesforce | L...
Tales from the four-comma club: Managing Kafka as a service at Salesforce | L...Tales from the four-comma club: Managing Kafka as a service at Salesforce | L...
Tales from the four-comma club: Managing Kafka as a service at Salesforce | L...
 
Ginsbourg.com - Performance and Load Test Report Template LTR 1.2
Ginsbourg.com - Performance and Load Test Report Template LTR 1.2Ginsbourg.com - Performance and Load Test Report Template LTR 1.2
Ginsbourg.com - Performance and Load Test Report Template LTR 1.2
 

More from confluent

Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...confluent
 
Santander Stream Processing with Apache Flink
Santander Stream Processing with Apache FlinkSantander Stream Processing with Apache Flink
Santander Stream Processing with Apache Flinkconfluent
 
Unlocking the Power of IoT: A comprehensive approach to real-time insights
Unlocking the Power of IoT: A comprehensive approach to real-time insightsUnlocking the Power of IoT: A comprehensive approach to real-time insights
Unlocking the Power of IoT: A comprehensive approach to real-time insightsconfluent
 
Workshop híbrido: Stream Processing con Flink
Workshop híbrido: Stream Processing con FlinkWorkshop híbrido: Stream Processing con Flink
Workshop híbrido: Stream Processing con Flinkconfluent
 
Industry 4.0: Building the Unified Namespace with Confluent, HiveMQ and Spark...
Industry 4.0: Building the Unified Namespace with Confluent, HiveMQ and Spark...Industry 4.0: Building the Unified Namespace with Confluent, HiveMQ and Spark...
Industry 4.0: Building the Unified Namespace with Confluent, HiveMQ and Spark...confluent
 
AWS Immersion Day Mapfre - Confluent
AWS Immersion Day Mapfre   -   ConfluentAWS Immersion Day Mapfre   -   Confluent
AWS Immersion Day Mapfre - Confluentconfluent
 
Eventos y Microservicios - Santander TechTalk
Eventos y Microservicios - Santander TechTalkEventos y Microservicios - Santander TechTalk
Eventos y Microservicios - Santander TechTalkconfluent
 
Q&A with Confluent Experts: Navigating Networking in Confluent Cloud
Q&A with Confluent Experts: Navigating Networking in Confluent CloudQ&A with Confluent Experts: Navigating Networking in Confluent Cloud
Q&A with Confluent Experts: Navigating Networking in Confluent Cloudconfluent
 
Citi TechTalk Session 2: Kafka Deep Dive
Citi TechTalk Session 2: Kafka Deep DiveCiti TechTalk Session 2: Kafka Deep Dive
Citi TechTalk Session 2: Kafka Deep Diveconfluent
 
Build real-time streaming data pipelines to AWS with Confluent
Build real-time streaming data pipelines to AWS with ConfluentBuild real-time streaming data pipelines to AWS with Confluent
Build real-time streaming data pipelines to AWS with Confluentconfluent
 
Q&A with Confluent Professional Services: Confluent Service Mesh
Q&A with Confluent Professional Services: Confluent Service MeshQ&A with Confluent Professional Services: Confluent Service Mesh
Q&A with Confluent Professional Services: Confluent Service Meshconfluent
 
Citi Tech Talk: Event Driven Kafka Microservices
Citi Tech Talk: Event Driven Kafka MicroservicesCiti Tech Talk: Event Driven Kafka Microservices
Citi Tech Talk: Event Driven Kafka Microservicesconfluent
 
Confluent & GSI Webinars series - Session 3
Confluent & GSI Webinars series - Session 3Confluent & GSI Webinars series - Session 3
Confluent & GSI Webinars series - Session 3confluent
 
Citi Tech Talk: Messaging Modernization
Citi Tech Talk: Messaging ModernizationCiti Tech Talk: Messaging Modernization
Citi Tech Talk: Messaging Modernizationconfluent
 
Citi Tech Talk: Data Governance for streaming and real time data
Citi Tech Talk: Data Governance for streaming and real time dataCiti Tech Talk: Data Governance for streaming and real time data
Citi Tech Talk: Data Governance for streaming and real time dataconfluent
 
Confluent & GSI Webinars series: Session 2
Confluent & GSI Webinars series: Session 2Confluent & GSI Webinars series: Session 2
Confluent & GSI Webinars series: Session 2confluent
 
Data In Motion Paris 2023
Data In Motion Paris 2023Data In Motion Paris 2023
Data In Motion Paris 2023confluent
 
Confluent Partner Tech Talk with Synthesis
Confluent Partner Tech Talk with SynthesisConfluent Partner Tech Talk with Synthesis
Confluent Partner Tech Talk with Synthesisconfluent
 
The Future of Application Development - API Days - Melbourne 2023
The Future of Application Development - API Days - Melbourne 2023The Future of Application Development - API Days - Melbourne 2023
The Future of Application Development - API Days - Melbourne 2023confluent
 
The Playful Bond Between REST And Data Streams
The Playful Bond Between REST And Data StreamsThe Playful Bond Between REST And Data Streams
The Playful Bond Between REST And Data Streamsconfluent
 

More from confluent (20)

Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
 
Santander Stream Processing with Apache Flink
Santander Stream Processing with Apache FlinkSantander Stream Processing with Apache Flink
Santander Stream Processing with Apache Flink
 
Unlocking the Power of IoT: A comprehensive approach to real-time insights
Unlocking the Power of IoT: A comprehensive approach to real-time insightsUnlocking the Power of IoT: A comprehensive approach to real-time insights
Unlocking the Power of IoT: A comprehensive approach to real-time insights
 
Workshop híbrido: Stream Processing con Flink
Workshop híbrido: Stream Processing con FlinkWorkshop híbrido: Stream Processing con Flink
Workshop híbrido: Stream Processing con Flink
 
Industry 4.0: Building the Unified Namespace with Confluent, HiveMQ and Spark...
Industry 4.0: Building the Unified Namespace with Confluent, HiveMQ and Spark...Industry 4.0: Building the Unified Namespace with Confluent, HiveMQ and Spark...
Industry 4.0: Building the Unified Namespace with Confluent, HiveMQ and Spark...
 
AWS Immersion Day Mapfre - Confluent
AWS Immersion Day Mapfre   -   ConfluentAWS Immersion Day Mapfre   -   Confluent
AWS Immersion Day Mapfre - Confluent
 
Eventos y Microservicios - Santander TechTalk
Eventos y Microservicios - Santander TechTalkEventos y Microservicios - Santander TechTalk
Eventos y Microservicios - Santander TechTalk
 
Q&A with Confluent Experts: Navigating Networking in Confluent Cloud
Q&A with Confluent Experts: Navigating Networking in Confluent CloudQ&A with Confluent Experts: Navigating Networking in Confluent Cloud
Q&A with Confluent Experts: Navigating Networking in Confluent Cloud
 
Citi TechTalk Session 2: Kafka Deep Dive
Citi TechTalk Session 2: Kafka Deep DiveCiti TechTalk Session 2: Kafka Deep Dive
Citi TechTalk Session 2: Kafka Deep Dive
 
Build real-time streaming data pipelines to AWS with Confluent
Build real-time streaming data pipelines to AWS with ConfluentBuild real-time streaming data pipelines to AWS with Confluent
Build real-time streaming data pipelines to AWS with Confluent
 
Q&A with Confluent Professional Services: Confluent Service Mesh
Q&A with Confluent Professional Services: Confluent Service MeshQ&A with Confluent Professional Services: Confluent Service Mesh
Q&A with Confluent Professional Services: Confluent Service Mesh
 
Citi Tech Talk: Event Driven Kafka Microservices
Citi Tech Talk: Event Driven Kafka MicroservicesCiti Tech Talk: Event Driven Kafka Microservices
Citi Tech Talk: Event Driven Kafka Microservices
 
Confluent & GSI Webinars series - Session 3
Confluent & GSI Webinars series - Session 3Confluent & GSI Webinars series - Session 3
Confluent & GSI Webinars series - Session 3
 
Citi Tech Talk: Messaging Modernization
Citi Tech Talk: Messaging ModernizationCiti Tech Talk: Messaging Modernization
Citi Tech Talk: Messaging Modernization
 
Citi Tech Talk: Data Governance for streaming and real time data
Citi Tech Talk: Data Governance for streaming and real time dataCiti Tech Talk: Data Governance for streaming and real time data
Citi Tech Talk: Data Governance for streaming and real time data
 
Confluent & GSI Webinars series: Session 2
Confluent & GSI Webinars series: Session 2Confluent & GSI Webinars series: Session 2
Confluent & GSI Webinars series: Session 2
 
Data In Motion Paris 2023
Data In Motion Paris 2023Data In Motion Paris 2023
Data In Motion Paris 2023
 
Confluent Partner Tech Talk with Synthesis
Confluent Partner Tech Talk with SynthesisConfluent Partner Tech Talk with Synthesis
Confluent Partner Tech Talk with Synthesis
 
The Future of Application Development - API Days - Melbourne 2023
The Future of Application Development - API Days - Melbourne 2023The Future of Application Development - API Days - Melbourne 2023
The Future of Application Development - API Days - Melbourne 2023
 
The Playful Bond Between REST And Data Streams
The Playful Bond Between REST And Data StreamsThe Playful Bond Between REST And Data Streams
The Playful Bond Between REST And Data Streams
 

Recently uploaded

Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Neo4j
 
Unlocking the Potential of the Cloud for IBM Power Systems
Unlocking the Potential of the Cloud for IBM Power SystemsUnlocking the Potential of the Cloud for IBM Power Systems
Unlocking the Potential of the Cloud for IBM Power SystemsPrecisely
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024BookNet Canada
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 

Recently uploaded (20)

Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024
 
Unlocking the Potential of the Cloud for IBM Power Systems
Unlocking the Potential of the Cloud for IBM Power SystemsUnlocking the Potential of the Cloud for IBM Power Systems
Unlocking the Potential of the Cloud for IBM Power Systems
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 

Bulletproof Kafka with Fault Tree Analysis (Andrey Falko, Lyft) Kafka Summit NYC 2019

  • 1. Presentation title ANDREY FALKO - APRIL 2019 Bulletproof Apache Kafka® with Fault Tree Analysis
  • 2. Agenda The Challenge: Quantify Kafka Reliability Introduction to Fault Tree Analysis Kafka Fault Trees Availability Data Durability Conclusion
  • 3. The Challenge: Quantify Kafka Reliability
  • 4. Kafka is a “reliability tool” Move data without lossiness High stakes usage What are we trying to do?
  • 8. Why Quantify? Determine probability of success Find opportunities to trim cost
  • 9. Defining SLOs Need to define Service Level Objectives Availability Durability Latency
  • 10. Quantifying SLOs Availability What is the probability that writes or reads fail? How long do we tolerate downtime?
  • 11. Quantifying SLOs Durability What is the probability that we’ll lose data? How much will we lose?
  • 12. Quantifying SLOs Latency How long are transactions allowed to take?
  • 13. Introduction to Fault Tree Analysis
  • 14. What is Fault Tree Analysis? Deductive Failure Analysis Invented in 1962 for Minuteman I ICBM Launch Control System Industry wide adoption Aerospace Military Petrochemical Et al.
  • 15. Fault Tree Analysis: Event Symbols Basic Intermediate Transfer
  • 16. Fault Tree Analysis: Gate Symbols ANDOR
  • 17. Fault Tree Analysis: OR Example
  • 18. Fault Tree Analysis: OR Example 4% probability of failure
  • 19. Fault Tree Analysis: OR Example 4% probability of failure annualized
  • 20. Fault Tree Analysis: OR Example p(A “or” B) = p(A) + p(B) - p(A) * p(B) Almost always small
  • 21. Fault Tree Analysis: OR Example 12% chance of failure annualized
  • 22. Fault Tree Analysis: AND Example
  • 23. Fault Tree Analysis: AND Example p(A and B) = p(A) * p(B)
  • 24. Fault Tree Analysis: AND Example 99.9936% chance of success annualized
  • 25. Fault Tree Analysis: AND Example 99.9936% chance of success annualized if we don’t remediate
  • 26. Fault Tree Analysis: AND Example 99.999999996% chance of success if we remediate within 3 days
  • 27. Fault Tree Analysis: AND Example 99.99999% chance of success if we remediate within 3 days
  • 28. Fault Tree Analysis: AND Example p(A and B) = p(A) * p(B) = (1 - e-p(A)*t ) * (1 - e-p(B)*t ) Where t = time to remediate If p(A) and p(B) < .1, approximate to p(A)*p(B)*t2
  • 30. Can we write or read to a Kafka cluster? Service Level Objective (SLO): 99.99% success rate per year Availability
  • 31. 4% 2% 1% 1% ? Availability
  • 32. Availability 4% 2% 1% 1%.8% 2% 1% 1% 4.8% failure chance 8% failure chance 12.8% failure chance
  • 33. Availability - Two brokers, single ZK
  • 34. Availability - Collapse Host Faults .8% 2% 1% 1% 4% 98.4% success chance
  • 35. Availability - Multiple ZKs 99.4% success chance
  • 36. Availability - Three Brokers 99.95% success chance
  • 37. Availability - Summary Success Probability Cost Per Nine* Standalone 87.2% n/a Two brokers, ISR=1, One ZK 98.36% 2 Two brokers, ISR=1, Three ZKs 99.36% 2 Two brokers, ISR=1, Five ZKs 99.36% 3 Three brokers, ISR=1, Three ZKs 99.95% 1.5 Three brokers, ISR=2, Three ZKs 99.36% 2.25 * Cost is computed in “disk units” / “number of nines”: Kafka Broker Rotational Disk = .5 Zookeeper SSD Disk = 1 Lower is better
  • 38. Availability - Broker SSD Success Probability Cost Per Nine* Standalone 90.4% 1 Two brokers, ISR=1, One ZK 99.08% 1.5 Two brokers, ISR=1, Three ZKs 99.77% 2.5 Two brokers, ISR=1, Five ZKs 99.77% 3.5 Three brokers, ISR=1, Three ZKs 99.99% 1.5 Three brokers, ISR=2, Three ZKs 99.77% 3 * SSD Disk = 1
  • 39. Availability - Broker EBS Success Probability Cost Per Nine* Standalone 91.6% 1.5 Two brokers, ISR=1, One ZK 99.29% 2.25 Two brokers, ISR=1, Three ZKs 99.82% 3.75 Two brokers, ISR=1, Five ZKs 99.82% 5.25 Three brokers, ISR=1, Three ZKs 99.99% 4.5 Three brokers, ISR=2, Three ZKs 99.82% 2.25 * EBS disk units: EBS SSD Disk = 1.5 Assumption that EBS fails at .2%
  • 40. What are the chances of losing data? Service Level Objective (SLO): 99.99999% durability per year Durability
  • 41. We lose data when all hosts with replicas go down Assumptions: 6TB per broker (2TB per disk w/ RAID) 70MB/s replication rate ~24 hours to replicate full broker We replace bad hosts almost immediately Durability
  • 42. Durability - Two brokers - One 6TB Disk 1 - e-p(A)*t * 1 - e-p(B)*t 99.999999% durability
  • 43. Durability - Add Raid0 99.9999% durability
  • 44. Durability - Three brokers 99.99999999% durability
  • 45. Assumption: 48TB per broker 70MB/s replication rate ~8 days to replicate full broker We replace bad hosts almost immediately Durability
  • 46. Durability - Three brokers 24 disks 99.999% completeness 24 disks
  • 47. Durability - Summary Data completeness Cost Per Nine* Standalone 99.99% .125 Two brokers 99.999999% .5 Two brokers RAID0 99.9999% .22 Three brokers RAID0 99.99999999% .15 Three brokers RAID0 - 48TB 99.999% 1 * Cost is computed in “disk units” / “number of nines”: Single non-raid disk = .5 Raid0 = .167 Zookeeper SSD Disk = 1 Lower is better
  • 48. Latency FTA focused on failures Latency is not an inherent failure Experiment with worst-case scenarios
  • 50. Fault Tree Models: github.com/afalko/fta-kafka OSS tool to draw and compute models: github.com/rakhimov/scram How Not to Go Boom: Lessons for SREs from Oil Refineries by Emil Stolarsky Fault Tree Analysis - A History by Clifton A. Ericson II Fault Tree Handbook with Aerospace Applications by Dr. Michael Stamatelatos and Mr. José Caraballo Failure Trends in a Large Disk Drive Population by Eduardo Pinheiro, Wolf-Dietrich Weber and Luiz Andre Barroso Solving Data Loss in Massive Storage Systems by Jason Resch Failures at Scale and How to Ride Through Them by James Hamilton Tools and References
  • 51. FTA can be applied to: Kafka Availability and Durability SLOs Find cost savings Uncover decisions that reduce reliability Takeaways
  • 52. Kafka on Kubernetes analysis Improve scram-pra Better FTA inputs via Distributed Tracing Further Work