Video and slides are synchronized; the mp3 and slide downloads are available at https://bit.ly/36epVKg.
Todd Montgomery discusses the techniques and lessons learned from implementing Aeron Cluster. His focus is on how Raft can be implemented on Aeron, minimizing network round-trip overhead, and comparing a single process to a fully distributed cluster. Filmed at qconsf.com.
Todd Montgomery is a networking hacker who has researched, designed, and built numerous protocols, messaging-oriented middleware systems, and real-time data systems, done research for NASA, contributed to the IETF and IEEE, and co-founded two startups. He currently works as an independent consultant and is active in several open source projects.
Watch the video with slide synchronization on InfoQ.com: https://www.infoq.com/presentations/aeron-cluster-raft/
34. Raft Consensus
An event must be recorded at a majority of replicas before it may be consumed by any replica.
Replicated State Machines
https://raft.github.io/
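As a rough illustration of that rule (a sketch, not the Aeron Cluster implementation), the Java below applies log events to a service only once their position is covered by the commit position, i.e. the position known to be stored on a majority of replicas. The names LogEvent, onCommitPosition, and onLogEventAppended are hypothetical.

import java.util.ArrayDeque;
import java.util.Queue;

// Sketch of the Raft rule above: a replica appends log events as they arrive,
// but its state machine only consumes events at or below the commit position,
// i.e. the position known to be recorded on a majority of replicas.
final class ReplicatedStateMachine
{
    // Hypothetical log event: a position in the log plus an opaque payload.
    record LogEvent(long position, byte[] payload) {}

    private final Queue<LogEvent> appendedButUncommitted = new ArrayDeque<>();
    private long commitPosition = 0;

    // Called when consensus learns that a new position is stored on a majority.
    void onCommitPosition(final long newCommitPosition)
    {
        commitPosition = Math.max(commitPosition, newCommitPosition);

        LogEvent next;
        while ((next = appendedButUncommitted.peek()) != null && next.position() <= commitPosition)
        {
            appendedButUncommitted.poll();
            applyToService(next);
        }
    }

    // Events are appended locally as they arrive, but not yet consumed.
    void onLogEventAppended(final LogEvent event)
    {
        appendedButUncommitted.add(event);
    }

    private void applyToService(final LogEvent event)
    {
        // Deterministic business logic runs here, in the same order on every
        // replica, so all replicas arrive at the same state.
    }
}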
65. Ingress Message, Sequence, Disseminate
[Diagram: the Client sends an Ingress message to the Leader; the Leader sequences it onto the Log and disseminates the Log Event to Follower X and Follower Y over the Log channel (multicast or serial unicast); members exchange positions over the Member Status channel.]
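For the ingress leg of that picture, below is a minimal client-side sketch using the Aeron Cluster client API (AeronCluster). The channel and endpoint strings are placeholders; a running Media Driver and cluster are assumed, and error handling is omitted.

import io.aeron.cluster.client.AeronCluster;
import org.agrona.ExpandableArrayBuffer;

// Sketch: a client publishes an ingress message to the cluster leader, which
// will sequence it onto the log and disseminate it to the followers.
// Endpoints below are placeholders; a Media Driver must already be running.
public final class IngressClient
{
    public static void main(final String[] args)
    {
        final AeronCluster.Context ctx = new AeronCluster.Context()
            .ingressChannel("aeron:udp")
            .ingressEndpoints("0=host0:9000,1=host1:9000,2=host2:9000");

        try (AeronCluster cluster = AeronCluster.connect(ctx))
        {
            final ExpandableArrayBuffer buffer = new ExpandableArrayBuffer();
            final int length = buffer.putStringAscii(0, "hello cluster");

            // offer() is non-blocking; retry while keeping the session polled.
            while (cluster.offer(buffer, 0, length) < 0)
            {
                cluster.pollEgress();
                Thread.yield();
            }
        }
    }
}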
69. Stream Positions
[Diagram: Leader and Follower connected by the Log channel (multicast or serial unicast) and the Member Status channel, with example positions: Log Event @8192, Archive Position @8096 (Leader), Archive Position @7168 (Follower), Append Position @6912, Commit Position @4096.]
The archive stores the log locally, asynchronously to position processing by Consensus and log processing by the Service.
Batching: Log, Appends, Commits.
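The commit position follows from the append positions reported over the member-status channel. Below is a hedged sketch (not Aeron's actual code) of the quorum calculation: the commit position can advance to the highest position that a majority of members, including the leader, have appended and stored.

import java.util.Arrays;

// Sketch: the quorum (commit) position is the greatest log position that at
// least a majority of members have reached with their append positions.
final class QuorumPosition
{
    static long quorumPosition(final long[] appendPositions)
    {
        final long[] sorted = appendPositions.clone();
        Arrays.sort(sorted);

        final int majority = (sorted.length / 2) + 1;

        // With positions sorted ascending, the entry at index (n - majority)
        // is the highest position reached by at least `majority` members.
        return sorted[sorted.length - majority];
    }

    public static void main(final String[] args)
    {
        // Illustrative numbers only: leader at 8192, followers at 6912 and 4096.
        System.out.println(quorumPosition(new long[] { 8192, 6912, 4096 })); // 6912
    }
}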
71. Recovery Positions
[Diagram: three Followers with differing local positions, e.g. Archive Position @8096 / @7584 / @7168, Commit Position @4096 / @4064 / @4032, Service Position @4096 / @4064 / @3776.]
A synchronous system doesn’t make this complexity go away! An election still needs to assert the state of the cluster, and members still need to catch up locally.
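A rough sketch of the per-member bookkeeping implied by these positions (an assumption about recovery in general, not Aeron's election algorithm): on recovery, a member can replay already-committed, locally-archived log to its service, and anything beyond its local archive must be caught up from another member once the election has established the cluster's state. MemberPositions is a hypothetical type.

// Hypothetical snapshot of one member's local positions.
record MemberPositions(long servicePosition, long commitPosition, long archivePosition) {}

final class Recovery
{
    // Committed, locally-recorded log that the service can replay immediately.
    static long localReplayLength(final MemberPositions p)
    {
        return Math.min(p.commitPosition(), p.archivePosition()) - p.servicePosition();
    }

    // Log beyond the local archive must be fetched from another member
    // (catch-up) once the election has asserted the cluster's state.
    static long catchUpLength(final MemberPositions p, final long clusterAppendPosition)
    {
        return Math.max(0, clusterAppendPosition - p.archivePosition());
    }

    public static void main(final String[] args)
    {
        final MemberPositions follower = new MemberPositions(3776, 4032, 7168);
        System.out.println(localReplayLength(follower));    // 256
        System.out.println(catchUpLength(follower, 8096));  // 928
    }
}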
73. Constant Delay Network: Round-Trip Time (RTT)
[Diagram: the Client sends Ingress to the Leader; the Leader disseminates the Log Event to the Followers over the Log channel (multicast or serial unicast); Commit Position and Append Position flow over the Member Status channel; Service A and Service Ox consume the log.]
Client to Service A: 0.5 RTT
Client to Service Ox: 1 RTT
Client to Service A (on Commit): 1.5 RTT
Client to Service Ox (on Commit): 2 RTT
74. Limits from Constant Delay
Path                               Shared Memory   Rack (Kernel Bypass)   DC
                                   (RTT <100ns)    (RTT <10us)            (RTT <100us)
Client to Service A                50ns            5us                    50us
Client to Service Ox               100ns           10us                   100us
Client to Service A (on Commit)    150ns           15us                   150us
Client to Service Ox (on Commit)   200ns           20us                   200us
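The figures above follow directly from the 0.5/1/1.5/2 RTT multiples on the previous slide. A tiny, illustrative-only sketch that reproduces the table at the stated RTT bounds:

// Reproduce the "Limits from Constant Delay" table: each path is a fixed
// multiple of the round-trip time (RTT) of the underlying transport.
final class ConstantDelayLimits
{
    public static void main(final String[] args)
    {
        final String[] transports = { "Shared Memory", "Rack (Kernel Bypass)", "DC" };
        final double[] rttNanos = { 100, 10_000, 100_000 };

        final String[] paths = {
            "Client to Service A", "Client to Service Ox",
            "Client to Service A (on Commit)", "Client to Service Ox (on Commit)" };
        final double[] rttMultiples = { 0.5, 1.0, 1.5, 2.0 };

        for (int t = 0; t < transports.length; t++)
        {
            for (int p = 0; p < paths.length; p++)
            {
                System.out.printf("%-20s  %-33s  %,.0f ns%n",
                    transports[t], paths[p], rttMultiples[p] * rttNanos[t]);
            }
        }
    }
}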
75. Measured Latency at Throughput
[Chart: RTT (us) from 0 to 300 by percentile (Min, 0.50, 0.90, 0.99, 0.9999, 0.999999, Max), at 100K msgs/sec and 200K msgs/sec.]
Test setup (courtesy of Mark Price):
- Intel Xeon Gold 5118 (2.30GHz, 12 cores)
- 32GB DDR4 2400MHz ECC RAM
- Intel Optane SSD 900P Series 480GB
- SolarFlare X2522-PLUS 10GbE NIC
- All servers connected to an Arista 7150S switch
- CentOS Linux 7.7, kernel 4.4.195-1.el7.elrepo.x86_64, tuned for a low-latency workload
Single client session, bursts of 20 x 200B messages, 3-node cluster, the service echoes the payload back.