Video and slides are synchronized; the mp3 and slide downloads are available at https://bit.ly/36epVKg.
Todd Montgomery discusses the techniques and lessons learned from implementing Aeron Cluster. His focus is on how Raft can be implemented on Aeron, minimizing network round-trip overhead, and comparing a single process to a fully distributed cluster. Filmed at qconsf.com.
Todd Montgomery is a networking hacker who has researched, designed, and built numerous protocols, messaging-oriented middleware systems, and real-time data systems, done research for NASA, contributed to the IETF and IEEE, and co-founded two startups. He currently works as an independent consultant and is active in several open source projects.
Watch the video with slide synchronization on InfoQ.com: https://www.infoq.com/presentations/aeron-cluster-raft/
34. Raft Consensus
An event must be recorded at a majority of replicas before it may be consumed by any replica.
Replicated State Machines
https://raft.github.io/
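As a rough illustration of that rule (a sketch, not the Aeron Cluster implementation), the Java below applies log events to a service only once their position is covered by the commit position, i.e. the position known to be stored on a majority of replicas. The names LogEvent, onCommitPosition, and onLogEventAppended are hypothetical.

import java.util.ArrayDeque;
import java.util.Queue;

// Sketch of the Raft rule above: a replica appends log events as they arrive,
// but its state machine only consumes events at or below the commit position,
// i.e. the position known to be recorded on a majority of replicas.
final class ReplicatedStateMachine
{
    // Hypothetical log event: a position in the log plus an opaque payload.
    record LogEvent(long position, byte[] payload) {}

    private final Queue<LogEvent> appendedButUncommitted = new ArrayDeque<>();
    private long commitPosition = 0;

    // Called when consensus learns that a new position is stored on a majority.
    void onCommitPosition(final long newCommitPosition)
    {
        commitPosition = Math.max(commitPosition, newCommitPosition);

        LogEvent next;
        while ((next = appendedButUncommitted.peek()) != null && next.position() <= commitPosition)
        {
            appendedButUncommitted.poll();
            applyToService(next);
        }
    }

    // Events are appended locally as they arrive, but not yet consumed.
    void onLogEventAppended(final LogEvent event)
    {
        appendedButUncommitted.add(event);
    }

    private void applyToService(final LogEvent event)
    {
        // Deterministic business logic runs here, in the same order on every
        // replica, so all replicas arrive at the same state.
    }
}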
65. Ingress Message, Sequence, Disseminate
[Diagram: the Client sends an Ingress message to the Leader; the Leader sequences it onto the Log and disseminates the Log Event to Follower X and Follower Y over the Log channel (multicast or serial unicast); members exchange positions over the Member Status channel.]
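For the ingress leg of that picture, below is a minimal client-side sketch using the Aeron Cluster client API (AeronCluster). The channel and endpoint strings are placeholders; a running Media Driver and cluster are assumed, and error handling is omitted.

import io.aeron.cluster.client.AeronCluster;
import org.agrona.ExpandableArrayBuffer;

// Sketch: a client publishes an ingress message to the cluster leader, which
// will sequence it onto the log and disseminate it to the followers.
// Endpoints below are placeholders; a Media Driver must already be running.
public final class IngressClient
{
    public static void main(final String[] args)
    {
        final AeronCluster.Context ctx = new AeronCluster.Context()
            .ingressChannel("aeron:udp")
            .ingressEndpoints("0=host0:9000,1=host1:9000,2=host2:9000");

        try (AeronCluster cluster = AeronCluster.connect(ctx))
        {
            final ExpandableArrayBuffer buffer = new ExpandableArrayBuffer();
            final int length = buffer.putStringAscii(0, "hello cluster");

            // offer() is non-blocking; retry while keeping the session polled.
            while (cluster.offer(buffer, 0, length) < 0)
            {
                cluster.pollEgress();
                Thread.yield();
            }
        }
    }
}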
69. Stream Positions
[Diagram: Leader and Follower connected by the Log channel (multicast or serial unicast) and the Member Status channel, with example positions: Log Event @8192, Archive Position @8096 (Leader), Archive Position @7168 (Follower), Append Position @6912, Commit Position @4096.]
The archive stores the log locally, asynchronously to position processing by Consensus and log processing by the Service.
Batching: Log, Appends, Commits.
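The commit position follows from the append positions reported over the member-status channel. Below is a hedged sketch (not Aeron's actual code) of the quorum calculation: the commit position can advance to the highest position that a majority of members, including the leader, have appended and stored.

import java.util.Arrays;

// Sketch: the quorum (commit) position is the greatest log position that at
// least a majority of members have reached with their append positions.
final class QuorumPosition
{
    static long quorumPosition(final long[] appendPositions)
    {
        final long[] sorted = appendPositions.clone();
        Arrays.sort(sorted);

        final int majority = (sorted.length / 2) + 1;

        // With positions sorted ascending, the entry at index (n - majority)
        // is the highest position reached by at least `majority` members.
        return sorted[sorted.length - majority];
    }

    public static void main(final String[] args)
    {
        // Illustrative numbers only: leader at 8192, followers at 6912 and 4096.
        System.out.println(quorumPosition(new long[] { 8192, 6912, 4096 })); // 6912
    }
}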
71. Recovery Positions
[Diagram: three Followers with differing local positions, e.g. Archive Position @8096 / @7584 / @7168, Commit Position @4096 / @4064 / @4032, Service Position @4096 / @4064 / @3776.]
A synchronous system doesn’t make this complexity go away! An election still needs to assert the state of the cluster, and members still need to catch up locally.
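A rough sketch of the per-member bookkeeping implied by these positions (an assumption about recovery in general, not Aeron's election algorithm): on recovery, a member can replay already-committed, locally-archived log to its service, and anything beyond its local archive must be caught up from another member once the election has established the cluster's state. MemberPositions is a hypothetical type.

// Hypothetical snapshot of one member's local positions.
record MemberPositions(long servicePosition, long commitPosition, long archivePosition) {}

final class Recovery
{
    // Committed, locally-recorded log that the service can replay immediately.
    static long localReplayLength(final MemberPositions p)
    {
        return Math.min(p.commitPosition(), p.archivePosition()) - p.servicePosition();
    }

    // Log beyond the local archive must be fetched from another member
    // (catch-up) once the election has asserted the cluster's state.
    static long catchUpLength(final MemberPositions p, final long clusterAppendPosition)
    {
        return Math.max(0, clusterAppendPosition - p.archivePosition());
    }

    public static void main(final String[] args)
    {
        final MemberPositions follower = new MemberPositions(3776, 4032, 7168);
        System.out.println(localReplayLength(follower));    // 256
        System.out.println(catchUpLength(follower, 8096));  // 928
    }
}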
73. Constant Delay Network: Round-Trip Time (RTT)
[Diagram: the Client sends Ingress to the Leader; the Leader disseminates the Log Event to the Followers over the Log channel (multicast or serial unicast); Commit Position and Append Position flow over the Member Status channel; Service A and Service Ox consume the log.]
Client to Service A: 0.5 RTT
Client to Service Ox: 1 RTT
Client to Service A (on Commit): 1.5 RTT
Client to Service Ox (on Commit): 2 RTT
74. Limits from Constant Delay
Path                               Shared Memory   Rack (Kernel Bypass)   DC
                                   (RTT <100ns)    (RTT <10us)            (RTT <100us)
Client to Service A                50ns            5us                    50us
Client to Service Ox               100ns           10us                   100us
Client to Service A (on Commit)    150ns           15us                   150us
Client to Service Ox (on Commit)   200ns           20us                   200us
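The figures above follow directly from the 0.5/1/1.5/2 RTT multiples on the previous slide. A tiny, illustrative-only sketch that reproduces the table at the stated RTT bounds:

// Reproduce the "Limits from Constant Delay" table: each path is a fixed
// multiple of the round-trip time (RTT) of the underlying transport.
final class ConstantDelayLimits
{
    public static void main(final String[] args)
    {
        final String[] transports = { "Shared Memory", "Rack (Kernel Bypass)", "DC" };
        final double[] rttNanos = { 100, 10_000, 100_000 };

        final String[] paths = {
            "Client to Service A", "Client to Service Ox",
            "Client to Service A (on Commit)", "Client to Service Ox (on Commit)" };
        final double[] rttMultiples = { 0.5, 1.0, 1.5, 2.0 };

        for (int t = 0; t < transports.length; t++)
        {
            for (int p = 0; p < paths.length; p++)
            {
                System.out.printf("%-20s  %-33s  %,.0f ns%n",
                    transports[t], paths[p], rttMultiples[p] * rttNanos[t]);
            }
        }
    }
}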
75. Measured Latency at Throughput
[Chart: RTT (us) from 0 to 300 by percentile (Min, 0.50, 0.90, 0.99, 0.9999, 0.999999, Max), at 100K msgs/sec and 200K msgs/sec.]
Test setup (courtesy of Mark Price):
- Intel Xeon Gold 5118 (2.30GHz, 12 cores)
- 32GB DDR4 2400MHz ECC RAM
- Intel Optane SSD 900P Series 480GB
- SolarFlare X2522-PLUS 10GbE NIC
- All servers connected to an Arista 7150S switch
- CentOS Linux 7.7, kernel 4.4.195-1.el7.elrepo.x86_64, tuned for a low-latency workload
Single client session, bursts of 20 x 200B messages, 3-node cluster, the service echoes the payload back.