Different requirements (high availability, data residency) and high-level designs for AWS cross-region data replication (S3 vs. DynamoDB vs. Kinesis vs. Couchbase vs. Cassandra). This talk focuses on requirements, data consistency, and write conflicts (with a CRDT example). It is a "theoretical" talk in the sense that no Forter-specific design is presented; it is meant to guide architects who want to design their services with "cross-region" in mind.
2. Our financial institutions remain strong, and the American
economy will be open for business as well.
3. Decision as a Service Example
[Diagram: TX → Fraud Decision (100ms) → Decision]
if isFraud(tx.address, tx.payment) {
    return DECLINE;
} else {
    return APPROVE;
}
4. Decision as a Service Example
[Diagram: "Change Account Address" and "Change Account Payment" events → Event Processor (1000ms) → partial update → Unified People Store; TX → Fraud Decision (100ms), which reads from the Unified People Store → Decision]
5. Design "It'll Be Fine"
● No Cross Region Replication
[Diagram: raw event → Event Processor → People Store; TX → Fraud Decision → Decision]
6. Design "Leave It to Me"
● Cron Sync every 3 hours
● Replication != Reconciliation
● Replication != Backup
[Diagram: two regions, each with raw event → Event Processor → People Store and TX → Fraud Decision → Decision; a Cron Sync runs between the two People Stores]
7. Design "Fine, Let's Go"
● Read-Only RDS Replica
● Proxying data into a single Data Center
● Requires quarterly failover drills
● Cannot withstand a real disaster for long
[Diagram: primary region with raw event → Event Processor → People Store and TX → Fraud Decision → Decision; secondary region with raw event → Event Forwarder (forwarding to the primary region), a People Store kept up to date by RDS Replication, and TX → Fraud Decision → Decision]
8. Design "Two for the Price of One"
● CloudEndure DRaaS
● Point In Time Recovery
● Requires quarterly failover drills
● For existing apps (Enterprises)
[Diagram: raw event → Event Processor → People Store and TX → Fraud Decision → Decision; Block Device Replication copies the People Store to a second People Store]
9. Design "Wait, Wait"
● Google Cloud Spanner Is Here
Geo Distributed Transactions Are Coming
● For green-field apps (Startups)
[Diagram: two regions, each with raw event → Event Processor → People Store and TX → Fraud Decision → Decision; Transactions span the two People Stores]
10. Design "Trust Me"
● Out-Of-The-Box, Real-Time, Bi-Directional, Data-Center Aware Replication
● Write Conflict resolution
[Diagram: two regions, each with raw event → Event Processor → People Store and TX → Fraud Decision → Decision; 2-Way Replication between the two People Stores]
11. Design "Its Sister"
● Replication of Raw Events
● State Divergence
[Diagram: two regions, each with raw event → Event Processor → People Store and TX → Fraud Decision → Decision; 2-Way Replication between the regions]
12. Read Consistency Guarantees
Loosely based on "Replicated Data Consistency Explained Through Baseball" by Doug Terry
● Strong ⇒ 2:2
○ See all previous writes
● Read own Writes
○ See all writes performed by reader
● Monotonic ⇒ 2:1
○ See all writes from the beginning up to N seconds ago
● Eventual ⇒ 1:2
○ See the writes in different order (some still missing)
time   partial update   state
15m    Hapoel = 1       1:0
32m    Maccabi = 1      1:1
89m    Hapoel = 2       2:1
91m    Maccabi = 2      2:2
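A minimal sketch of how the three example scores on this slide can arise from the same four partial updates. The class and method names (ScoreWrite, ScoreReads, prefixRead, eventualRead) are hypothetical and not from the talk; "prefix" models the slide's "up to N seconds ago" read, and "eventual" models a replica that has received only some of the writes.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// One partial update from the slide's table: at `minute`, `team` is set to `goals`.
final class ScoreWrite {
    final int minute;
    final String team;
    final int goals;

    ScoreWrite(int minute, String team, int goals) {
        this.minute = minute;
        this.team = team;
        this.goals = goals;
    }
}

final class ScoreReads {
    // The four partial updates, in the order they were written.
    static final List<ScoreWrite> LOG = List.of(
            new ScoreWrite(15, "Hapoel", 1),
            new ScoreWrite(32, "Maccabi", 1),
            new ScoreWrite(89, "Hapoel", 2),
            new ScoreWrite(91, "Maccabi", 2));

    // Applies the given writes in order and renders the state as "Hapoel:Maccabi".
    static String apply(List<ScoreWrite> writes) {
        Map<String, Integer> state = new HashMap<>();
        for (ScoreWrite w : writes) {
            state.put(w.team, w.goals);
        }
        return state.getOrDefault("Hapoel", 0) + ":" + state.getOrDefault("Maccabi", 0);
    }

    // Strong: the reader sees every previous write.
    static String strongRead() {
        return apply(LOG);
    }

    // Prefix of the log: every write up to a cutoff, nothing after it.
    static String prefixRead(int upToMinute) {
        List<ScoreWrite> seen = new ArrayList<>();
        for (ScoreWrite w : LOG) {
            if (w.minute <= upToMinute) {
                seen.add(w);
            }
        }
        return apply(seen);
    }

    // Eventual: the replica has received an arbitrary subset of the writes so far.
    static String eventualRead(int... receivedIndexes) {
        List<ScoreWrite> seen = new ArrayList<>();
        for (int i : receivedIndexes) {
            seen.add(LOG.get(i));
        }
        return apply(seen);
    }
}
```

With these definitions, strongRead() yields 2:2, prefixRead(90) yields 2:1, and eventualRead(0, 3) yields 1:2, a score that never existed during the game.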
13. Hello Couchbase
read-mutate-write of the entire state
Client reaches the cluster's primary node
Conflict prevention: CAS (compare-and-swap)
Optimizations: subdocument API
[Diagram: one cluster with nodes in us-west-2b and us-west-2c; the Event Processor (read/mutate/write) and TX Decision (read) both get Strong consistency]
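The read-mutate-write loop with CAS can be sketched as follows. This is an in-memory illustration under stated assumptions, not the Couchbase SDK: CasStore, Versioned, replace and update are hypothetical names, and the CAS value is modeled as a simple counter.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

// In-memory stand-in for a document store that supports compare-and-swap.
final class CasStore {
    // A document together with the CAS value it had when it was read.
    static final class Versioned {
        final Map<String, String> doc;
        final long cas;

        Versioned(Map<String, String> doc, long cas) {
            this.doc = doc;
            this.cas = cas;
        }
    }

    private final Map<String, Versioned> data = new HashMap<>();

    // Returns the stored document, or an empty one with CAS 0 if the key is new.
    synchronized Versioned get(String key) {
        Versioned v = data.get(key);
        return v != null ? v : new Versioned(new HashMap<>(), 0L);
    }

    // Writes only if nobody changed the document since expectedCas was read.
    synchronized boolean replace(String key, Map<String, String> doc, long expectedCas) {
        if (get(key).cas != expectedCas) {
            return false;
        }
        data.put(key, new Versioned(new HashMap<>(doc), expectedCas + 1));
        return true;
    }

    // read-mutate-write of the entire state; retries when the CAS check fails.
    // Returns the number of attempts that were needed.
    int update(String key, UnaryOperator<Map<String, String>> mutate) {
        int attempts = 0;
        while (true) {
            attempts++;
            Versioned current = get(key);
            Map<String, String> next = mutate.apply(new HashMap<>(current.doc));
            if (replace(key, next, current.cas)) {
                return attempts;
            }
        }
    }
}
```

A writer holding a stale CAS value is rejected instead of silently overwriting a concurrent partial update, which is the conflict prevention the slide refers to.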
14. Hello Couchbase
XDCR replicates entire state between clusters
Optimizations: dedup by key, metadata first
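[Diagram: the us-west-2 cluster (nodes in us-west-2b and us-west-2c) serves the Event Processor (read/mutate/write) and TX Decision (read) with Strong consistency; XDCR replicates to the us-east-1 cluster (nodes in us-east-1c and us-east-1b), where TX Decision (read) gets Monotonic consistency]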
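The "dedup by key" optimization can be illustrated with a small sketch. It is an assumption-laden toy, not Couchbase's XDCR implementation: ReplicationQueue, enqueue and drain are hypothetical names, and documents are plain strings.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Outgoing replication queue that keeps only the latest full state per key.
final class ReplicationQueue {
    // key -> latest document waiting to be shipped to the remote cluster
    private final Map<String, String> pending = new LinkedHashMap<>();

    // A newer state for the same key replaces the one still waiting in the queue.
    void enqueue(String key, String doc) {
        pending.remove(key); // re-insert so the key moves to the end of the queue
        pending.put(key, doc);
    }

    // Ships everything that is pending, as "key=doc" strings, and empties the queue.
    List<String> drain() {
        List<String> batch = new ArrayList<>();
        for (Map.Entry<String, String> e : pending.entrySet()) {
            batch.add(e.getKey() + "=" + e.getValue());
        }
        pending.clear();
        return batch;
    }
}
```

Because the entire state is replicated rather than individual partial updates, dropping the superseded version of a key loses nothing the remote cluster needs.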
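23. Kafka Processor API and Local Store
[Diagram, Design "Trust Me": Event Source (insert) → Kafka → Kafka Stream API (kstream1, kstream2, ktable); Kafka MirrorMaker (?) between regions; Kafka S3 Connector → S3]
Map<String, Object> process(Map<String, Object> event) {
    String key = (String) event.get("key"); // assumes the event carries its key under "key"
    Map<String, Object> state = kvStore.get(key);
    if (state == null) {
        state = new HashMap<>(); // first event seen for this key
    }
    state.putAll(event); // not commutative (order matters)
    kvStore.put(key, state);
    return state;
}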
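A small sketch of why the putAll merge in process() is not commutative: replaying the same two events in a different order leaves a different state. LastWriteMerge and replay are hypothetical names used only for this illustration.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class LastWriteMerge {
    // Folds events into a state map the same way process() does: later fields overwrite earlier ones.
    static Map<String, String> replay(List<Map<String, String>> events) {
        Map<String, String> state = new HashMap<>();
        for (Map<String, String> event : events) {
            state.putAll(event);
        }
        return state;
    }
}
```

If two regions receive an address change to "Tel Aviv" and another to "Haifa" in opposite orders, each region ends with a different address, which is the state divergence that motivates the CRDT slides.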
24. CRDT Graph Model
Conflict-free Replicated Data Type
Idempotent, Commutative, Associative
● Insert Only Graph
● Address / Payment / Person Objects
25. G-Set: Growing Set CRDT
Conflict-free Replicated Data Type
Idempotent, Commutative, Associative
[Diagram: elements A, B]
us-west-2 event: {A,B}  →  us-east-1 state: {A,B}
26. G-Set: Growing Set CRDT
Conflict resolution method: merge sets
[Diagram: elements A, C, B]
us-west-2 event: {A,B}  →  us-east-1 state: {A,B}
us-west-2 event: {A,C}  →  us-east-1 state: {A,B,C}
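A minimal G-Set sketch, assuming string elements; the class name GSet and its methods are illustrative and not taken from the talk. Merge is set union, which is what makes it idempotent, commutative and associative.

```java
import java.util.Set;
import java.util.TreeSet;

// Grow-only set: elements can be added but never removed.
final class GSet {
    private final Set<String> items = new TreeSet<>();

    static GSet of(String... elements) {
        GSet s = new GSet();
        for (String e : elements) {
            s.add(e);
        }
        return s;
    }

    void add(String element) {
        items.add(element);
    }

    // Conflict resolution: merge sets (union). Safe to apply twice or in any order.
    void merge(GSet other) {
        items.addAll(other.items);
    }

    // Returns a copy so callers cannot shrink the set.
    Set<String> value() {
        return new TreeSet<>(items);
    }
}
```

Merging the event {A,C} into the state {A,B} gives {A,B,C}, and merging it a second time changes nothing.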
27. 2P-Set: Two Phase Set CRDT
Comprised of two G-Sets (added and tombstone)
[Diagram: elements A, B]
us-west-2 event: add: {A,B}, rmv: {A}  →  us-east-1 state: add: {A,B}, rmv: {A}
28. 2P-Set: Two Phase Set CRDT
[Diagram: elements A, C, B]
us-west-2 event: add: {A,B}, rmv: {A}  →  us-east-1 state: add: {A,B}, rmv: {A}
us-west-2 event: add: {A,C}, rmv: {B,D}  →  us-east-1 state: add: {A,B,C}, rmv: {A,B,D}
Always grows
Garbage Collection algorithms exist.
29. 2P-Set: Two Phase Set CRDT
[Diagram: elements D, A, C, B]
us-west-2 event: add: {A,B}, rmv: {A}  →  us-east-1 state: add: {A,B}, rmv: {A}
us-west-2 event: add: {A,C}, rmv: {B,D}  →  us-east-1 state: add: {A,B,C}, rmv: {A,B,D}
us-west-2 event: add: {D}  →  us-east-1 state: add: {A,B,C,D}, rmv: {A,B,D}
Always grows
Garbage Collection algorithms exist.
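A minimal 2P-Set sketch built from two grow-only sets, assuming string elements; TwoPSet and its members are hypothetical names. It replays the three events from the slides.

```java
import java.util.Set;
import java.util.TreeSet;

// Two-phase set: one grow-only set of additions, one grow-only set of tombstones.
final class TwoPSet {
    final Set<String> added = new TreeSet<>();
    final Set<String> removed = new TreeSet<>();

    static TwoPSet of(Set<String> add, Set<String> rmv) {
        TwoPSet s = new TwoPSet();
        s.added.addAll(add);
        s.removed.addAll(rmv);
        return s;
    }

    // Merge is a union of both underlying sets, so both of them always grow.
    void merge(TwoPSet other) {
        added.addAll(other.added);
        removed.addAll(other.removed);
    }

    // Visible elements: added and not tombstoned. A tombstoned element cannot come back.
    Set<String> value() {
        Set<String> visible = new TreeSet<>(added);
        visible.removeAll(removed);
        return visible;
    }
}
```

After the three events the visible value is {C}: D was tombstoned before its add arrived, so the later add: {D} does not resurrect it.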
30. 2P2P-Graph CRDT
2P-Set for vertices, 2P-Set for edges
resolution method: remove wins
[Diagram: vertices A, C, B]
us-west-2 event: add_v: {A,B,C}, rmv_v: {}, add_e: {AB,AC,BC}, rmv_e: {}
  →  us-east-1 state: add_v: {A,B,C}, rmv_v: {}, add_e: {AB,AC,BC}, rmv_e: {}
31. 2P2P-Graph CRDT
2P-Set for vertices, 2P-Set for edges
resolution method: remove wins
[Diagram: vertices A, C, B]
us-west-2 event: add_v: {A,B,C}, rmv_v: {}, add_e: {AB,AC,BC}, rmv_e: {}
  →  us-east-1 state: add_v: {A,B,C}, rmv_v: {}, add_e: {AB,AC,BC}, rmv_e: {}
us-west-2 event: add_v: {}, rmv_v: {A}, add_e: {}, rmv_e: {}
32. 2P2P-Graph CRDT
2P-Set for vertices, 2P-Set for edges
resolution method: remove wins
[Diagram: vertices A, C, B]
us-west-2 event: add_v: {A,B,C}, rmv_v: {}, add_e: {AB,AC,BC}, rmv_e: {}
  →  us-east-1 state: add_v: {A,B,C}, rmv_v: {}, add_e: {AB,AC,BC}, rmv_e: {}
us-west-2 event: add_v: {}, rmv_v: {A}, add_e: {}, rmv_e: {}
  →  us-east-1 state: add_v: {A,B,C}, rmv_v: {A}, add_e: {AB,AC,BC}, rmv_e: {AB,AC}
33. 2P2P-Graph CRDT
2P-Set for vertices, 2P-Set for edges
resolution method: remove wins
[Diagram: vertices A, D, C, B]
us-west-2 event: add_v: {A,B,C}, rmv_v: {}, add_e: {AB,AC,BC}, rmv_e: {}
  →  us-east-1 state: add_v: {A,B,C}, rmv_v: {}, add_e: {AB,AC,BC}, rmv_e: {}
us-west-2 event: add_v: {}, rmv_v: {A}, add_e: {}, rmv_e: {}
  →  us-east-1 state: add_v: {A,B,C}, rmv_v: {A}, add_e: {AB,AC,BC}, rmv_e: {AB,AC}
us-west-2 event: add_v: {D}, rmv_v: {}, add_e: {AD}, rmv_e: {}
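One way the "remove wins" rule could be implemented, as a sketch under stated assumptions: vertices have single-letter names and an edge is the two-letter string of its endpoints, as on the slides. TwoPTwoPGraph and apply are hypothetical names. The state after the third event is derived by this sketch and does not appear on the slides.

```java
import java.util.Set;
import java.util.TreeSet;

// Graph CRDT built from a 2P-Set of vertices and a 2P-Set of edges.
final class TwoPTwoPGraph {
    final Set<String> addV = new TreeSet<>();
    final Set<String> rmvV = new TreeSet<>();
    final Set<String> addE = new TreeSet<>();
    final Set<String> rmvE = new TreeSet<>();

    // Merges one event (four sets) into the state, then enforces "remove wins".
    void apply(Set<String> av, Set<String> rv, Set<String> ae, Set<String> re) {
        addV.addAll(av);
        rmvV.addAll(rv);
        addE.addAll(ae);
        rmvE.addAll(re);
        // Remove wins: every known edge that touches a removed vertex is tombstoned.
        for (String edge : addE) {
            String from = edge.substring(0, 1);
            String to = edge.substring(1, 2);
            if (rmvV.contains(from) || rmvV.contains(to)) {
                rmvE.add(edge);
            }
        }
    }

    // Vertices that were added and not removed.
    Set<String> vertices() {
        Set<String> visible = new TreeSet<>(addV);
        visible.removeAll(rmvV);
        return visible;
    }

    // Edges that were added and not removed.
    Set<String> edges() {
        Set<String> visible = new TreeSet<>(addE);
        visible.removeAll(rmvE);
        return visible;
    }
}
```

Applying the first two events reproduces rmv_e: {AB,AC}. In this sketch the third event leaves edge AD tombstoned as well, because its endpoint A was already removed, so the visible graph is vertices {B,C,D} with the single edge BC.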
35. Sometimes the state won't converge easily
● Missing events (broken links)
○ integrity checks
○ repair
● Rerunning bulk events after downtime
○ Clocks: Event vs. Ingestion vs. Processor vs. Logical
○ Enrichment: IP address reputation changes daily
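The integrity check and repair steps can be sketched as a reconciliation pass between two regions. This is an assumption-heavy illustration: each region is modeled as a map from a key to the set of event IDs it has applied, and Reconciler, diff and repair are hypothetical names.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class Reconciler {
    // Integrity check: keys whose applied-event sets differ between the two regions.
    static Set<String> diff(Map<String, Set<String>> a, Map<String, Set<String>> b) {
        Set<String> keys = new TreeSet<>(a.keySet());
        keys.addAll(b.keySet());
        Set<String> mismatched = new TreeSet<>();
        for (String key : keys) {
            Set<String> inA = a.getOrDefault(key, Set.of());
            Set<String> inB = b.getOrDefault(key, Set.of());
            if (!inA.equals(inB)) {
                mismatched.add(key);
            }
        }
        return mismatched;
    }

    // Repair: give both regions the union of the events either one has seen.
    static void repair(Map<String, Set<String>> a, Map<String, Set<String>> b) {
        for (String key : diff(a, b)) {
            Set<String> union = new TreeSet<>(a.getOrDefault(key, Set.of()));
            union.addAll(b.getOrDefault(key, Set.of()));
            a.put(key, union);
            b.put(key, new TreeSet<>(union));
        }
    }
}
```

Union-based repair only converges cleanly when applying events is idempotent and order-independent, which is the property the CRDT designs provide.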
37. Takeaways
● Define business need for cross region
Availability, Latency, Residency, Analytics
● Know your NoSQL
Couchbase != Cassandra != Kafka
● Ask about CRDTs
LWW-Register, MV-Register, 2P-Sets, 2P2P-Graphs
● Use Reconciliation
● Dedicated Fiber and Atomic clocks ARE COMING
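Of the CRDTs named in the takeaways, the LWW-Register is the simplest to sketch. The version below is illustrative only: LwwRegister and its fields are hypothetical names, timestamps are plain longs, and ties are broken by region name so that merge stays commutative.

```java
// Last-writer-wins register: keeps the value with the highest timestamp.
final class LwwRegister {
    final String value;
    final long timestamp;
    final String region;

    LwwRegister(String value, long timestamp, String region) {
        this.value = value;
        this.timestamp = timestamp;
        this.region = region;
    }

    // Picks the newer write; on equal timestamps the larger region name wins,
    // so both regions reach the same answer regardless of merge order.
    LwwRegister merge(LwwRegister other) {
        if (other.timestamp != timestamp) {
            return other.timestamp > timestamp ? other : this;
        }
        return other.region.compareTo(region) > 0 ? other : this;
    }
}
```

The register converges, but it silently discards the losing write and depends on how well the regions' clocks agree.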
38. “The Internet was designed to be an academic medium.
It was not designed to handle this level of transactions”
Fred Matteson @ schwab.com 1999
39. Advanced Topics
● The real world is more like a slaughterhouse than a pharmacy
● Multi Data Center Topologies
○ Star (SPOF, simple)
○ Ring (TLV ←→ Eilat ←→ Jerusalem ←→ TLV)
○ Mesh (resilient, complex)
● Data Residency
○ Separate PII from data
○ Peek at other data centers ad-hoc