This document discusses building event streaming architectures with Scylla and Confluent Kafka. It gives an overview of Scylla and how it is used alongside Kafka at Numberly, then covers change data capture (CDC) in Scylla and how to stream data from Scylla to Kafka using Kafka Connect and the Scylla source connector. The Kafka Connect framework and connectors capture changes from Scylla tables into Kafka topics to power downstream applications and tasks.
2. Presenters
Tim Berglund, Senior Director of Developer Advocacy
Alexys Jacob, CTO
Maheedhar Gunturu, Director of Technical Alliances
Othmane El Metioui, Chief Data Officer
3. Agenda
ᐩ Brief Intro to Scylla
ᐩ Scylla + Kafka at Numberly
ᐩ Change Data Capture in Scylla
ᐩ Streaming Data from Scylla to Kafka
4. About ScyllaDB
• Reimagined the NoSQL database
• Close-to-the-hardware design, written in C++
• Open source, enterprise & DBaaS
• From the creators of the KVM hypervisor
Winner of the InfoWorld Technology of the Year award
5. Grows with your business & your data
– Volume – Multi-petabyte
– Throughput – 1 billion OPS
– Horizontal Scalability – 1,000-node cluster
– Availability – 1 to 10+ replicas within a datacenter
– Consistent Latencies – Low single-digit millisecond p99s
– Vertical Scalability – 1 to 416 vCPUs
– Unlimited – Cell sizes and partition width
– Consistency Options – Eventual consistency to linearizability
7. Deployment options
On-Prem: Install in Your Datacenter
➔ Scylla Open Source
➔ Scylla Enterprise
➔ AWS Outposts
Cloud Hosted: Deploy at a Cloud Provider
➔ Scylla Open Source
➔ Scylla Enterprise
Scylla Cloud: Database as a Service
➔ Fully managed Scylla clusters
➔ Bring Your Own Account (BYOA) option
Kubernetes: Run on Kubernetes
➔ Manage with Scylla Operator
9. At Numberly, we run bare-metal clusters
Scylla
3 clusters, with multi-datacenter topology
• Staging
• Production web facing
• Production OLAP+OLTP
• RF=3 per DC
DELL hardware
• RAID0 NVMe
• Up to 96 AMD cores per node
• Up to 512GB RAM per node
Confluent Kafka
2 clusters, with active-active multi-datacenter topology
• Staging
• Production
DELL hardware
• 6 brokers: 12TB SSD (RAID0), 2x 24 cores, 64GB RAM
• 12 other nodes: Connect cluster, Schema Registry, Zookeepers...
10. Scylla Cloud & Confluent Cloud
TL;DR: The people behind the technology know better!
Cloud hosted solutions should be considered depending on your infrastructure maturity and hosting constraints.
Our experience shows that cloud providers such as AWS always lag behind on versions and provide poor monitoring & alerting capabilities.
12. Combining Scylla and Confluent Kafka powers
Scylla
• Scylla Manager
• Scylla Monitoring
• Easy data expiration (TTL) on large time windows (6+ months)
Confluent Kafka
• Kafka Connect & Exporter
• Schema Registry
• KSQL
• Home-made control center interface + Grafana
Started with in-house Kafka Streams and Python pipelines to propagate data changes between Scylla & Kafka
13. Combining Scylla and Confluent Kafka powers
The Confluent certified CDC connector will simplify our pipelines!
14. Scylla + Kafka at Numberly
Scylla is used as a low-latency remote state store providing easy data expiry capabilities to Kafka streams and pipelines (in & out)
15. Use case #1
Data pipeline enrichment
Scylla to the rescue in overcoming a JOIN window too large for Kafka
16. Use case #1: how we did it before
[Diagram] Numberly's web tracking → RabbitMQ exchange → beanstalkd → Python programs (write + read) ↔ Scylla (13+ months retention: high-throughput writes + low-latency reads, expiring data)
17. Use case #1: our first attempt
[Diagram] Numberly's web tracking → Kafka streams (write) → compacted topic → Kafka streams (read, KTable) → Kafka Connect → redis
18. Scaling limitations of Kafka JOIN windows
• The retention of our source data enriched from Scylla is long (13+ months)
Data set sizes average 150+GB per table, totaling 1.2+TB of source data
• Multiple successive JOINs are heavy on Kafka for large datasets
Large RocksDB state stores caused memory issues and Kubernetes pod OOM kills
Rebuilding the state store after a Kafka Streams (pod) restart took too long
Standby replicas come at a cost for large state stores
We turned to Scylla to be a remote, highly available, distributed state store (see the sketch below)!
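To make the remote state store pattern concrete, here is a minimal CQL sketch of such a table; the keyspace, table and column names are hypothetical (not Numberly's actual schema), and the default TTL matches the 13+ months retention window so expiry is handled by Scylla rather than by the pipeline:
cqlsh> CREATE TABLE enrichment.user_events (   -- hypothetical names, for illustration
           user_id text,                       -- partition key: low-latency point reads
           event_time timestamp,
           payload text,
           PRIMARY KEY (user_id, event_time)
       ) WITH CLUSTERING ORDER BY (event_time DESC)
       AND default_time_to_live = 34128000;    -- ~13 months (395 days): rows expire automatically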
19. Use case #1: how we do it today
[Diagram] Numberly's web tracking → Kafka streams (write) ↔ Scylla (13+ months retention: high-throughput writes + low-latency reads, expiring data) ↔ Kafka streams (read)
20. Use case #1: takeaways
• Metrics
Metrics are important for successful tuning (query response times, dataset size)
Use the Prometheus client instead of implementing Kafka Streams metrics
• Tuning
Size the number of partitions according to your query metrics
Mind your time to recovery: max throughput capacity should be at least 3x the average
Add query caching that covers your average query time, and no more, to maximize consistency
Make sure you use a shard-aware client for Scylla
21. Use case #2
Scylla "most innovative use case" award-winning Synapse platform
Real time user segmentation
Kafka to the rescue in overcoming large partitions on Scylla for an OLAP statistical workload
22. Use case #2: Synapse platform
[Diagram] Numberly's web tracking → Synapse services (configuration, business rules, calculation, segmentation store, distribution) → Partners
23. Kafka & Scylla: a complementary match
Where we chose Scylla over native Kafka
● Large number of tables with different sizes
○ Would have required 10,000+ topics if compacted topics were used instead of Scylla
● TTL management on Kafka compacted topics adds custom processing logic and complexity
○ Propagating Scylla expired-data events still adds complexity
○ We crave expiration events in CDC (https://github.com/scylladb/scylla/issues/8380)
● Leverage Scylla's low-latency reads to consume or enrich data at scale
Where Kafka saved the day for Scylla
● Computing real-time stats on high-cardinality data generated large partitions on Scylla (see the models sketched below)
○ A user (partition key) is part of multiple segments (clustering key) = counting OK
○ A segment (partition key) has a great many users (clustering key) = large partition = counting KO
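To illustrate the two partition shapes described above, a minimal CQL sketch (hypothetical keyspace, table and column names, not Numberly's actual schema):
cqlsh> -- Per-user counting: one small partition per user, segments as clustering rows = counting OK
cqlsh> CREATE TABLE synapse.user_segments (user_id text, segment_id text, PRIMARY KEY (user_id, segment_id));
cqlsh> -- Per-segment counting: a popular segment packs millions of users into one partition = large partition = counting KO
cqlsh> CREATE TABLE synapse.segment_users (segment_id text, user_id text, PRIMARY KEY (segment_id, user_id));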
24. Use case #2: takeaways
Define your table models to suit your queries
Forecast data volume for your model before using it
• Will it fit at scale in the technology you plan to use?
Mind large partitions on Scylla, as they can damage your cluster's performance
Kafka streams are great for on-the-fly aggregations
Sink your aggregated data to an external store to address lookups over multiple time spans
• Interactive queries = hot real time
27. Change Data Capture (CDC)
Records the history of changes made to your database.
• Asynchronously readable by downstream consumers.
• Available since Scylla Open Source 4.0, and now in Scylla Enterprise 2021.1.1
28. Use cases
• Applications propagating state across microservices, for use cases like IoT, retail, security, fraud detection, and customer 360
• ETL
• Integrations, migrations and streaming transformations
• Alerting and monitoring
29. CDC in Scylla: enabled per table
• Single CDC log table per enabled table (see the example below)
• CDC log is co-located with the base table
• Partitioning matches the base table
• Mirrored columns for preimage/delta records
• Every column record contains information about the modification operation and TTL
• Rows ordered by operation timestamp and batch sequence
• CDC data is TTL'd to 24h (configurable)
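For reference, CDC is enabled through a plain CQL table option; a minimal sketch against the ks.t demo table used later in this deck (the preimage and ttl sub-options are optional, shown here as an assumption of typical usage):
cqlsh> ALTER TABLE ks.t WITH cdc = {'enabled': true, 'preimage': true, 'ttl': 86400};
cqlsh> -- Scylla then maintains a ks.t_scylla_cdc_log table alongside the base table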
30. Scylla's CDC write path
+ The coordinator creates the CDC log write and piggybacks it on the base table write
+ Both go to the same replica nodes
+ While the data size written is larger, the number of write requests does not change
[Diagram] INSERT INTO base_table(...) → CQL write + CDC write
31. CDC log rows
• Each mutation event generates one or more rows
Row keys
Changes per non-key column (delta) – optional
Pre-image (prior state) – optional
Post-image (current state of row) – optional
• CDC log writes use the same consistency level as base writes
Same data guarantees
32. Consume CDC streams, a.k.a. the read path
• CDC data is available through normal CQL (see the query sketch below)
Easy to read raw streams
Already de-duplicated
All delta and pre-image values are normal CQL data
Can consume without knowledge of server internals
• Layered approach
CDC core functionality is relatively simple, which allows for more sophisticated adaptors
■ Push models etc.
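As a sketch of the raw read path, CDC rows can be queried per stream with plain CQL; the stream ID and timestamp below are placeholders, and ks.t is the demo table used later in this deck:
cqlsh> SELECT "cdc$stream_id", "cdc$time", "cdc$operation", pk, ck, v
       FROM ks.t_scylla_cdc_log
       WHERE "cdc$stream_id" = 0xc72400000000000045715fd9dc0004c1  -- one stream = one log partition
       AND "cdc$time" > maxTimeuuid('2021-01-01 00:00:00+0000');   -- only changes after a point in time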
33. Consume CDC streams, a.k.a. the read path (continued)
+ CDC data is grouped into streams
+ Streams divide the token ring space
+ Each stream represents a tokenization "slot" in the current topology
+ The stream ID is the log table's partition key
+ The stream for a given write is chosen based on the base table PK's tokenization
+ CDC is also the basis for Alternator Streams (DynamoDB API)
34. CDC in Scylla
+ Easy to integrate and consume
+ Plain CQL tables
+ Robust
+ Replicated the same way as the base data
+ Reasonable overhead
+ Writes and reads coalesced to the same replica ranges
+ Overhead is comparable to adding/reading from an extra table
+ Does not overflow if a consumer fails to act
+ Data is TTL'd
38. Confluent
Our goal is to create an event streaming platform and put it at the heart of every company. We do this with a platform that builds on Apache Kafka, available on-prem and in Confluent Cloud.
39. Writing to Kafka
[Diagram] Writing to a partitioned topic (Partition 0, Partition 1, Partition 2)
40. Reading from Kafka
[Diagram] A partitioned topic (Partition 0, Partition 1, Partition 2) read by Consumer A and Consumer B
41. Reading from Kafka (continued)
[Diagram] A partitioned topic (Partition 0, Partition 1, Partition 2) read by multiple instances of Consumer A sharing the partitions, alongside Consumer B
42. Kafka Connect
44. Kafka, Confluent, and Scylla
The Scylla Source connector for Kafka is built on open source Debezium
debezium.io
45. Source Connector
46. Sink Connector
47. Syncing Scylla Clusters with Kafka
Use the Source and Sink connectors to exchange data between separate Scylla clusters
48. How it Works
1. Set up a Scylla Table with CDC
cqlsh> CREATE KEYSPACE ks WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 1};
cqlsh> CREATE TABLE ks.t(pk int, ck int, v int, PRIMARY KEY(pk, ck)) WITH cdc = {'enabled': true};
49. How it Works
2. Configure the Kafka Scylla CDC Connector
name=ScyllaCDCConnector
connector.class=com.scylladb.cdc.debezium.connector.ScyllaConnector
scylla.name=MyCluster
scylla.cluster.ip.addresses=127.0.0.2:9042
scylla.table.names=ks.t
tasks.max=10
transforms=unwrap
transforms.unwrap.type=io.debezium.transforms.ExtractNewRecordState
transforms.unwrap.drop.tombstones=false
transforms.unwrap.delete.handling.mode=none
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
key.converter.schemas.enable=true
value.converter.schemas.enable=true
heartbeat.interval.ms=1000
auto.create.topics.enable=true
50. How it Works
3. Test the Connector
cqlsh> INSERT INTO ks.t(pk, ck, v) VALUES (1, 5, 10);
cqlsh> INSERT INTO ks.t(pk, ck, v) VALUES (2, 6, 12);
51. How it Works
4. How it looks in the CDC log table
cqlsh> SELECT * FROM ks.t_scylla_cdc_log;
cdc$stream_id | cdc$time | cdc$batch_seq_no | cdc$deleted_v | cdc$end_of_batch | cdc$operation | cdc$ttl | ck | pk | v
------------------------------------+--------------------------------------+------------------+---------------+------------------+---------------+---------+----+----+----
0xc72400000000000045715fd9dc0004c1 | a2130246-4048-11eb-5b81-9b458669aa11 | 0 | null | True | 2 | null | 5 | 1 | 10
0xd049555555555556e69dc1b6b4000581 | a6723136-4048-11eb-a309-3e76e3b340e7 | 0 | null | True | 2 | null | 6 | 2 | 12
52. How it Works
5. The connector correctly replicates the change as JSON:
Kafka message number 1 (key):
{
"schema": {
"type": "struct",
"fields": [
{
"type": "int32",
"optional": true,
"field": "ck"
},
{
"type": "int32",
"optional": true,
"field": "pk"
}
],
"optional": false,
"name": "ks.t.Key"
},
"payload": {
"ck": 5,
"pk": 1
}
}
Kafka message number 1 (value):
{
"schema": {
"type": "struct",
"fields": [
{
"type": "int32",
"optional": true,
"field": "ck"
},
{
"type": "int32",
"optional": true,
"field": "pk"
},
{
"type": "struct",
"fields": [
{
"type": "int32",
[*snip* Etc.]
53. Deltas Only... for now
• The connector currently only provides delta operations
• Preimage and postimage will be added in the future
• These will match nicely with the "before" & "after" fields of Debezium
54. Confluent Developer
developer.confluent.io
Learn Kafka!