This document discusses building event streaming architectures with Scylla and Confluent Kafka. It gives an overview of Scylla and how it is used alongside Kafka at Numberly, then covers change data capture (CDC) in Scylla and how to stream data from Scylla to Kafka using Kafka Connect and the Scylla source connector. The Kafka Connect framework and connectors capture changes from Scylla tables into Kafka topics to power downstream applications and tasks.
2. Presenters
Tim Berglund, Senior Director of Developer Advocacy
Alexys Jacob, CTO
Maheedhar Gunturu, Director of Technical Alliances
Othmane El Metioui, Chief Data Officer
3. Agenda
ᐩ Brief Intro to Scylla
ᐩ Scylla + Kafka at Numberly
ᐩ Change Data Capture in Scylla
ᐩ Streaming Data from Scylla to Kafka
4. About ScyllaDB
• Reimagined the NoSQL database
• Close-to-the-hardware design, written in C++
• Open source, enterprise & DBaaS
• From the creators of the KVM hypervisor
Winner of the InfoWorld Technology of the Year award
5. Grows with your business & your data
– Volume – Multi-petabyte
– Throughput – 1 billion OPS
– Horizontal Scalability – 1,000-node cluster
– Availability – 1 to 10+ replicas within a datacenter
– Consistent Latencies – Low single-digit millisecond p99s
– Vertical Scalability – 1 to 416 vCPUs
– Unlimited – Cell sizes and partition width
– Consistency Options – Eventual consistency to linearizability
7. Deployment options
On-Prem: Install in Your Datacenter
➔ Scylla Open Source
➔ Scylla Enterprise
➔ AWS Outposts
Cloud Hosted: Deploy at a Cloud Provider
➔ Scylla Open Source
➔ Scylla Enterprise
Scylla Cloud: Database as a Service
➔ Fully managed Scylla clusters
➔ Bring Your Own Account (BYOA) option
Kubernetes: Run on Kubernetes
➔ Manage with Scylla Operator
9. At Numberly, we run bare-metal clusters
Scylla
3 clusters, with multi-datacenter topology
• Staging
• Production web facing
• Production OLAP+OLTP
• RF=3 per DC
DELL hardware
• RAID0 NVMe
• Up to 96 AMD cores per node
• Up to 512GB RAM per node
Confluent Kafka
2 clusters, with active-active multi-datacenter topology
• Staging
• Production
DELL hardware
• 6 brokers: 12TB SSD (RAID0), 2x 24 cores, 64GB RAM
• 12 other nodes: Connect cluster, Schema Registry, Zookeepers...
10. Scylla Cloud & Confluent Cloud
TL;DR: The people behind the technology know better!
Cloud hosted solutions should be considered depending on your infrastructure maturity and hosting constraints.
Our experience shows that cloud providers such as AWS always lag behind on versions and provide poor monitoring & alerting capabilities.
12. Combining Scylla and Confluent Kafka powers
Scylla
• Scylla Manager
• Scylla Monitoring
• Easy data expiration (TTL) on large time windows (6+ months)
Confluent Kafka
• Kafka Connect & Exporter
• Schema Registry
• KSQL
• Home-made control center interface + Grafana
Started with in-house Kafka Streams and Python pipelines to propagate data changes between Scylla & Kafka
13. Combining Scylla and Confluent Kafka powers
The Confluent certified CDC connector will simplify our pipelines!
14. Scylla + Kafka at Numberly
Scylla is used as a low-latency remote state store providing easy data expiry capabilities to Kafka streams and pipelines (in & out)
15. Use case #1
Data pipeline enrichment
Scylla to the rescue in overcoming a JOIN window too large for Kafka
16. Use case #1: how we did it before
[Diagram] Numberly's web tracking → RabbitMQ exchange → beanstalkd → Python programs (write + read) ↔ Scylla (13+ months retention: high-throughput writes + low-latency reads, expiring data)
17. Use case #1: our first attempt
[Diagram] Numberly's web tracking → Kafka streams (write) → compacted topic → Kafka streams (read, KTable) → Kafka Connect → redis
18. Scaling limitations of Kafka JOIN windows
• The retention of our source data enriched from Scylla is long (13+ months)
Data set sizes average 150+GB per table, totaling 1.2+TB of source data
• Multiple successive JOINs are heavy on Kafka for large datasets
Large RocksDB state stores caused memory issues and Kubernetes pod OOM kills
Rebuilding the state store after a Kafka Streams (pod) restart took too long
Standby replicas come at a cost for large state stores
We turned to Scylla to be a remote, highly available, distributed state store (see the sketch below)!
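To make the remote state store pattern concrete, here is a minimal CQL sketch of such a table; the keyspace, table and column names are hypothetical (not Numberly's actual schema), and the default TTL matches the 13+ months retention window so expiry is handled by Scylla rather than by the pipeline:
cqlsh> CREATE TABLE enrichment.user_events (   -- hypothetical names, for illustration
           user_id text,                       -- partition key: low-latency point reads
           event_time timestamp,
           payload text,
           PRIMARY KEY (user_id, event_time)
       ) WITH CLUSTERING ORDER BY (event_time DESC)
       AND default_time_to_live = 34128000;    -- ~13 months (395 days): rows expire automatically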
19. Use case #1: how we do it today
[Diagram] Numberly's web tracking → Kafka streams (write) ↔ Scylla (13+ months retention: high-throughput writes + low-latency reads, expiring data) ↔ Kafka streams (read)
20. Use case #1: takeaways
• Metrics
Metrics are important for successful tuning (query response times, dataset size)
Use the Prometheus client instead of implementing Kafka Streams metrics
• Tuning
Size the number of partitions according to your query metrics
Mind your time to recovery: max throughput capacity should be at least 3x the average
Add query caching that covers your average query time, and no more, to maximize consistency
Make sure you use a shard-aware client for Scylla
21. Use case #2
Scylla "most innovative use case" award-winning Synapse platform
Real time user segmentation
Kafka to the rescue in overcoming large partitions on Scylla for an OLAP statistical workload
22. Use case #2: Synapse platform
[Diagram] Numberly's web tracking → Synapse services (configuration, business rules, calculation, segmentation store, distribution) → Partners
23. Kafka & Scylla: a complementary match
Where we chose Scylla over native Kafka
● Large number of tables with different sizes
○ Would have required 10,000+ topics if compacted topics were used instead of Scylla
● TTL management on Kafka compacted topics adds custom processing logic and complexity
○ Propagating Scylla expired-data events still adds complexity
○ We crave expiration events in CDC (https://github.com/scylladb/scylla/issues/8380)
● Leverage Scylla's low-latency reads to consume or enrich data at scale
Where Kafka saved the day for Scylla
● Computing real-time stats on high-cardinality data generated large partitions on Scylla (see the models sketched below)
○ A user (partition key) is part of multiple segments (clustering key) = counting OK
○ A segment (partition key) has a great many users (clustering key) = large partition = counting KO
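To illustrate the two partition shapes described above, a minimal CQL sketch (hypothetical keyspace, table and column names, not Numberly's actual schema):
cqlsh> -- Per-user counting: one small partition per user, segments as clustering rows = counting OK
cqlsh> CREATE TABLE synapse.user_segments (user_id text, segment_id text, PRIMARY KEY (user_id, segment_id));
cqlsh> -- Per-segment counting: a popular segment packs millions of users into one partition = large partition = counting KO
cqlsh> CREATE TABLE synapse.segment_users (segment_id text, user_id text, PRIMARY KEY (segment_id, user_id));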
24. Use case #2: takeaways
Define your table models to suit your queries
Forecast data volume for your model before using it
• Will it fit at scale in the technology you plan to use?
Mind large partitions on Scylla, as they can damage your cluster's performance
Kafka streams are great for on-the-fly aggregations
Sink your aggregated data to an external store to address lookups over multiple time spans
• Interactive queries = hot real time
27. Change Data Capture (CDC)
Records the history of changes made to your database.
• Asynchronously readable by downstream consumers.
• Available since Scylla Open Source 4.0, and now in Scylla Enterprise 2021.1.1
28. Use cases
• Applications propagating state across microservices, for use cases like IoT, retail, security, fraud detection, and customer 360
• ETL
• Integrations, migrations and streaming transformations
• Alerting and monitoring
29. CDC in Scylla: enabled per table
• Single CDC log table per enabled table (see the example below)
• CDC log is co-located with the base table
• Partitioning matches the base table
• Mirrored columns for preimage/delta records
• Every column record contains information about the modification operation and TTL
• Rows ordered by operation timestamp and batch sequence
• CDC data is TTL'd to 24h (configurable)
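For reference, CDC is enabled through a plain CQL table option; a minimal sketch against the ks.t demo table used later in this deck (the preimage and ttl sub-options are optional, shown here as an assumption of typical usage):
cqlsh> ALTER TABLE ks.t WITH cdc = {'enabled': true, 'preimage': true, 'ttl': 86400};
cqlsh> -- Scylla then maintains a ks.t_scylla_cdc_log table alongside the base table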
30. Scylla's CDC write path
+ The coordinator creates the CDC log write and piggybacks it on the base table write
+ Both go to the same replica nodes
+ While the data size written is larger, the number of write requests does not change
[Diagram] INSERT INTO base_table(...) → CQL write + CDC write
31. CDC log rows
• Each mutation event generates one or more rows
Row keys
Changes per non-key column (delta) – optional
Pre-image (prior state) – optional
Post-image (current state of row) – optional
• CDC log writes use the same consistency level as base writes
Same data guarantees
32. Consume CDC streams, a.k.a. the read path
• CDC data is available through normal CQL (see the query sketch below)
Easy to read raw streams
Already de-duplicated
All delta and pre-image values are normal CQL data
Can consume without knowledge of server internals
• Layered approach
CDC core functionality is relatively simple, which allows for more sophisticated adaptors
■ Push models etc.
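As a sketch of the raw read path, CDC rows can be queried per stream with plain CQL; the stream ID and timestamp below are placeholders, and ks.t is the demo table used later in this deck:
cqlsh> SELECT "cdc$stream_id", "cdc$time", "cdc$operation", pk, ck, v
       FROM ks.t_scylla_cdc_log
       WHERE "cdc$stream_id" = 0xc72400000000000045715fd9dc0004c1  -- one stream = one log partition
       AND "cdc$time" > maxTimeuuid('2021-01-01 00:00:00+0000');   -- only changes after a point in time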
33. Consume CDC streams, a.k.a. the read path (continued)
+ CDC data is grouped into streams
+ Streams divide the token ring space
+ Each stream represents a tokenization "slot" in the current topology
+ The stream ID is the log table's partition key
+ The stream for a given write is chosen based on the base table PK's tokenization
+ CDC is also the basis for Alternator Streams (DynamoDB API)
34. CDC in Scylla
+ Easy to integrate and consume
+ Plain CQL tables
+ Robust
+ Replicated the same way as the base data
+ Reasonable overhead
+ Writes and reads coalesced to the same replica ranges
+ Overhead is comparable to adding/reading from an extra table
+ Does not overflow if a consumer fails to act
+ Data is TTL'd
38. Confluent
Our goal is to create an event streaming platform and put it at the heart of every company. We do this with a platform that builds on Apache Kafka, available on-prem and in Confluent Cloud.
39. Writing to Kafka
[Diagram] Writing to a partitioned topic (Partition 0, Partition 1, Partition 2)
40. Reading from Kafka
[Diagram] A partitioned topic (Partition 0, Partition 1, Partition 2) read by Consumer A and Consumer B
41. Reading from Kafka (continued)
[Diagram] A partitioned topic (Partition 0, Partition 1, Partition 2) read by multiple instances of Consumer A sharing the partitions, alongside Consumer B
42. Kafka Connect
44. Kafka, Confluent, and Scylla
The Scylla Source connector for Kafka is built on open source Debezium
debezium.io
45. Source Connector
46. Sink Connector
47. Syncing Scylla Clusters with Kafka
Use the Source and Sink connectors to exchange data between separate Scylla clusters
48. How it Works
1. Set up a Scylla Table with CDC
cqlsh> CREATE KEYSPACE ks WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 1};
cqlsh> CREATE TABLE ks.t(pk int, ck int, v int, PRIMARY KEY(pk, ck)) WITH cdc = {'enabled': true};
49. How it Works
2. Configure the Kafka Scylla CDC Connector
name=ScyllaCDCConnector
connector.class=com.scylladb.cdc.debezium.connector.ScyllaConnector
scylla.name=MyCluster
scylla.cluster.ip.addresses=127.0.0.2:9042
scylla.table.names=ks.t
tasks.max=10
transforms=unwrap
transforms.unwrap.type=io.debezium.transforms.ExtractNewRecordState
transforms.unwrap.drop.tombstones=false
transforms.unwrap.delete.handling.mode=none
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
key.converter.schemas.enable=true
value.converter.schemas.enable=true
heartbeat.interval.ms=1000
auto.create.topics.enable=true
50. How it Works
3. Test the Connector
cqlsh> INSERT INTO ks.t(pk, ck, v) VALUES (1, 5, 10);
cqlsh> INSERT INTO ks.t(pk, ck, v) VALUES (2, 6, 12);
51. How it Works
4. How it looks in the CDC log table
cqlsh> SELECT * FROM ks.t_scylla_cdc_log;
cdc$stream_id | cdc$time | cdc$batch_seq_no | cdc$deleted_v | cdc$end_of_batch | cdc$operation | cdc$ttl | ck | pk | v
------------------------------------+--------------------------------------+------------------+---------------+------------------+---------------+---------+----+----+----
0xc72400000000000045715fd9dc0004c1 | a2130246-4048-11eb-5b81-9b458669aa11 | 0 | null | True | 2 | null | 5 | 1 | 10
0xd049555555555556e69dc1b6b4000581 | a6723136-4048-11eb-a309-3e76e3b340e7 | 0 | null | True | 2 | null | 6 | 2 | 12
52. How it Works
5. The connector correctly replicates the change as JSON:
Kafka message number 1 (key):
{
"schema": {
"type": "struct",
"fields": [
{
"type": "int32",
"optional": true,
"field": "ck"
},
{
"type": "int32",
"optional": true,
"field": "pk"
}
],
"optional": false,
"name": "ks.t.Key"
},
"payload": {
"ck": 5,
"pk": 1
}
}
Kafka message number 1 (value):
{
"schema": {
"type": "struct",
"fields": [
{
"type": "int32",
"optional": true,
"field": "ck"
},
{
"type": "int32",
"optional": true,
"field": "pk"
},
{
"type": "struct",
"fields": [
{
"type": "int32",
[*snip* Etc.]
53. Deltas Only... for now
• The connector currently only provides delta operations
• Preimage and postimage will be added in the future
• These will match nicely with the "before" & "after" fields of Debezium
54. Confluent Developer
developer.confluent.io
Learn Kafka!