Vitalii Bondarenko "Fast Data Platform for Real-Time Analytics. Architecture And Approaches."

Vitalii Bondarenko
Eleks
Fast Data Platform for Real-Time Analytics. Architecture And Approaches

Agenda
• Fast Data vs Big Data
• Kafka overview
• Cassandra Internals and Programming
• Architectures and Approaches
• Lessons learned

Big Data Approach
RDBMS Approach
• Massively Parallel Processing (scalability)
• In-memory DB (streaming and compression)
• Column stores (BI)
Big Data Approach
• Hadoop (HDFS + MapReduce)
• SQL on HDFS
• Scalable NoSQL
• Batch issue

HDFS (Hadoop Distributed File System)
• Data is split into blocks and distributed across the nodes
• Nodes are cheap
• Block size is 64 or 128 MB
• Replication
• Files are typically not updated
• Data is read from the beginning to the end
• Smaller number of larger files
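
A minimal sketch of the block arithmetic behind these bullets. The 128 MB block size and replication factor of 3 are assumed common defaults, and `hdfs_footprint` is a hypothetical helper, not part of any Hadoop API.

```python
import math


def hdfs_footprint(file_size_mb, block_size_mb=128, replication=3):
    """Estimate how one file is laid out in HDFS (illustrative only)."""
    # A file is cut into fixed-size blocks; the last block may be partial.
    blocks = math.ceil(file_size_mb / block_size_mb)
    return {
        "blocks": blocks,
        # Every block is stored `replication` times on different nodes.
        "block_replicas": blocks * replication,
        "raw_storage_mb": file_size_mb * replication,
    }


# One 1000 MB file needs far fewer blocks (and less metadata) than
# 1000 files of 1 MB each, which is why fewer, larger files are preferred.
one_large = hdfs_footprint(1000)
many_small_blocks = sum(hdfs_footprint(1)["blocks"] for _ in range(1000))
```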

Lambda architecture
Batch & Stream Processing
• Batch layer
  • Stores the master dataset
  • Computes arbitrary views
  • Horizontally scalable
• Speed layer (streaming)
  • Fast, incremental algorithms
  • Batch layer eventually overrides the speed layer
• Serving layer
  • Random access to batch views
  • Updated by the batch and streaming layers
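
The interplay of the three layers can be sketched in a few lines. This is a toy page-view counter under assumed names (`LambdaCounts`, `ingest`, `run_batch`, `query` are all hypothetical); a real deployment would use separate batch and streaming engines.

```python
from collections import Counter


class LambdaCounts:
    """Toy Lambda architecture: counts events per page."""

    def __init__(self):
        self.master = []             # batch layer: append-only master dataset
        self.batch_view = Counter()  # view recomputed from the master dataset
        self.speed_view = Counter()  # speed layer: incremental, recent data only

    def ingest(self, page):
        self.master.append(page)
        self.speed_view[page] += 1   # fast incremental update

    def run_batch(self):
        # Recompute the view from scratch over the whole master dataset.
        self.batch_view = Counter(self.master)
        # The batch result now covers everything, so it overrides the speed layer.
        self.speed_view.clear()

    def query(self, page):
        # Serving layer: merge the batch view with the speed view.
        return self.batch_view[page] + self.speed_view[page]
```

Queries stay correct between batch runs because the speed view covers only the events the last batch run has not yet seen.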

Kappa architecture
Stream Processing with Scalable Storages
• Everything is a stream
• Immutable unstructured data sources
• Single analytics framework
• Windows on Streaming Layer
• Linearly scalable Serving Layer
• Interactive querying
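
As an illustration of windows on the streaming layer, here is a minimal tumbling-window count in plain Python. The function name and the `(timestamp, key)` event shape are assumptions made for the sketch; stream engines provide this as a built-in operator.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_seconds):
    """Count events per key in fixed, non-overlapping time windows.

    `events` is an iterable of (timestamp_seconds, key) pairs.
    """
    counts = defaultdict(int)
    for ts, key in events:
        # Align the timestamp to the start of its window.
        window_start = (ts // window_seconds) * window_seconds
        counts[(window_start, key)] += 1
    return dict(counts)


sample = [(0, "a"), (5, "a"), (10, "b"), (12, "a"), (19, "a")]
windows = tumbling_window_counts(sample, window_seconds=10)
```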

Azure Stream Analytics
• Easy to use
• Scalable
• Connectivity
• SQL, UDF, Reference Data

Fast Data Platform
• Real-time processing
• In-memory analytics
• Raw data fast writing
• Scalable
[Diagram components: Fog Computing / Service Bus; Kafka Cluster with Connectors (Source), Stream Tasks, and Connectors (Sink); Hadoop Cluster (cheap data storage); BI Tools; write-heavy Cassandra Cluster (stream analytics, raw data) and analytics Cassandra Cluster, linked by Cassandra Replication, each with Spark on nodes 1 to n]

Apache Kafka
• From LinkedIn; open-sourced in 2011, Apache top-level project since 2012
• Service Bus
• Small messages (events)
• Scalable Broker System
• Durable and Distributed
• Very fast (parallelism on partitions)
• Messages are not removed from the queue when read; they expire by retention
• Stream processing capability
LinkedIn:
• 1400 brokers
• 13M+ messages/sec
• 2.75GB per second

Brokers, Topics
• Distributed Service Bus
• Broker as virtual servers
• Topics as logical data storage

Writes and reads
• Append Only
• Commit log
• Consumer offset (can start from the beginning)
• Read positions are committed to the Kafka topic __consumer_offsets
• Offsets are committed after data is read
• Retention period: 7 days (default)
• Parallel read with consumer groups
• Confluent Platform
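
The append-only log and per-group offsets can be modeled without a broker. This is an in-memory sketch, not the Kafka client API: the `Topic` class and its methods are hypothetical, and CRC32 stands in for Kafka's real key partitioner.

```python
import zlib


class Topic:
    """In-memory model of a partitioned, append-only topic."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]
        self.committed = {}  # (group, partition) -> next offset to read

    def append(self, key, value):
        # Same key always lands in the same partition (CRC32 is a stand-in hash).
        p = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[p].append(value)  # append only, never updated in place
        return p, len(self.partitions[p]) - 1

    def poll(self, group, partition, max_records=10):
        # A group with no committed offset starts from the beginning.
        start = self.committed.get((group, partition), 0)
        return start, self.partitions[partition][start:start + max_records]

    def commit(self, group, partition, next_offset):
        # Reading does not remove records; only the group's position moves.
        self.committed[(group, partition)] = next_offset
```

Because each consumer group keeps its own offsets, a second group can re-read the same records independently until retention expires them.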

ML Cluster on Streaming Data
[Diagram with three columns: Data Sources, Enterprise Analytic Suite, Data Destination]
• Data sources: Clickstream, Shopping Behaviour, Orders, Purchases, Inventory, Routes, Catalog, Prices, Campaigns, Customers, Locations; Data Lake, Fast Data, ESB, DWH
• Enterprise Analytic Suite: Service Bus (Kafka); Machine Learning Cluster on Kubernetes / OpenShift with Kafka Streams API, data connectors and mapping, and Docker containers for Capacity Planning, Product Recommendation, Customer Segmentation, Price Recommendation, Route Optimization, Demand Prediction
• Also shown: Cost Optimization, Profit Maximization, Utility Maximization
• Data destinations: Data Lake, Fast Data, ESB, DWH, Data Visualization, Dashboard

Apache Cassandra
• Multi-master, low-latency, shared nothing
• Distributed
• No single point of failure
• Linearly Scalable
• Multi-datacenter configuration
• AP with tunable consistency

Nodes and distributions
• Data centers and racks; gossip (every 1 sec)
• Data is distributed by tokens from -2^63 to 2^63-1
• Token is a Murmur3 hash of the partition key
• Virtual Nodes
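
A minimal sketch of token-based placement with virtual nodes. Murmur3 is not in the Python standard library, so the first 8 bytes of MD5 are used here as a stand-in hash; the `Ring` class and the `node#i` naming of virtual nodes are assumptions for illustration.

```python
import bisect
import hashlib


def token(partition_key):
    """Map a key to a signed 64-bit token in [-2**63, 2**63 - 1].

    MD5 is a stand-in; Cassandra's default partitioner uses Murmur3.
    """
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big", signed=True)


class Ring:
    def __init__(self, nodes, vnodes=8):
        # Each physical node claims several tokens (virtual nodes) on the ring.
        self._ring = sorted(
            (token(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._tokens = [t for t, _ in self._ring]

    def owner(self, partition_key):
        # The owner is the first vnode whose token is >= the key's token,
        # wrapping around to the start of the ring if there is none.
        idx = bisect.bisect_left(self._tokens, token(partition_key))
        return self._ring[idx % len(self._ring)][1]


ring = Ring(["node-a", "node-b", "node-c"])
owner_of_key = ring.owner("machine-42")
```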

Coordinator Nodes
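• Replication strategy (SimpleStrategy, NetworkTopologyStrategy)
• Replication factor (usually 3)
• Consistency levels (One, Two, Three, Any, All, Quorum, Local_Quorum, Local_One…)
• Tunable consistency, strong and eventual
• Strong consistency when (R + W) > N, where R and W are the numbers of replicas that must acknowledge a read and a write, and N is the replication factor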
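
The consistency rule is plain arithmetic and can be checked directly. Here R and W are taken to be the number of replicas that must answer a read and a write, and N the replication factor; the helper names are hypothetical.

```python
def quorum(replication_factor):
    """Smallest majority of replicas."""
    return replication_factor // 2 + 1


def is_strongly_consistent(r, w, n):
    """True when every read set must overlap every write set."""
    return r + w > n


n = 3
q = quorum(n)                                        # 2 of 3 replicas
quorum_both = is_strongly_consistent(q, q, n)        # QUORUM reads and writes
one_both = is_strongly_consistent(1, 1, n)           # ONE reads and writes
read_one_write_all = is_strongly_consistent(1, n, n) # ONE reads, ALL writes
```

With a replication factor of 3, QUORUM on both sides gives overlap (2 + 2 > 3), while ONE on both sides is only eventually consistent.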

Cassandra Objects
• Column, which is a name/value pair
• Row, which is a container for columns referenced by a primary key
• Table, which is a container for rows
• Keyspace, which is a container for tables
• Cluster, which is a container for keyspaces that spans one or more nodes
CQL (Cassandra Query Language)
-- Partition key: (machine, cpu); clustering column: mtime
CREATE TABLE loads (
    machine inet,
    cpu int,
    mtime timeuuid,
    load float,
    PRIMARY KEY ((machine, cpu), mtime)
) WITH CLUSTERING ORDER BY (mtime DESC);  -- newest rows first within a partition
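
To show what the primary key of the `loads` table implies, here is a toy in-memory model: rows are grouped by the partition key `(machine, cpu)` and returned newest first by `mtime`. Plain integers stand in for `timeuuid`, the `LoadsTable` class is hypothetical, and sorting happens on read only for brevity (Cassandra keeps rows ordered on disk).

```python
from collections import defaultdict


class LoadsTable:
    """Toy model of PRIMARY KEY ((machine, cpu), mtime), mtime DESC."""

    def __init__(self):
        # partition key -> {clustering key (mtime) -> load}
        self._partitions = defaultdict(dict)

    def insert(self, machine, cpu, mtime, load):
        # Same primary key overwrites the old value: an insert is an upsert.
        self._partitions[(machine, cpu)][mtime] = load

    def latest(self, machine, cpu, limit):
        # Efficient queries name the full partition key, then read in
        # clustering order (newest mtime first).
        rows = self._partitions.get((machine, cpu), {})
        return sorted(rows.items(), reverse=True)[:limit]


table = LoadsTable()
table.insert("10.0.0.1", 0, 1, 0.5)
table.insert("10.0.0.1", 0, 3, 0.9)
table.insert("10.0.0.1", 0, 2, 0.7)
table.insert("10.0.0.1", 1, 5, 0.1)
newest_two = table.latest("10.0.0.1", 0, limit=2)
```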

Internals
• Memtable: in-memory write buffer, one per CQL table
• Commit log: records all writes for recovery
• SSTables: immutable on-disk data files
• Key cache: caches the location of partition keys
• Row cache: speeds up read access
• Hints: writes stored for nodes that were down
• Tombstones: markers for deleted rows
• TTL: expires rows automatically
• Updates are inserts, inserts are updates (upsert)
• Compaction: merges SSTables
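
How these pieces fit together on the write path can be sketched with a tiny log-structured store. This is a simplified model under stated assumptions: `TinyLSM` is hypothetical, newer tables win by position rather than by write timestamp, and tombstones are dropped immediately at compaction, whereas Cassandra keeps them for a grace period.

```python
TOMBSTONE = object()  # marker written in place of a deleted value


class TinyLSM:
    def __init__(self, memtable_limit=2):
        self.commit_log = []   # durable record of writes not yet flushed
        self.memtable = {}     # in-memory buffer of recent writes
        self.sstables = []     # immutable flushed tables, oldest first
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.commit_log.append((key, value))
        self.memtable[key] = value  # insert and update are the same operation
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def delete(self, key):
        self.write(key, TOMBSTONE)  # a delete is just another write

    def flush(self):
        if self.memtable:
            self.sstables.append(dict(self.memtable))
            self.memtable = {}
            self.commit_log.clear()  # flushed data no longer needs the log

    def read(self, key):
        # Newest data first: memtable, then SSTables from newest to oldest.
        for table in [self.memtable] + self.sstables[::-1]:
            if key in table:
                value = table[key]
                return None if value is TOMBSTONE else value
        return None

    def compact(self):
        merged = {}
        for table in self.sstables:  # oldest first, so newer values overwrite
            merged.update(table)
        merged = {k: v for k, v in merged.items() if v is not TOMBSTONE}
        self.sstables = [merged] if merged else []
```

Until compaction runs, overwritten values and tombstones still occupy space in older SSTables, which is one reason disk usage can exceed the size of the live data.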

CQL (Cassandra Query Language)
• Similar to SQL
• No Joins
• Keyspaces with replication factor
• Inserts vs Updates
• TTL:
INSERT INTO myTable (id, myField) VALUES (2, 9) USING TTL 86400; /* 24h */
• DELETE is an INSERT (of a tombstone)
• Ordering and filtering are restricted (always query by partition key)
/* Select data within a range */
SELECT * FROM myTable WHERE myField > 5000 AND myField < 100000;
Bad Request: Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING.

Data Modeling: Query-First Design

Cassandra is not
• A Data Lake
• A Data Ocean
• A Data Pond
• A Data Warehouse
• An In-memory Database
• A Key-value store
• A magic database unicorn that farts rainbows

Cassandra in ML Cluster
[Diagram components]
• Producers: Crawl/Fetch Apps collecting events from the Internet as unstructured data (text) and structured data
• Kafka Cluster: Confluent Schema Registry, Confluent Connectors; raw data stream, structured data stream, scored data stream
• Kubernetes Cluster: Transformation (Kafka Streams API) with transformation rules in Redis; ML Scoring (Kafka Streams API) with trained models in Redis; REST API (Flask) with trained models (Python); classification, forecasting, clustering
• Data Center, Cassandra + Spark Cluster: unstructured and structured data, operational data, Spark SQL
• Hadoop Cluster: historical data; HDFS, Spark, Hive
• Scalable DWH: ML results; Redshift / Postgres XL
• Model training framework (Python, Anaconda, Jupyter)
• Dashboards (Pentaho), BI tools (Power BI)
• Enterprise integration: ESB (WSO2), enterprise applications, ML results, scored data

Cluster and results
Kafka Cluster
• Nodes: 6
• Amazon instance type: m4.2xlarge
• CPU: 8
• Memory: 32 GB
• SSD: 100 GB
• Topics: 3
• Partitions: 6
• Replication factor: 3
• Producers: 6
• Average message size: 1 KB
• Throughput: 440,000 messages / second
Cassandra Cluster
• Nodes: 12
• Amazon instance type: m4.2xlarge
• CPU: 8
• Memory: 32 GB
• SSD: 800 GB
• Replication factor: 3
• Average write latency: 9 ms
• Average read latency: 52 ms

Lessons learned
Design your DB carefully at the beginning, around the queries
Cassandra is not an RDBMS; select by partition keys
Deep understanding of internals is required
Compaction is hell
Eventual consistency
Where is my disk space?
Very Expensive! (lots of nodes)

www.eleks.com
Q&A
Vitalii Bondarenko
vitaliy.bondarenko@eleks.com
