We will start with how real-time analytics can be implemented on enterprise-level infrastructure, then go into detail on how different business intelligence use cases can run in real time on streaming data. We will cover different stream data processing architectures and discuss their benefits and disadvantages. I'll show with live demos how to build a Fast Data Platform in the Azure cloud using open source projects: Apache Kafka, Apache Cassandra, Mesos. I'll also show examples and code from real projects.
2. Agenda
Fast data vs Big Data
Kafka overview
Cassandra Internals and Programming
Architectures and Approaches
Lessons learned
3. Big Data Approach
RDBMS Approach
• Massive Parallel Processing (Scalability)
• In-memory DB (Streaming and compressing)
• Column stores (BI)
Big Data Approach
• Hadoop (HDFS + MapReduce)
• SQL on HDFS
• Scalable NoSQL
• Batch issue
4. HDFS (Hadoop Distributed File System)
• Data is split into blocks and distributed across the nodes
• Nodes are cheap
• Block size is 64 or 128 MB
• Replication
• Files are typically not updated
• Data is read from the beginning to the end
• Smaller number of larger files
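The block layout described above can be sketched with a little arithmetic. This is a minimal illustration, not HDFS code: the function name `hdfs_block_layout` is hypothetical, and the replication factor of 3 is an assumption (the slide only says "Replication").

```python
import math

def hdfs_block_layout(file_size_mb, block_size_mb=128, replication=3):
    """Return (number of blocks, raw storage in MB) for a single file."""
    # A file is cut into fixed-size blocks; the last block may be partial.
    blocks = math.ceil(file_size_mb / block_size_mb)
    # A partial last block only occupies as much disk as it holds, so raw
    # storage is the file size times the number of replicas.
    raw_storage_mb = file_size_mb * replication
    return blocks, raw_storage_mb

blocks, raw_mb = hdfs_block_layout(1000)
print(f"1000 MB file: {blocks} blocks, {raw_mb} MB of raw storage")
```

Every file needs at least one block, which is one reason a smaller number of larger files is preferred over many small ones.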
6. Kappa architecture
Stream Processing with Scalable Storages
• Everything is a stream
• Immutable unstructured data sources
• Single analytics framework
• Windows on Streaming Layer
• Linearly scalable Serving Layer
• Interactive querying
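"Windows on Streaming Layer" can be illustrated with a tumbling (fixed, non-overlapping) window count. This is only a sketch in plain Python, not the API of any streaming framework; the event shape `(timestamp_seconds, key)` and the function name are assumptions made for the example.

```python
from collections import defaultdict

def tumbling_window_counts(events, window_seconds=60):
    """Count events per (window start, key) for fixed, non-overlapping windows."""
    counts = defaultdict(int)
    for timestamp, key in events:
        # Every timestamp falls into exactly one window.
        window_start = (timestamp // window_seconds) * window_seconds
        counts[(window_start, key)] += 1
    return dict(counts)

events = [(5, "click"), (30, "click"), (61, "order"), (119, "click"), (120, "click")]
for (window_start, key), count in sorted(tumbling_window_counts(events).items()):
    print(window_start, key, count)
```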
8. Fast Data Platform
[Diagram: Fast Data Platform. Components shown:]
• Fog Computing / Service Bus
• Kafka Cluster (raw data, fast writing, scalable) with Connectors (Source), Connectors (Sink) and Stream Tasks
• Real-time processing and in-memory analytics: Spark nodes 1..n
• Cassandra Cluster (write-heavy, raw data, stream analytics)
• Cassandra Replication to a second Cassandra Cluster (analytics)
• Hadoop Cluster (cheap data storage)
• BI Tools
9. Apache Kafka
• From LinkedIn, open-sourced in 2011, Apache top-level project since 2012
• Service Bus
• Small messages (events)
• Scalable Broker System
• Durable and Distributed
• Very fast (parallelism on partitions)
• Messages are not removed when read; they expire by retention
• Stream processing capability
LinkedIn:
• 1400 brokers
• 13M+ messages/sec
• 2.75GB per second
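"Parallelism on partitions" rests on each message key mapping to one partition, so partitions can be written and read independently. The sketch below shows the idea only. It uses CRC32 as a stand-in hash (Kafka's default partitioner uses murmur2), and the function names are hypothetical.

```python
import zlib

def choose_partition(key: bytes, num_partitions: int) -> int:
    # Stand-in hash for illustration; Kafka's default partitioner uses murmur2.
    return zlib.crc32(key) % num_partitions

def group_by_partition(keys, num_partitions):
    """Show which keys land on which partition."""
    partitions = {p: [] for p in range(num_partitions)}
    for key in keys:
        partitions[choose_partition(key, num_partitions)].append(key)
    return partitions

keys = [f"user-{i}".encode() for i in range(12)]
for partition, members in group_by_partition(keys, 6).items():
    print(partition, len(members))
```

Because the same key always maps to the same partition, ordering is preserved per key while different partitions are processed in parallel.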
11. Writes and reads
• Append Only
• Commit log
• Consumer offset (from beginning)
• Read positions are committed to the Kafka topic __consumer_offsets
• Commit after the data has been read
• Retention period: 7 days (default)
• Parallel read with consumer groups
• Confluent Platform
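The read and write model above can be sketched as a toy append-only log with per-group offsets. This is not the Kafka client API: the class `TopicPartition` and its methods are hypothetical, and new groups start from offset 0 to mirror the "from beginning" bullet.

```python
class TopicPartition:
    """Toy append-only log: reading never removes records."""

    def __init__(self):
        self.log = []
        self.committed = {}  # consumer group -> next offset to read

    def append(self, record):
        self.log.append(record)
        return len(self.log) - 1  # offset of the new record

    def poll(self, group, max_records=10):
        # A group without a committed offset starts from the beginning.
        start = self.committed.get(group, 0)
        return start, self.log[start:start + max_records]

    def commit(self, group, next_offset):
        self.committed[group] = next_offset


partition = TopicPartition()
for record in ["order-1", "order-2", "order-3"]:
    partition.append(record)

start, records = partition.poll("billing", max_records=2)
partition.commit("billing", start + len(records))  # commit after reading
print(partition.poll("billing"))    # continues after the committed offset
print(partition.poll("analytics"))  # an independent group sees everything
```

Each consumer group keeps its own position, which is what lets several applications read the same topic independently.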
12. ML Cluster on Streaming Data
[Diagram with three columns: Data Sources, Enterprise Analytic Suite, Data Destination. Components shown:]
• Data sources: Clickstream, Shopping Behaviour, Orders, Purchases, Inventory, Routes, Catalog, Prices, Campaigns, Customers, Locations
• Source and destination systems: Data Lake, Fast Data, ESB, DWH
• Enterprise Analytic Suite: Service Bus (Kafka), Kafka Streams API, data connectors and mapping, Machine Learning Cluster on Kubernetes / OpenShift
• ML services in Docker containers: Capacity Planning, Product Recommendation, Customer Segmentation, Price Recommendation, Route Optimization, Demand Prediction
• Goals: Cost Optimization, Profit Maximization, Utility Maximization
• Data destinations: Data Visualization, Dashboard
13. Apache Cassandra
• Multi-master, low-latency, shared nothing
• Distributed
• No single point of failure
• Linearly Scalable
• Multi-datacenter configuration
• AP with tunable consistency
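"Tunable consistency" is usually reasoned about with a simple rule: a read is guaranteed to overlap the latest acknowledged write when the number of replicas read plus the number of replicas written exceeds the replication factor. A small sketch of that arithmetic follows; the function names are hypothetical.

```python
def quorum(replication_factor: int) -> int:
    """Smallest majority of replicas."""
    return replication_factor // 2 + 1

def reads_see_latest_write(read_replicas, write_replicas, replication_factor):
    # Read and write replica sets must overlap in at least one node.
    return read_replicas + write_replicas > replication_factor

rf = 3
print("QUORUM/QUORUM:", reads_see_latest_write(quorum(rf), quorum(rf), rf))
print("ONE/ONE:", reads_see_latest_write(1, 1, rf))
```

With a replication factor of 3, QUORUM reads and writes overlap, while ONE/ONE trades that guarantee for lower latency and higher availability.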
14. Nodes and distributions
• Data centers and racks; gossip runs every second
• Distributed by tokens from -2^63 to 2^63-1
• The token is a Murmur3 hash of the partition key
• Virtual Nodes
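The token ring with virtual nodes can be sketched as a sorted list of tokens and a binary search. This is an illustration under stated assumptions: MD5 truncated to a signed 64-bit integer stands in for Murmur3 (which is not in the Python standard library), vnode tokens are derived from node names rather than allocated as in a real cluster, and all function names are hypothetical.

```python
import bisect
import hashlib

MIN_TOKEN, MAX_TOKEN = -2**63, 2**63 - 1

def token_for(key: str) -> int:
    # Stand-in for Murmur3: first 8 bytes of MD5 as a signed 64-bit integer.
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:8], "big", signed=True)

def build_ring(nodes, vnodes_per_node=8):
    """Each physical node owns several tokens (virtual nodes) on the ring."""
    return sorted(
        (token_for(f"{node}-vnode-{i}"), node)
        for node in nodes
        for i in range(vnodes_per_node)
    )

def owner(ring, partition_key: str) -> str:
    tokens = [token for token, _ in ring]
    # The first vnode whose token is >= the key's token owns the key;
    # past the largest token, wrap around to the start of the ring.
    index = bisect.bisect_left(tokens, token_for(partition_key))
    return ring[index % len(ring)][1]

ring = build_ring(["node-a", "node-b", "node-c"])
for key in ["10.0.0.1:0", "10.0.0.1:1", "10.0.0.2:0"]:
    print(key, "->", owner(ring, key))
```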
16. Cassandra Objects
Column, which is a name/value pair
Row, which is a container for columns referenced by a primary key
Table, which is a container for rows
Keyspace, which is a container for tables
Cluster, which is a container for keyspaces that spans one or more nodes
CQL (Cassandra Query Language)
CREATE TABLE loads (
machine inet,
cpu int,
mtime timeuuid,
load float,
PRIMARY KEY ((machine, cpu), mtime)
) WITH CLUSTERING ORDER BY (mtime DESC);
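To make the primary key of the `loads` table concrete: `(machine, cpu)` is the partition key that decides where rows live, and `mtime` is the clustering column that orders rows inside a partition, newest first. The following is a toy in-memory model, not a Cassandra driver; the class name is hypothetical and integer timestamps stand in for `timeuuid` values.

```python
from collections import defaultdict

class LoadsTable:
    """Toy model of PRIMARY KEY ((machine, cpu), mtime)."""

    def __init__(self):
        # partition key -> {clustering key -> value}
        self.partitions = defaultdict(dict)

    def insert(self, machine, cpu, mtime, load):
        # Writing the same full primary key again overwrites the row.
        self.partitions[(machine, cpu)][mtime] = load

    def latest(self, machine, cpu, limit=1):
        rows = self.partitions.get((machine, cpu), {})
        # CLUSTERING ORDER BY (mtime DESC): newest rows come first.
        return sorted(rows.items(), reverse=True)[:limit]

table = LoadsTable()
table.insert("10.0.0.1", 0, 100, 0.25)
table.insert("10.0.0.1", 0, 200, 0.75)
table.insert("10.0.0.1", 1, 150, 0.50)
print(table.latest("10.0.0.1", 0))
```

Asking for the latest load of one CPU touches a single partition and reads from its head, which is the access pattern this schema is designed for.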
17. Internals
Memtable: in-memory write buffer, one per CQL table
Commit log: records all writes so data can be restored
SSTables: immutable data files on disk
Key cache: caches the locations of partition keys
Row cache: speeds up read access
Hints: stored write requests for failed nodes
Tombstones: markers for deleted rows
TTL: expires rows automatically
Updates are inserts
Inserts are updates
Compaction: merges SSTables
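How these pieces fit together can be shown with a toy write path. This is a deliberately simplified sketch, not Cassandra's implementation: there is no commit log, conflicts are resolved by SSTable age rather than by write timestamps, and compaction drops tombstones immediately, whereas Cassandra keeps them until `gc_grace_seconds` has passed. The class `ToyStore` is hypothetical.

```python
TOMBSTONE = object()  # marker written in place of a deleted value

class ToyStore:
    def __init__(self, memtable_limit=2):
        self.memtable = {}
        self.sstables = []  # immutable snapshots, oldest first
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        # Insert and update are the same operation: the newest value wins.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def delete(self, key):
        self.write(key, TOMBSTONE)  # a delete is just another write

    def flush(self):
        if self.memtable:
            self.sstables.append(dict(self.memtable))
            self.memtable = {}

    def read(self, key):
        # Check the memtable first, then SSTables from newest to oldest.
        for table in [self.memtable] + self.sstables[::-1]:
            if key in table:
                value = table[key]
                return None if value is TOMBSTONE else value
        return None

    def compact(self):
        merged = {}
        for table in self.sstables:  # newer tables overwrite older ones
            merged.update(table)
        live = {k: v for k, v in merged.items() if v is not TOMBSTONE}
        self.sstables = [live] if live else []

store = ToyStore()
store.write("a", 1)
store.write("b", 2)
store.write("a", 10)
store.delete("b")
print(len(store.sstables), store.read("a"), store.read("b"))
store.compact()
print(len(store.sstables), store.read("a"), store.read("b"))
```

Until compaction runs, old values and tombstones still occupy disk space, which is why deletes and overwrites do not free space immediately.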
18. CQL (Cassandra Query Language)
• Similar to SQL
• No Joins
• Keyspaces with replication factor
• Inserts vs Updates
• TTL: INSERT INTO myTable (id, myField) VALUES (2, 9) USING TTL 86400; /* 24 hours */
• DELETE is an INSERT (of a tombstone)
• Ordering and filtering only work reliably within a partition (always restrict by partition key)
/* Select Data within a range */
SELECT * FROM myTable WHERE myField > 5000 AND myField < 100000;
Bad Request: Cannot execute this query as it might involve data
filtering and thus may have unpredictable performance. If you want
to execute this query despite the performance unpredictability,
use ALLOW FILTERING.
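The reason for this error can be seen by counting how many rows each kind of query has to examine. The sketch below is a plain Python model with hypothetical function names and made-up sample data: a query restricted by partition key reads one partition, while a range filter on a non-key column has to scan all of them.

```python
def select_by_partition(partitions, partition_key, low, high):
    """Range filter inside one partition: only that partition is read."""
    rows = partitions.get(partition_key, [])
    matches = [value for value in rows if low < value < high]
    return matches, len(rows)

def select_with_filtering(partitions, low, high):
    """Range filter without a partition key: every partition is scanned."""
    examined = 0
    matches = []
    for rows in partitions.values():
        examined += len(rows)
        matches.extend(value for value in rows if low < value < high)
    return matches, examined

# Sample data for illustration: partition key -> myField values.
partitions = {1: [10, 6000], 2: [7000, 200000], 3: [3]}
print(select_by_partition(partitions, 2, 5000, 100000))
print(select_with_filtering(partitions, 5000, 100000))
```

On a real cluster the full scan spans every node, so its cost grows with the total data size rather than with the size of the result.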
20. Cassandra is not
A Data Lake
A Data Ocean
A Data Pond
A Data Warehouse
An In-memory Database
A Key-value store
A Magic database unicorn that farts rainbows
21. Cassandra in ML Cluster
[Diagram: Cassandra in ML Cluster. Components shown:]
• Producers (Crawl/Fetch App) collecting unstructured data (text), structured data and events from the Internet
• Kafka Cluster with Confluent Schema Registry and Confluent Connectors; streams: Raw Data Stream, Structured Data Stream, Scored Data Stream
• Kubernetes Cluster: Kafka Streams API (transformation, with transformation rules in Redis), Kafka Streams API (ML scoring, with trained models in Redis), REST API (Flask) serving trained models (Python) for classification, forecasting and clustering
• Data Center: Cassandra + Spark Cluster holding unstructured and structured data, operational data and scored data; Spark SQL
• Model Training Framework (Python, Anaconda, Jupyter)
• Dashboards (Pentaho), BI Tools (Power BI)
• Hadoop Cluster: historical data, HDFS, Spark, Hive
• Scalable DWH: ML results, Redshift / Postgres XL
• Enterprise integration: ESB (WSO2), ML results, Enterprise Applications
22. Cluster and results
Kafka Cluster
Nodes: 6
Amazon instance type: m4.2xlarge
CPU: 8
Memory: 32 GB
SSD: 100 GB
Topics: 3
Partitions: 6
Replication Factor: 3
Producers: 6
Average message size: 1 KB
Throughput: 440,000 messages / second
Cassandra Cluster
Nodes: 12
Amazon instance type: m4.2xlarge
CPU: 8
Memory: 32 GB
SSD: 800 GB
Replication Factor: 3
Average write latency: 9 ms
Average read latency: 52 ms
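The Kafka figures above translate into bandwidth with some back-of-the-envelope arithmetic. The sketch assumes that 1 KB means 1,000 bytes, that 440,000 messages per second is the aggregate producer rate, and that replicas are spread evenly over the 6 brokers; none of these is stated on the slide, and the function name is hypothetical.

```python
def kafka_throughput(messages_per_second, message_size_bytes, replication_factor, brokers):
    """Rough bandwidth estimate in MB/s (1 MB = 1,000,000 bytes)."""
    ingress = messages_per_second * message_size_bytes
    # Every message is stored once per replica.
    written = ingress * replication_factor
    return {
        "ingress_mb_s": ingress / 1e6,
        "written_mb_s": written / 1e6,
        # Assumes replicas are spread evenly across brokers.
        "written_per_broker_mb_s": written / brokers / 1e6,
    }

stats = kafka_throughput(440_000, 1_000, replication_factor=3, brokers=6)
for name, value in stats.items():
    print(f"{name}: {value:.0f}")
```

Under these assumptions the cluster ingests about 440 MB/s and writes about 1,320 MB/s in total, or roughly 220 MB/s per broker.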
23. Lessons learned
Design your DB carefully at the beginning, around the queries
Cassandra is not an RDBMS: select by partition keys
Deep understanding of internals
Compaction is Hell.
Eventual consistency.
Where is my Disk Space?
Very Expensive! (lots of nodes)