SlideShare a Scribd company logo
1 of 24
Vitalii Bondarenko
Eleks
Fast Data Platform
for Real-Time Analytics. Architecture And Approaches
Agenda
 Fast data vs Big Data
 Kafka overview
 Cassandra Internals and Programming
 Architectures and Approaches
 Lessons learned
Big Data Approach
RDBMS Approach
• Massive Parallel Processing (Scalability)
• In-memory DB (Streaming and
compressing)
• Colum stores (BI)
Big Data Approach
• Hadoop (HDFS + MapReduce)
• SQL on HDFS
• Scalable NoSQL
• Batch issue
HDFS (Hadoop Distributed File System)
• Data is spited into blocks and
distributed across the nodes
• Nodes are cheap
• Block size is 64 or 124 MB
• Replication
• Files are typical not updated
• Read data from the beginning
to the end
• Smaller number of larger files
Lambda architecture
Batch & Stream Processing
• Batch layer
• Stores master dataset
• Compute arbitrary views
• Horizontally Scalable
• Speed layer (Streaming)
• Fast, incremental algorithms
• Batch layer eventually overrides speed
layer
• Serving layer
• Random access to batch views
• Updated by batch and Streaming layer
Kappa architecture
Stream Processing with Scalable Storages
• Everything is a stream
• Immutable unstructured data sources
• Single analytics framework
• Windows on Streaming Layer
• Linearly scalable Serving Layer
• Interactive querying
Azure Streaming Analytics
• Easy to use
• Scalable
• Connectivity
• SQL, UDF, Reference Data
Fast Data Platform
• Real-time processing
• In-memory analytics
Fog Computing /
Service Bus
Kafka
Cluster
• Row Data fast writing
• Scalable
Connectors(Source)
Connectors(Sink)
Stream
Stream
Stream
Tasks
Stream
Tasks
. . .
Hadoop Cluster Cheep Data StorageBI Tools
Cassandra
Cluster
Cassandra Replication
spar
k
spar
k
spar
k
spar
k
spar
k
spar
k
1
2
3
4
5
n
1
2
3
4
5
nCassandra
Cluster
Write-Heavy
Stream
Analytics
Row Data
Cassandra
Cluster
Analytics
Apache Kafka
• From LinkedIn, Open Source from 2012
• Service Bus
• Small messages (events)
• Scalable Broker System
• Durable and Distributed
• Very fast (parallelism on partitions)
• No removes from queue, retention
• Streaming processing capability
LinkedIn:
• 1400 brokers
• 13M+ messages/sec
• 2.75GB per second
Brokers, Topics
• Distributed Service Bus
• Broker as virtual servers
• Topics as logical data storage
Writes and reads
• Append Only
• Commit log
• Consumer offset (from beginning)
• Commit read to Kafka topic _consumer_offsets
• Commit when read data
• Retention period 7 days
• Parallel read with consumer groups
• Confluent Platform
Data Sources Enterprise Analytic Suite Data Destination
Clickstream
Shopping
Behaviour
Orders
Purchases
Inventory
Routs
Catalog
Prices
Campaigns
Customers
LocationsDATA LAKE
FAST
DATA
ESB
DWH
Cost Optimization
Profit Maximization
Utility Maximization
DATA LAKE
FAST
DATA
ESB
DWH
Data Visualization
Dashboard
Service Bus
Kafka
Machine Learning Cluster Kubernetes /OpenShift
Kafka Stream API
DataConnectorsMapping
DataConnectorsMapping
Capacity
Planing
Product
Recommendation
Customer
Segmentation
Price
Recommendation
Routs
Optimization
Demand
Prediction
Docker
Containers
Docker
Containers
Docker
Containers
Docker
Containers
Docker
Containers
Docker
Containers
ML Cluster on Streaming Data
Apache Cassandra
• Multi-master, low-latency, shared nothing
• Distributed
• No single point of failure
• Linearly Scalable
• Multi-datacenter configuration
• AP with tunable consistency
Nodes and distributions
• Data Centers and Racks, Gossip (each 1 sec)
• Distributed by Tokens from -2^63 to 2^63-1
• Hash from partition key. Murmur3
• Virtual Nodes
Coordinator Nodes
• Replication Strategy (SimpleStrategy,
NetworkTopologyStrategy)
• Replication factor (usually 3)
• Consistency Levels (One, Two, Three, Any, All,
Quorum, Local_Quorum, Local_One…)
• Tunable consistency, strong and eventural
• (R +W) > N
Cassandra Objects
Column, which is a name/value pair
Row, which is a container for columns referenced
by a primary key
Table, which is a container for rows
Keyspace, which is a container for tables
Cluster, which is a container for keyspaces that
spans one or more nodes
CQL (Cassandra Query Language)
CREATE TABLE loads (
machine inet,
cpu int,
mtime timeuuid,
load float,
PRIMARY KEY ((machine, cpu), mtime)
) WITH CLUSTERING ORDER BY (mtime DESC);
Internals
Memtable corresponds to CQL Tale
Commit Logs all data for data restoring
SSTables for data in immutable saves
Key Caches caching map of partitions keys
Row Caches for read access speed up
Hints for write request for failed nodes
Tombstones for deleted rows
TTL for deleting rows
Updates are Inserts
Inserts are Updates
Compaction for merging SSTables
CQL (Cassandra Query Language)
• Similar to SQL
• No Joins
• Keyspaces with replication factor
• Inserts vs Updates
• TTL INSERT INTO myTable (id, myField) VALUES (2, 9) USING TTL 86400; /*24H*/
• DELETE is INSERT
• Ordering and Filtering is not working sometimes (always use partition key)
/* Select Data within a range */
SELECT * FROM myTable WHERE myField > 5000 AND myField < 100000;
Bad Request: Cannot execute this query as it might involve data
filtering and thus may have unpredictable performance. If you want
to execute this query despite the performance unpredictability,
use ALLOW FILTERING.
Data Modeling: Query-First Design
Cassandra is not
A Data Lake
A Data Ocean
A Data Pond
A Data Warehouse
A In-memory Database
A Key-value store
A Magic database unicorn that fairs rainbow
Data
Center
Cassandra + Spark Cluster
Unstructed and
Structured
Data
Operational Data
Spark SQL
Dashboards
(Pentaho)
Model Traning
Framework (Python,
Anaconda, jupyter
BI Tools
(Power BI)
Hadoop Cluster
Scalable DWH
ML Results
Historial Data
HDFS
Spark
Hive
Redshift /
Postgres XL
Enterprise integration
Enterprise
Applications
ML ResultsESB
(WSO2)
ScoredData
Kubernetes Cluster
Transformation
Rules REDIS
Trained Models
REDIS
Kafka Sreams
API
Transformation
Kafka Sreams
API
ML Scoring
REST API
(Flask)
Trained Models
(Python)
Raw Data Stream
Srructured Data
Strem
Scored Data
Stream
Kafka Cluster
Confluent Schema Regestry
ConfluentConnectors
ConfluentConnectors
Casfcation,
Forecasting,
Clusterization
UnstructuredData(Tex)
StructuredData
Kubernetes Cluster
Unstructured Data
(Text)
Structured Data
Events
Producers
Crawl/Fetch App
Producers
Crawl/Fetch App
Producers
Crawl/Fetch App
Internet
Cassandra in ML Cluster
Cluster and results
Kafka Cluster
Nodes: 6
Amazon instance type: m4.2xlarge
CPU: 8
Memory: 32 Gb
SSD: 100Gb
Topics: 3
Partitions: 6
Replication Factor: 3
Producers: 6
Average message size: 1Kb
440 000 messages / second
Cassandra Cluster
Nodes: 12
Amazon instance type: m4.2xlarge
CPU: 8
Memory: 32 Gb
SSD: 800Gb
Replication Factor: 3
Average write latency: 9 ms
Average read latency: 52 ms
Lessons learned
Design you DB carefully at the beginning for queries
Cassandra is not RDBMS, select by partition keys
Deep understanding of internals
Compaction is Hell.
Eventual consistency.
Were is my Disk Space?
Very Expensive! (lots of nodes)
www.eleks.comwww.eleks.com
Q&A
Vitalii Bondarenko
vitaliy.bondarenko@eleks.com

More Related Content

What's hot

HBaseConAsia2018 Track3-2: HBase at China Telecom
HBaseConAsia2018 Track3-2:  HBase at China TelecomHBaseConAsia2018 Track3-2:  HBase at China Telecom
HBaseConAsia2018 Track3-2: HBase at China TelecomMichael Stack
 
Bullet: A Real Time Data Query Engine
Bullet: A Real Time Data Query EngineBullet: A Real Time Data Query Engine
Bullet: A Real Time Data Query EngineDataWorks Summit
 
HBaseConAsia2018 Keynote 2: Recent Development of HBase in Alibaba and Cloud
HBaseConAsia2018 Keynote 2: Recent Development of HBase in Alibaba and CloudHBaseConAsia2018 Keynote 2: Recent Development of HBase in Alibaba and Cloud
HBaseConAsia2018 Keynote 2: Recent Development of HBase in Alibaba and CloudMichael Stack
 
HBaseConAsia2018 Track2-6: Scaling 30TB's of data lake with Apache HBase and ...
HBaseConAsia2018 Track2-6: Scaling 30TB's of data lake with Apache HBase and ...HBaseConAsia2018 Track2-6: Scaling 30TB's of data lake with Apache HBase and ...
HBaseConAsia2018 Track2-6: Scaling 30TB's of data lake with Apache HBase and ...Michael Stack
 
Data Pipelines with Spark & DataStax Enterprise
Data Pipelines with Spark & DataStax EnterpriseData Pipelines with Spark & DataStax Enterprise
Data Pipelines with Spark & DataStax EnterpriseDataStax
 
Cisco: Cassandra adoption on Cisco UCS & OpenStack
Cisco: Cassandra adoption on Cisco UCS & OpenStackCisco: Cassandra adoption on Cisco UCS & OpenStack
Cisco: Cassandra adoption on Cisco UCS & OpenStackDataStax Academy
 
Cassandra Summit 2014: Apache Cassandra Best Practices at Ebay
Cassandra Summit 2014: Apache Cassandra Best Practices at EbayCassandra Summit 2014: Apache Cassandra Best Practices at Ebay
Cassandra Summit 2014: Apache Cassandra Best Practices at EbayDataStax Academy
 
Case studies session 2
Case studies   session 2Case studies   session 2
Case studies session 2HBaseCon
 
HBaseConAsia2018: Track2-5: JanusGraph-Distributed graph database with HBase
HBaseConAsia2018: Track2-5: JanusGraph-Distributed graph database with HBaseHBaseConAsia2018: Track2-5: JanusGraph-Distributed graph database with HBase
HBaseConAsia2018: Track2-5: JanusGraph-Distributed graph database with HBaseMichael Stack
 
HBaseConAsia2018 Track2-3: Bringing MySQL Compatibility to HBase using Databa...
HBaseConAsia2018 Track2-3: Bringing MySQL Compatibility to HBase using Databa...HBaseConAsia2018 Track2-3: Bringing MySQL Compatibility to HBase using Databa...
HBaseConAsia2018 Track2-3: Bringing MySQL Compatibility to HBase using Databa...Michael Stack
 
Webinar : Nouveautés de MongoDB 3.2
Webinar : Nouveautés de MongoDB 3.2Webinar : Nouveautés de MongoDB 3.2
Webinar : Nouveautés de MongoDB 3.2MongoDB
 
HBaseConAsia2018 Track3-6: HBase at Meituan
HBaseConAsia2018 Track3-6: HBase at MeituanHBaseConAsia2018 Track3-6: HBase at Meituan
HBaseConAsia2018 Track3-6: HBase at MeituanMichael Stack
 
Análisis del roadmap del Elastic Stack
Análisis del roadmap del Elastic StackAnálisis del roadmap del Elastic Stack
Análisis del roadmap del Elastic StackElasticsearch
 
Cassandra vs. ScyllaDB: Evolutionary Differences
Cassandra vs. ScyllaDB: Evolutionary DifferencesCassandra vs. ScyllaDB: Evolutionary Differences
Cassandra vs. ScyllaDB: Evolutionary DifferencesScyllaDB
 
Amazon Elastic Map Reduce - Ian Meyers
Amazon Elastic Map Reduce - Ian MeyersAmazon Elastic Map Reduce - Ian Meyers
Amazon Elastic Map Reduce - Ian Meyershuguk
 
MongoDB vs Mysql. A devops point of view
MongoDB vs Mysql. A devops point of viewMongoDB vs Mysql. A devops point of view
MongoDB vs Mysql. A devops point of viewPierre Baillet
 
HBaseConAsia2018 Track3-7: The application of HBase in New Energy Vehicle Mon...
HBaseConAsia2018 Track3-7: The application of HBase in New Energy Vehicle Mon...HBaseConAsia2018 Track3-7: The application of HBase in New Energy Vehicle Mon...
HBaseConAsia2018 Track3-7: The application of HBase in New Energy Vehicle Mon...Michael Stack
 

What's hot (20)

HBaseConAsia2018 Track3-2: HBase at China Telecom
HBaseConAsia2018 Track3-2:  HBase at China TelecomHBaseConAsia2018 Track3-2:  HBase at China Telecom
HBaseConAsia2018 Track3-2: HBase at China Telecom
 
Bullet: A Real Time Data Query Engine
Bullet: A Real Time Data Query EngineBullet: A Real Time Data Query Engine
Bullet: A Real Time Data Query Engine
 
HBaseConAsia2018 Keynote 2: Recent Development of HBase in Alibaba and Cloud
HBaseConAsia2018 Keynote 2: Recent Development of HBase in Alibaba and CloudHBaseConAsia2018 Keynote 2: Recent Development of HBase in Alibaba and Cloud
HBaseConAsia2018 Keynote 2: Recent Development of HBase in Alibaba and Cloud
 
HBaseConAsia2018 Track2-6: Scaling 30TB's of data lake with Apache HBase and ...
HBaseConAsia2018 Track2-6: Scaling 30TB's of data lake with Apache HBase and ...HBaseConAsia2018 Track2-6: Scaling 30TB's of data lake with Apache HBase and ...
HBaseConAsia2018 Track2-6: Scaling 30TB's of data lake with Apache HBase and ...
 
Data Pipelines with Spark & DataStax Enterprise
Data Pipelines with Spark & DataStax EnterpriseData Pipelines with Spark & DataStax Enterprise
Data Pipelines with Spark & DataStax Enterprise
 
Cisco: Cassandra adoption on Cisco UCS & OpenStack
Cisco: Cassandra adoption on Cisco UCS & OpenStackCisco: Cassandra adoption on Cisco UCS & OpenStack
Cisco: Cassandra adoption on Cisco UCS & OpenStack
 
Cassandra in e-commerce
Cassandra in e-commerceCassandra in e-commerce
Cassandra in e-commerce
 
Cassandra Summit 2014: Apache Cassandra Best Practices at Ebay
Cassandra Summit 2014: Apache Cassandra Best Practices at EbayCassandra Summit 2014: Apache Cassandra Best Practices at Ebay
Cassandra Summit 2014: Apache Cassandra Best Practices at Ebay
 
Case studies session 2
Case studies   session 2Case studies   session 2
Case studies session 2
 
HBaseConAsia2018: Track2-5: JanusGraph-Distributed graph database with HBase
HBaseConAsia2018: Track2-5: JanusGraph-Distributed graph database with HBaseHBaseConAsia2018: Track2-5: JanusGraph-Distributed graph database with HBase
HBaseConAsia2018: Track2-5: JanusGraph-Distributed graph database with HBase
 
HBaseConAsia2018 Track2-3: Bringing MySQL Compatibility to HBase using Databa...
HBaseConAsia2018 Track2-3: Bringing MySQL Compatibility to HBase using Databa...HBaseConAsia2018 Track2-3: Bringing MySQL Compatibility to HBase using Databa...
HBaseConAsia2018 Track2-3: Bringing MySQL Compatibility to HBase using Databa...
 
Webinar : Nouveautés de MongoDB 3.2
Webinar : Nouveautés de MongoDB 3.2Webinar : Nouveautés de MongoDB 3.2
Webinar : Nouveautés de MongoDB 3.2
 
HBaseConAsia2018 Track3-6: HBase at Meituan
HBaseConAsia2018 Track3-6: HBase at MeituanHBaseConAsia2018 Track3-6: HBase at Meituan
HBaseConAsia2018 Track3-6: HBase at Meituan
 
Análisis del roadmap del Elastic Stack
Análisis del roadmap del Elastic StackAnálisis del roadmap del Elastic Stack
Análisis del roadmap del Elastic Stack
 
How and when to use NoSQL
How and when to use NoSQLHow and when to use NoSQL
How and when to use NoSQL
 
Cassandra vs. ScyllaDB: Evolutionary Differences
Cassandra vs. ScyllaDB: Evolutionary DifferencesCassandra vs. ScyllaDB: Evolutionary Differences
Cassandra vs. ScyllaDB: Evolutionary Differences
 
Amazon Elastic Map Reduce - Ian Meyers
Amazon Elastic Map Reduce - Ian MeyersAmazon Elastic Map Reduce - Ian Meyers
Amazon Elastic Map Reduce - Ian Meyers
 
MongoDB vs Mysql. A devops point of view
MongoDB vs Mysql. A devops point of viewMongoDB vs Mysql. A devops point of view
MongoDB vs Mysql. A devops point of view
 
HBaseConAsia2018 Track3-7: The application of HBase in New Energy Vehicle Mon...
HBaseConAsia2018 Track3-7: The application of HBase in New Energy Vehicle Mon...HBaseConAsia2018 Track3-7: The application of HBase in New Energy Vehicle Mon...
HBaseConAsia2018 Track3-7: The application of HBase in New Energy Vehicle Mon...
 
tdtechtalk20160330johan
tdtechtalk20160330johantdtechtalk20160330johan
tdtechtalk20160330johan
 

Similar to Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture And Approaches."

Vitalii Bondarenko - “Azure real-time analytics and kappa architecture with K...
Vitalii Bondarenko - “Azure real-time analytics and kappa architecture with K...Vitalii Bondarenko - “Azure real-time analytics and kappa architecture with K...
Vitalii Bondarenko - “Azure real-time analytics and kappa architecture with K...Lviv Startup Club
 
20160331 sa introduction to big data pipelining berlin meetup 0.3
20160331 sa introduction to big data pipelining berlin meetup   0.320160331 sa introduction to big data pipelining berlin meetup   0.3
20160331 sa introduction to big data pipelining berlin meetup 0.3Simon Ambridge
 
How to use Big Data and Data Lake concept in business using Hadoop and Spark...
 How to use Big Data and Data Lake concept in business using Hadoop and Spark... How to use Big Data and Data Lake concept in business using Hadoop and Spark...
How to use Big Data and Data Lake concept in business using Hadoop and Spark...Institute of Contemporary Sciences
 
BI, Reporting and Analytics on Apache Cassandra
BI, Reporting and Analytics on Apache CassandraBI, Reporting and Analytics on Apache Cassandra
BI, Reporting and Analytics on Apache CassandraVictor Coustenoble
 
Apache Cassandra introduction
Apache Cassandra introductionApache Cassandra introduction
Apache Cassandra introductionfardinjamshidi
 
Otimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Otimizações de Projetos de Big Data, Dw e AI no Microsoft AzureOtimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Otimizações de Projetos de Big Data, Dw e AI no Microsoft AzureLuan Moreno Medeiros Maciel
 
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...Data Con LA
 
Move your on prem data to a lake in a Lake in Cloud
Move your on prem data to a lake in a Lake in CloudMove your on prem data to a lake in a Lake in Cloud
Move your on prem data to a lake in a Lake in CloudCAMMS
 
Best practices on Building a Big Data Analytics Solution (SQLBits 2018 Traini...
Best practices on Building a Big Data Analytics Solution (SQLBits 2018 Traini...Best practices on Building a Big Data Analytics Solution (SQLBits 2018 Traini...
Best practices on Building a Big Data Analytics Solution (SQLBits 2018 Traini...Michael Rys
 
Ai big dataconference_ml_fastdata_vitalii bondarenko
Ai big dataconference_ml_fastdata_vitalii bondarenkoAi big dataconference_ml_fastdata_vitalii bondarenko
Ai big dataconference_ml_fastdata_vitalii bondarenkoOlga Zinkevych
 
Vitalii Bondarenko "Machine Learning on Fast Data"
Vitalii Bondarenko "Machine Learning on Fast Data"Vitalii Bondarenko "Machine Learning on Fast Data"
Vitalii Bondarenko "Machine Learning on Fast Data"DataConf
 
Apache Tajo - An open source big data warehouse
Apache Tajo - An open source big data warehouseApache Tajo - An open source big data warehouse
Apache Tajo - An open source big data warehousehadoopsphere
 
DataStax - Analytics on Apache Cassandra - Paris Tech Talks meetup
DataStax - Analytics on Apache Cassandra - Paris Tech Talks meetupDataStax - Analytics on Apache Cassandra - Paris Tech Talks meetup
DataStax - Analytics on Apache Cassandra - Paris Tech Talks meetupVictor Coustenoble
 
SQL, NoSQL, Distributed SQL: Choose your DataStore carefully
SQL, NoSQL, Distributed SQL: Choose your DataStore carefullySQL, NoSQL, Distributed SQL: Choose your DataStore carefully
SQL, NoSQL, Distributed SQL: Choose your DataStore carefullyMd Kamaruzzaman
 
Prague data management meetup 2018-03-27
Prague data management meetup 2018-03-27Prague data management meetup 2018-03-27
Prague data management meetup 2018-03-27Martin Bém
 
Azure DocumentDB Overview
Azure DocumentDB OverviewAzure DocumentDB Overview
Azure DocumentDB OverviewAndrew Liu
 

Similar to Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture And Approaches." (20)

Vitalii Bondarenko - “Azure real-time analytics and kappa architecture with K...
Vitalii Bondarenko - “Azure real-time analytics and kappa architecture with K...Vitalii Bondarenko - “Azure real-time analytics and kappa architecture with K...
Vitalii Bondarenko - “Azure real-time analytics and kappa architecture with K...
 
20160331 sa introduction to big data pipelining berlin meetup 0.3
20160331 sa introduction to big data pipelining berlin meetup   0.320160331 sa introduction to big data pipelining berlin meetup   0.3
20160331 sa introduction to big data pipelining berlin meetup 0.3
 
Data engineering
Data engineeringData engineering
Data engineering
 
How to use Big Data and Data Lake concept in business using Hadoop and Spark...
 How to use Big Data and Data Lake concept in business using Hadoop and Spark... How to use Big Data and Data Lake concept in business using Hadoop and Spark...
How to use Big Data and Data Lake concept in business using Hadoop and Spark...
 
Cassandra training
Cassandra trainingCassandra training
Cassandra training
 
BI, Reporting and Analytics on Apache Cassandra
BI, Reporting and Analytics on Apache CassandraBI, Reporting and Analytics on Apache Cassandra
BI, Reporting and Analytics on Apache Cassandra
 
Apache Cassandra introduction
Apache Cassandra introductionApache Cassandra introduction
Apache Cassandra introduction
 
Otimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Otimizações de Projetos de Big Data, Dw e AI no Microsoft AzureOtimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Otimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
 
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
 
Move your on prem data to a lake in a Lake in Cloud
Move your on prem data to a lake in a Lake in CloudMove your on prem data to a lake in a Lake in Cloud
Move your on prem data to a lake in a Lake in Cloud
 
Best practices on Building a Big Data Analytics Solution (SQLBits 2018 Traini...
Best practices on Building a Big Data Analytics Solution (SQLBits 2018 Traini...Best practices on Building a Big Data Analytics Solution (SQLBits 2018 Traini...
Best practices on Building a Big Data Analytics Solution (SQLBits 2018 Traini...
 
Ai big dataconference_ml_fastdata_vitalii bondarenko
Ai big dataconference_ml_fastdata_vitalii bondarenkoAi big dataconference_ml_fastdata_vitalii bondarenko
Ai big dataconference_ml_fastdata_vitalii bondarenko
 
Vitalii Bondarenko "Machine Learning on Fast Data"
Vitalii Bondarenko "Machine Learning on Fast Data"Vitalii Bondarenko "Machine Learning on Fast Data"
Vitalii Bondarenko "Machine Learning on Fast Data"
 
NoSQL_Night
NoSQL_NightNoSQL_Night
NoSQL_Night
 
Apache Tajo - An open source big data warehouse
Apache Tajo - An open source big data warehouseApache Tajo - An open source big data warehouse
Apache Tajo - An open source big data warehouse
 
DataStax - Analytics on Apache Cassandra - Paris Tech Talks meetup
DataStax - Analytics on Apache Cassandra - Paris Tech Talks meetupDataStax - Analytics on Apache Cassandra - Paris Tech Talks meetup
DataStax - Analytics on Apache Cassandra - Paris Tech Talks meetup
 
SQL, NoSQL, Distributed SQL: Choose your DataStore carefully
SQL, NoSQL, Distributed SQL: Choose your DataStore carefullySQL, NoSQL, Distributed SQL: Choose your DataStore carefully
SQL, NoSQL, Distributed SQL: Choose your DataStore carefully
 
Prague data management meetup 2018-03-27
Prague data management meetup 2018-03-27Prague data management meetup 2018-03-27
Prague data management meetup 2018-03-27
 
SQL on Hadoop
SQL on HadoopSQL on Hadoop
SQL on Hadoop
 
Azure DocumentDB Overview
Azure DocumentDB OverviewAzure DocumentDB Overview
Azure DocumentDB Overview
 

More from Fwdays

"How Preply reduced ML model development time from 1 month to 1 day",Yevhen Y...
"How Preply reduced ML model development time from 1 month to 1 day",Yevhen Y..."How Preply reduced ML model development time from 1 month to 1 day",Yevhen Y...
"How Preply reduced ML model development time from 1 month to 1 day",Yevhen Y...Fwdays
 
"GenAI Apps: Our Journey from Ideas to Production Excellence",Danil Topchii
"GenAI Apps: Our Journey from Ideas to Production Excellence",Danil Topchii"GenAI Apps: Our Journey from Ideas to Production Excellence",Danil Topchii
"GenAI Apps: Our Journey from Ideas to Production Excellence",Danil TopchiiFwdays
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
"What is a RAG system and how to build it",Dmytro Spodarets
"What is a RAG system and how to build it",Dmytro Spodarets"What is a RAG system and how to build it",Dmytro Spodarets
"What is a RAG system and how to build it",Dmytro SpodaretsFwdays
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
"Distributed graphs and microservices in Prom.ua", Maksym Kindritskyi
"Distributed graphs and microservices in Prom.ua",  Maksym Kindritskyi"Distributed graphs and microservices in Prom.ua",  Maksym Kindritskyi
"Distributed graphs and microservices in Prom.ua", Maksym KindritskyiFwdays
 
"Rethinking the existing data loading and processing process as an ETL exampl...
"Rethinking the existing data loading and processing process as an ETL exampl..."Rethinking the existing data loading and processing process as an ETL exampl...
"Rethinking the existing data loading and processing process as an ETL exampl...Fwdays
 
"How Ukrainian IT specialist can go on vacation abroad without crossing the T...
"How Ukrainian IT specialist can go on vacation abroad without crossing the T..."How Ukrainian IT specialist can go on vacation abroad without crossing the T...
"How Ukrainian IT specialist can go on vacation abroad without crossing the T...Fwdays
 
"The Strength of Being Vulnerable: the experience from CIA, Tesla and Uber", ...
"The Strength of Being Vulnerable: the experience from CIA, Tesla and Uber", ..."The Strength of Being Vulnerable: the experience from CIA, Tesla and Uber", ...
"The Strength of Being Vulnerable: the experience from CIA, Tesla and Uber", ...Fwdays
 
"[QUICK TALK] Radical candor: how to achieve results faster thanks to a cultu...
"[QUICK TALK] Radical candor: how to achieve results faster thanks to a cultu..."[QUICK TALK] Radical candor: how to achieve results faster thanks to a cultu...
"[QUICK TALK] Radical candor: how to achieve results faster thanks to a cultu...Fwdays
 
"[QUICK TALK] PDP Plan, the only one door to raise your salary and boost care...
"[QUICK TALK] PDP Plan, the only one door to raise your salary and boost care..."[QUICK TALK] PDP Plan, the only one door to raise your salary and boost care...
"[QUICK TALK] PDP Plan, the only one door to raise your salary and boost care...Fwdays
 
"4 horsemen of the apocalypse of working relationships (+ antidotes to them)"...
"4 horsemen of the apocalypse of working relationships (+ antidotes to them)"..."4 horsemen of the apocalypse of working relationships (+ antidotes to them)"...
"4 horsemen of the apocalypse of working relationships (+ antidotes to them)"...Fwdays
 
"Reconnecting with Purpose: Rediscovering Job Interest after Burnout", Anast...
"Reconnecting with Purpose: Rediscovering Job Interest after Burnout",  Anast..."Reconnecting with Purpose: Rediscovering Job Interest after Burnout",  Anast...
"Reconnecting with Purpose: Rediscovering Job Interest after Burnout", Anast...Fwdays
 
"Mentoring 101: How to effectively invest experience in the success of others...
"Mentoring 101: How to effectively invest experience in the success of others..."Mentoring 101: How to effectively invest experience in the success of others...
"Mentoring 101: How to effectively invest experience in the success of others...Fwdays
 
"Mission (im) possible: How to get an offer in 2024?", Oleksandra Myronova
"Mission (im) possible: How to get an offer in 2024?",  Oleksandra Myronova"Mission (im) possible: How to get an offer in 2024?",  Oleksandra Myronova
"Mission (im) possible: How to get an offer in 2024?", Oleksandra MyronovaFwdays
 
"Why have we learned how to package products, but not how to 'package ourselv...
"Why have we learned how to package products, but not how to 'package ourselv..."Why have we learned how to package products, but not how to 'package ourselv...
"Why have we learned how to package products, but not how to 'package ourselv...Fwdays
 
"How to tame the dragon, or leadership with imposter syndrome", Oleksandr Zin...
"How to tame the dragon, or leadership with imposter syndrome", Oleksandr Zin..."How to tame the dragon, or leadership with imposter syndrome", Oleksandr Zin...
"How to tame the dragon, or leadership with imposter syndrome", Oleksandr Zin...Fwdays
 

More from Fwdays (20)

"How Preply reduced ML model development time from 1 month to 1 day",Yevhen Y...
"How Preply reduced ML model development time from 1 month to 1 day",Yevhen Y..."How Preply reduced ML model development time from 1 month to 1 day",Yevhen Y...
"How Preply reduced ML model development time from 1 month to 1 day",Yevhen Y...
 
"GenAI Apps: Our Journey from Ideas to Production Excellence",Danil Topchii
"GenAI Apps: Our Journey from Ideas to Production Excellence",Danil Topchii"GenAI Apps: Our Journey from Ideas to Production Excellence",Danil Topchii
"GenAI Apps: Our Journey from Ideas to Production Excellence",Danil Topchii
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
"What is a RAG system and how to build it",Dmytro Spodarets
"What is a RAG system and how to build it",Dmytro Spodarets"What is a RAG system and how to build it",Dmytro Spodarets
"What is a RAG system and how to build it",Dmytro Spodarets
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
"Distributed graphs and microservices in Prom.ua", Maksym Kindritskyi
"Distributed graphs and microservices in Prom.ua",  Maksym Kindritskyi"Distributed graphs and microservices in Prom.ua",  Maksym Kindritskyi
"Distributed graphs and microservices in Prom.ua", Maksym Kindritskyi
 
"Rethinking the existing data loading and processing process as an ETL exampl...
"Rethinking the existing data loading and processing process as an ETL exampl..."Rethinking the existing data loading and processing process as an ETL exampl...
"Rethinking the existing data loading and processing process as an ETL exampl...
 
"How Ukrainian IT specialist can go on vacation abroad without crossing the T...
"How Ukrainian IT specialist can go on vacation abroad without crossing the T..."How Ukrainian IT specialist can go on vacation abroad without crossing the T...
"How Ukrainian IT specialist can go on vacation abroad without crossing the T...
 
"The Strength of Being Vulnerable: the experience from CIA, Tesla and Uber", ...
"The Strength of Being Vulnerable: the experience from CIA, Tesla and Uber", ..."The Strength of Being Vulnerable: the experience from CIA, Tesla and Uber", ...
"The Strength of Being Vulnerable: the experience from CIA, Tesla and Uber", ...
 
"[QUICK TALK] Radical candor: how to achieve results faster thanks to a cultu...
"[QUICK TALK] Radical candor: how to achieve results faster thanks to a cultu..."[QUICK TALK] Radical candor: how to achieve results faster thanks to a cultu...
"[QUICK TALK] Radical candor: how to achieve results faster thanks to a cultu...
 
"[QUICK TALK] PDP Plan, the only one door to raise your salary and boost care...
"[QUICK TALK] PDP Plan, the only one door to raise your salary and boost care..."[QUICK TALK] PDP Plan, the only one door to raise your salary and boost care...
"[QUICK TALK] PDP Plan, the only one door to raise your salary and boost care...
 
"4 horsemen of the apocalypse of working relationships (+ antidotes to them)"...
"4 horsemen of the apocalypse of working relationships (+ antidotes to them)"..."4 horsemen of the apocalypse of working relationships (+ antidotes to them)"...
"4 horsemen of the apocalypse of working relationships (+ antidotes to them)"...
 
"Reconnecting with Purpose: Rediscovering Job Interest after Burnout", Anast...
"Reconnecting with Purpose: Rediscovering Job Interest after Burnout",  Anast..."Reconnecting with Purpose: Rediscovering Job Interest after Burnout",  Anast...
"Reconnecting with Purpose: Rediscovering Job Interest after Burnout", Anast...
 
"Mentoring 101: How to effectively invest experience in the success of others...
"Mentoring 101: How to effectively invest experience in the success of others..."Mentoring 101: How to effectively invest experience in the success of others...
"Mentoring 101: How to effectively invest experience in the success of others...
 
"Mission (im) possible: How to get an offer in 2024?", Oleksandra Myronova
"Mission (im) possible: How to get an offer in 2024?",  Oleksandra Myronova"Mission (im) possible: How to get an offer in 2024?",  Oleksandra Myronova
"Mission (im) possible: How to get an offer in 2024?", Oleksandra Myronova
 
"Why have we learned how to package products, but not how to 'package ourselv...
"Why have we learned how to package products, but not how to 'package ourselv..."Why have we learned how to package products, but not how to 'package ourselv...
"Why have we learned how to package products, but not how to 'package ourselv...
 
"How to tame the dragon, or leadership with imposter syndrome", Oleksandr Zin...
"How to tame the dragon, or leadership with imposter syndrome", Oleksandr Zin..."How to tame the dragon, or leadership with imposter syndrome", Oleksandr Zin...
"How to tame the dragon, or leadership with imposter syndrome", Oleksandr Zin...
 

Recently uploaded

Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfSeasiaInfotech2
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 

Recently uploaded (20)

Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdf
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 

Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture And Approaches."

  • 1. Vitalii Bondarenko Eleks Fast Data Platform for Real-Time Analytics. Architecture And Approaches
  • 2. Agenda  Fast data vs Big Data  Kafka overview  Cassandra Internals and Programming  Architectures and Approaches  Lessons learned
  • 3. Big Data Approach RDBMS Approach • Massive Parallel Processing (Scalability) • In-memory DB (Streaming and compressing) • Colum stores (BI) Big Data Approach • Hadoop (HDFS + MapReduce) • SQL on HDFS • Scalable NoSQL • Batch issue
  • 4. HDFS (Hadoop Distributed File System) • Data is spited into blocks and distributed across the nodes • Nodes are cheap • Block size is 64 or 124 MB • Replication • Files are typical not updated • Read data from the beginning to the end • Smaller number of larger files
  • 5. Lambda architecture Batch & Stream Processing • Batch layer • Stores master dataset • Compute arbitrary views • Horizontally Scalable • Speed layer (Streaming) • Fast, incremental algorithms • Batch layer eventually overrides speed layer • Serving layer • Random access to batch views • Updated by batch and Streaming layer
  • 6. Kappa architecture Stream Processing with Scalable Storages • Everything is a stream • Immutable unstructured data sources • Single analytics framework • Windows on Streaming Layer • Linearly scalable Serving Layer • Interactive querying
  • 7. Azure Streaming Analytics • Easy to use • Scalable • Connectivity • SQL, UDF, Reference Data
  • 8. Fast Data Platform • Real-time processing • In-memory analytics Fog Computing / Service Bus Kafka Cluster • Row Data fast writing • Scalable Connectors(Source) Connectors(Sink) Stream Stream Stream Tasks Stream Tasks . . . Hadoop Cluster Cheep Data StorageBI Tools Cassandra Cluster Cassandra Replication spar k spar k spar k spar k spar k spar k 1 2 3 4 5 n 1 2 3 4 5 nCassandra Cluster Write-Heavy Stream Analytics Row Data Cassandra Cluster Analytics
  • 9. Apache Kafka • From LinkedIn, Open Source from 2012 • Service Bus • Small messages (events) • Scalable Broker System • Durable and Distributed • Very fast (parallelism on partitions) • No removes from queue, retention • Streaming processing capability LinkedIn: • 1400 brokers • 13M+ messages/sec • 2.75GB per second
  • 10. Brokers, Topics • Distributed Service Bus • Broker as virtual servers • Topics as logical data storage
  • 11. Writes and reads • Append Only • Commit log • Consumer offset (from beginning) • Commit read to Kafka topic _consumer_offsets • Commit when read data • Retention period 7 days • Parallel read with consumer groups • Confluent Platform
  • 12. Data Sources Enterprise Analytic Suite Data Destination Clickstream Shopping Behaviour Orders Purchases Inventory Routs Catalog Prices Campaigns Customers LocationsDATA LAKE FAST DATA ESB DWH Cost Optimization Profit Maximization Utility Maximization DATA LAKE FAST DATA ESB DWH Data Visualization Dashboard Service Bus Kafka Machine Learning Cluster Kubernetes /OpenShift Kafka Stream API DataConnectorsMapping DataConnectorsMapping Capacity Planing Product Recommendation Customer Segmentation Price Recommendation Routs Optimization Demand Prediction Docker Containers Docker Containers Docker Containers Docker Containers Docker Containers Docker Containers ML Cluster on Streaming Data
  • 13. Apache Cassandra • Multi-master, low-latency, shared nothing • Distributed • No single point of failure • Linearly Scalable • Multi-datacenter configuration • AP with tunable consistency
  • 14. Nodes and distributions • Data Centers and Racks, Gossip (each 1 sec) • Distributed by Tokens from -2^63 to 2^63-1 • Hash from partition key. Murmur3 • Virtual Nodes
  • 15. Coordinator Nodes • Replication Strategy (SimpleStrategy, NetworkTopologyStrategy) • Replication factor (usually 3) • Consistency Levels (One, Two, Three, Any, All, Quorum, Local_Quorum, Local_One…) • Tunable consistency, strong and eventural • (R +W) > N
  • 16. Cassandra Objects Column, which is a name/value pair Row, which is a container for columns referenced by a primary key Table, which is a container for rows Keyspace, which is a container for tables Cluster, which is a container for keyspaces that spans one or more nodes CQL (Cassandra Query Language) CREATE TABLE loads ( machine inet, cpu int, mtime timeuuid, load float, PRIMARY KEY ((machine, cpu), mtime) ) WITH CLUSTERING ORDER BY (mtime DESC);
  • 17. Internals Memtable corresponds to CQL Tale Commit Logs all data for data restoring SSTables for data in immutable saves Key Caches caching map of partitions keys Row Caches for read access speed up Hints for write request for failed nodes Tombstones for deleted rows TTL for deleting rows Updates are Inserts Inserts are Updates Compaction for merging SSTables
  • 18. CQL (Cassandra Query Language) • Similar to SQL • No Joins • Keyspaces with replication factor • Inserts vs Updates • TTL INSERT INTO myTable (id, myField) VALUES (2, 9) USING TTL 86400; /*24H*/ • DELETE is INSERT • Ordering and Filtering is not working sometimes (always use partition key) /* Select Data within a range */ SELECT * FROM myTable WHERE myField > 5000 AND myField < 100000; Bad Request: Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING.
  • 20. Cassandra is not A Data Lake A Data Ocean A Data Pond A Data Warehouse A In-memory Database A Key-value store A Magic database unicorn that fairs rainbow
  • 21. Data Center Cassandra + Spark Cluster Unstructed and Structured Data Operational Data Spark SQL Dashboards (Pentaho) Model Traning Framework (Python, Anaconda, jupyter BI Tools (Power BI) Hadoop Cluster Scalable DWH ML Results Historial Data HDFS Spark Hive Redshift / Postgres XL Enterprise integration Enterprise Applications ML ResultsESB (WSO2) ScoredData Kubernetes Cluster Transformation Rules REDIS Trained Models REDIS Kafka Sreams API Transformation Kafka Sreams API ML Scoring REST API (Flask) Trained Models (Python) Raw Data Stream Srructured Data Strem Scored Data Stream Kafka Cluster Confluent Schema Regestry ConfluentConnectors ConfluentConnectors Casfcation, Forecasting, Clusterization UnstructuredData(Tex) StructuredData Kubernetes Cluster Unstructured Data (Text) Structured Data Events Producers Crawl/Fetch App Producers Crawl/Fetch App Producers Crawl/Fetch App Internet Cassandra in ML Cluster
  • 22. Cluster and results Kafka Cluster Nodes: 6 Amazon instance type: m4.2xlarge CPU: 8 Memory: 32 Gb SSD: 100Gb Topics: 3 Partitions: 6 Replication Factor: 3 Producers: 6 Average message size: 1Kb 440 000 messages / second Cassandra Cluster Nodes: 12 Amazon instance type: m4.2xlarge CPU: 8 Memory: 32 Gb SSD: 800Gb Replication Factor: 3 Average write latency: 9 ms Average read latency: 52 ms
  • 23. Lessons learned Design you DB carefully at the beginning for queries Cassandra is not RDBMS, select by partition keys Deep understanding of internals Compaction is Hell. Eventual consistency. Were is my Disk Space? Very Expensive! (lots of nodes)