SlideShare a Scribd company logo
1 of 40
Real-Time Big Data with
Storm, Kafka and GigaSpaces.
Building own Real-Time
Google Analytics
Oleksiy Dyagilev
Lead Software Engineer
Epam Systems
Real-time
• must guarantee response
within strict time constraints
• deadline must be
met regardless of system load
ABS (anti-lock brakes )
railway switching system
chess playing program
Near Real-time
• time delay introduced by
data processing or
network transmission
• "no significant delays"
video streaming
analytical applications
Big Data
• data sets so large and complex
that it becomes difficult to
process using traditional data
processing applications
• volume (terabytes, petabytes)
• velocity (speed of data in and
out)
• variety (various data sources,
structured and unstructured)
IoT(Internet of Things)
mobile devices
sensor data (meteo, genomics, geo, bio, etc)
social media
Internet search
user activity tracking
software logs
Typical design
(simplified lambda architecture)
Questions?
Kafka, not Franz
A high-throughput distributed messaging
system
• fast, O(1) persistence
• scalable
• durable
• distributed by design
• originally developed by LinkedIn
• written in Scala
Kafka. Commit log.
• ordered, immutable sequence of
messages that is continually
appended to
• retention
• read offset controlled by consumer
• partitions distributed over cluster
• replication (leader, followers)
• messages load balanced by
message key or in round-robin
Kafka. Internals
• disks are slow! .. oh wait ...
• random access vs sequential matters a lot
• 6*7200rpm SATA RAID-5 array. Random writes - 100k/sec. Sequential -
600MB/sec [1]
• according to ACM Queue article in some cases seq. write of eight 15000
rpm SAS disks in RAID-5 can be faster than memory random access (mind
artice date!)
Kafka. Internals
• OS pagecache, read-ahead, write-behind
• flush after K seconds or N messages
• no need to delete data(comparing to in-memory solutions)
• no need to keep state what has been consumed (controlled by consumer,
one integer per partition in ZK)
• no GC penalties
• no overhead for JVM objects
• batching
• end-to-end batch compression
• linux sendfile() system call: eliminate context switches and memory copy
(nio.FileChannel.transferTo())
Zero copy
each time data traverses the user-kernel boundary, it must be copied, which
consumes CPU cycles and memory bandwidth. Benchmark shows that time
reduced in more than 2x with zero copy.
java.nio.channels.FileChannel.transferTo() available on linux
traditional approach Zero copy
Kafka. Ordering guarantees.
• Messages sent by a producer to a particular topic partition will be
appended in the order they are sent.
• Messages delivered asynchronously to consumers, so may arrive out of
order on different consumers
• Kafka assigns partition to consumer to guarantee ordering within
partition, so that partition consumed by 1 consumer in the group.
• No global ordering across partitions, the only way is to have single
partition and single consumer which doesn't scale.
Kafka. Delivery semantics
Producer:
• synchronous
• asynchronous
• wait for replication or not
To guarantee exactly-once:
• include PK and deduplicate on consumer
• or single writer per partition and check last message in case of network error
Consumer:
• at least once: read, process, commit position.
• at most once: read, commit position, process.
• exactly-once: need application level logic, keep offset together with data
LinkedIn benchmarksEnvironment:
• 6 machines: Kafka on 3, other 3 for Zookeeper and client
• Intel Xeon 2.5 GHz processor with six cores
• Six 7200 RPM SATA drives, JBOD - no RAID
• 32GB of RAM
• 1Gb Ethernet
• 6 partitions, record 100 bytes
Result:
• 1 producer, no replica - 821,557 records/sec (78.3 MB/sec)
• 1 producer, 3x async replica - 786,980 records/sec (75.1 MB/sec)
• 1 producer, 3x sync replica - 421,823 records/sec (40.2 MB/sec)
• 3 producers, 3x async replication - 2,024,032 records/sec (193.0 MB/sec)
• 1 consumer - 940,521 records/sec (89.7 MB/sec)
• 3 consumers - 2,615,968 records/sec (249.5 MB/sec)
• 1 producer and 1 consumer - 795,064 records/sec (75.8 MB/sec)
Questions?
Apache Storm
Free and open source distributed realtime computation system. Storm
makes it easy to reliably process unbounded streams of data, doing for
realtime processing what Hadoop did for batch processing
• scalable
• fault-tolerant
• guarantees data will be processed
• originally written by Nathan Marz in Java/Clojure, then adopted by Twitter
• now in Apache incubator
• used by dozens of companies
• active community
Storm in a nutshell
• Topology - computation graph
• Tuple - unit of data, sequence of fields
• Stream - unbounded sequence of tuples
• Spout - input data source
• Bolt - processing node
• Stream grouping (field, shuffle, all, global)
• DRPC - distributed RPC, ad-hoc queries
• Trident - framework on top of Storm for stateful, incremental processing on
top of persistence store
Storm cluster.
• Nimbus distributes code around cluster. SPOF.
Stateless, fail-fast
• Supervisor starts/stops workers. Stateless,
fail-fast
• Worker executes subset of topology
• coordination between Nimbus and Supervisor
is done through Zookeeper
Word Counter. Topology
Word Counter. Topology
class SplitSentence extends BaseBasicBolt {
@Override
public void execute(Tuple tuple, BasicOutputCollector collector) {
String string = tuple.getString(0);
for (String word : string.split(" ")) {
collector.emit(new Values(word));
}
}
@Override
public void declareOutputFields(OutputFieldsDeclarer declarer) {
declarer.declare(new Fields("word"));
}
}
Storm. Word counter.
class WordCount extends BaseBasicBolt {
Map<String, Integer> counts = new HashMap<>();
@Override
public void execute(Tuple tuple, BasicOutputCollector collector) {
String word = tuple.getString(0);
Integer count = counts.get(word);
if (count == null)
count = 0;
count++;
counts.put(word, count);
collector.emit(new Values(word, count));
}
@Override
public void declareOutputFields(OutputFieldsDeclarer declarer) {
declarer.declare(new Fields("word", "count"));
}
}
Storm. Word counter.
public static void main(String[] args) throws Exception {
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("spout", new RandomSentenceSpout(), 5);
builder.setBolt("split", new SplitSentence(), 8).shuffleGrouping("spout");
builder.setBolt("count", new WordCount(), 12).fieldsGrouping("split", new Fields("word"));
Config conf = new Config();
conf.setNumWorkers(3);
StormSubmitter.submitTopologyWithProgressBar("word-counter", conf, builder.createTopology());
}
Message processing guarantees
• every tuple will be processed at least once. Use Trident for exactly-once
• when tuple created it's given random 64 bit id
• every tuple knows the ids of all the spout tuples for which it exists in their tuple trees(information copied from anchors when new tuple emitted in bolt)
• when a tuple is acked, it sends a message to the appropriate acker tasks with information about how the tuple tree changed
• mod hashing used to map a spout tuple id to an acker task
• acker task stores a map from a spout tuple id to a pair of values (spout id, ack val). Ack val is 64 bit = XOR of all tuple ids, represents state of the tree
• ack val = 0 means that tree is fully processed
• at 10K acks per second, it will take 50,000,000 years until a mistake is made. Will cause data loss only if tuple fails
Trident
Trident is a high-level abstraction for doing stateful,
incremental processing on top of persistence store
• tuples are processed as small batches
• exactly-once processing
• “transactional” datastore persistence
• functional API: joins, aggregations, grouping, functions, and filters
• compiles into as efficient of a Storm topology as possible
Word counter with Trident
TridentState wordCounts = topology
.newStream("spout1", spout).parallelismHint(16)
.each(new Fields("sentence"), new Split(), new Fields("word"))
.groupBy(new Fields("word"))
.persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count")).parallelismHint(16);
topology.newDRPCStream("words", drpc)
.each(new Fields("args"), new Split(), new Fields("word"))
.groupBy(new Fields("word"))
.stateQuery(wordCounts, new Fields("word"), new MapGet(), new Fields("count"))
.each(new Fields("count"), new FilterNull())
.aggregate(new Fields("count"), new Sum(), new Fields("sum"));
Trident topology compilation
Trident topology compilation
How Trident guarantees exactly-once
semantics?
• Each batch of tuples is given a unique id called the “transaction id” (txid). If the batch is replayed, it
is given the exact same txid.
• State updates are ordered among batches. That is, the state updates for batch 3 won’t be applied
until the state updates for batch 2 have succeeded. Note: pipelining
Consider 'transactional' spout:
1. Batches for a given txid are always the same. Replays of batches for a txid will exact same set of
tuples as the first time that batch was emitted for that txid.
2. There’s no overlap between batches of tuples (tuples are in one batch or another, never
multiple).
3. Every tuple is in a batch (no tuples are skipped)
Transactional State
Questions?
Realtime Google Analytics
highly scalable equivalent of Realtime Google Analytics on top of Storm and GigaSpaces.
Application can be deployed to cloud with one click using Cloudify
Code available on github https://github.com/fe2s/xap-storm
Live demo
• web
• xap
• storm ui
• cloudify
High-level architecture
{
“sessionId”: “sessionid581239234”,
“referral”: “https://www.google.com/#q=gigaspace”,
“page”: “http://www.gigaspaces.com/about”,
“ip”: “89.162.139.2”
}
PageView:
Simple Storm spout
Trident Spout
Google Analytics Topology.
Top urls branch
Active users branch
Page view time series
Geo branch
Thanks!
Presentation and detailed blog post available at http://dyagilev.org
Resources:
• Kafka benchmarking
• The Linux Page Cache and pdflush: Theory of Operation and Tuning for Write-Heavy Loads
• Kafka documentation
• The Lambda architecture: principles for architecting realtime Big Data systems
• Efficient data transfer through zero copy
• RabbitMQ Performance Measurements
• GigaSpaces and Storm Integration

More Related Content

What's hot

Improved Reliable Streaming Processing: Apache Storm as example
Improved Reliable Streaming Processing: Apache Storm as exampleImproved Reliable Streaming Processing: Apache Storm as example
Improved Reliable Streaming Processing: Apache Storm as exampleDataWorks Summit/Hadoop Summit
 
Cassandra and Storm at Health Market Sceince
Cassandra and Storm at Health Market SceinceCassandra and Storm at Health Market Sceince
Cassandra and Storm at Health Market SceinceP. Taylor Goetz
 
Time Series Processing with Apache Spark
Time Series Processing with Apache SparkTime Series Processing with Apache Spark
Time Series Processing with Apache SparkJosef Adersberger
 
Probabilistic algorithms for fun and pseudorandom profit
Probabilistic algorithms for fun and pseudorandom profitProbabilistic algorithms for fun and pseudorandom profit
Probabilistic algorithms for fun and pseudorandom profitTyler Treat
 
Realtime processing with storm presentation
Realtime processing with storm presentationRealtime processing with storm presentation
Realtime processing with storm presentationGabriel Eisbruch
 
Querying the Internet of Things: Streaming SQL on Kafka/Samza and Storm/Trident
Querying the Internet of Things: Streaming SQL on Kafka/Samza and Storm/TridentQuerying the Internet of Things: Streaming SQL on Kafka/Samza and Storm/Trident
Querying the Internet of Things: Streaming SQL on Kafka/Samza and Storm/TridentDataWorks Summit/Hadoop Summit
 
Real-Time Integration Between MongoDB and SQL Databases
Real-Time Integration Between MongoDB and SQL Databases Real-Time Integration Between MongoDB and SQL Databases
Real-Time Integration Between MongoDB and SQL Databases MongoDB
 
PHP Backends for Real-Time User Interaction using Apache Storm.
PHP Backends for Real-Time User Interaction using Apache Storm.PHP Backends for Real-Time User Interaction using Apache Storm.
PHP Backends for Real-Time User Interaction using Apache Storm.DECK36
 
Hadoop Summit Europe 2014: Apache Storm Architecture
Hadoop Summit Europe 2014: Apache Storm ArchitectureHadoop Summit Europe 2014: Apache Storm Architecture
Hadoop Summit Europe 2014: Apache Storm ArchitectureP. Taylor Goetz
 
Developing Java Streaming Applications with Apache Storm
Developing Java Streaming Applications with Apache StormDeveloping Java Streaming Applications with Apache Storm
Developing Java Streaming Applications with Apache StormLester Martin
 
GoodFit: Multi-Resource Packing of Tasks with Dependencies
GoodFit: Multi-Resource Packing of Tasks with DependenciesGoodFit: Multi-Resource Packing of Tasks with Dependencies
GoodFit: Multi-Resource Packing of Tasks with DependenciesDataWorks Summit/Hadoop Summit
 
Storm@Twitter, SIGMOD 2014 paper
Storm@Twitter, SIGMOD 2014 paperStorm@Twitter, SIGMOD 2014 paper
Storm@Twitter, SIGMOD 2014 paperKarthik Ramasamy
 
Slide #1:Introduction to Apache Storm
Slide #1:Introduction to Apache StormSlide #1:Introduction to Apache Storm
Slide #1:Introduction to Apache StormMd. Shamsur Rahim
 
OpenTSDB 2.0
OpenTSDB 2.0OpenTSDB 2.0
OpenTSDB 2.0HBaseCon
 
Storm: The Real-Time Layer - GlueCon 2012
Storm: The Real-Time Layer  - GlueCon 2012Storm: The Real-Time Layer  - GlueCon 2012
Storm: The Real-Time Layer - GlueCon 2012Dan Lynn
 

What's hot (20)

Storm Anatomy
Storm AnatomyStorm Anatomy
Storm Anatomy
 
Introduction to Apache Storm
Introduction to Apache StormIntroduction to Apache Storm
Introduction to Apache Storm
 
Improved Reliable Streaming Processing: Apache Storm as example
Improved Reliable Streaming Processing: Apache Storm as exampleImproved Reliable Streaming Processing: Apache Storm as example
Improved Reliable Streaming Processing: Apache Storm as example
 
Cassandra and Storm at Health Market Sceince
Cassandra and Storm at Health Market SceinceCassandra and Storm at Health Market Sceince
Cassandra and Storm at Health Market Sceince
 
Spanner
SpannerSpanner
Spanner
 
Google Spanner
Google SpannerGoogle Spanner
Google Spanner
 
Time Series Processing with Apache Spark
Time Series Processing with Apache SparkTime Series Processing with Apache Spark
Time Series Processing with Apache Spark
 
Probabilistic algorithms for fun and pseudorandom profit
Probabilistic algorithms for fun and pseudorandom profitProbabilistic algorithms for fun and pseudorandom profit
Probabilistic algorithms for fun and pseudorandom profit
 
Realtime processing with storm presentation
Realtime processing with storm presentationRealtime processing with storm presentation
Realtime processing with storm presentation
 
Querying the Internet of Things: Streaming SQL on Kafka/Samza and Storm/Trident
Querying the Internet of Things: Streaming SQL on Kafka/Samza and Storm/TridentQuerying the Internet of Things: Streaming SQL on Kafka/Samza and Storm/Trident
Querying the Internet of Things: Streaming SQL on Kafka/Samza and Storm/Trident
 
Real-Time Integration Between MongoDB and SQL Databases
Real-Time Integration Between MongoDB and SQL Databases Real-Time Integration Between MongoDB and SQL Databases
Real-Time Integration Between MongoDB and SQL Databases
 
PHP Backends for Real-Time User Interaction using Apache Storm.
PHP Backends for Real-Time User Interaction using Apache Storm.PHP Backends for Real-Time User Interaction using Apache Storm.
PHP Backends for Real-Time User Interaction using Apache Storm.
 
Hadoop Summit Europe 2014: Apache Storm Architecture
Hadoop Summit Europe 2014: Apache Storm ArchitectureHadoop Summit Europe 2014: Apache Storm Architecture
Hadoop Summit Europe 2014: Apache Storm Architecture
 
Developing Java Streaming Applications with Apache Storm
Developing Java Streaming Applications with Apache StormDeveloping Java Streaming Applications with Apache Storm
Developing Java Streaming Applications with Apache Storm
 
GoodFit: Multi-Resource Packing of Tasks with Dependencies
GoodFit: Multi-Resource Packing of Tasks with DependenciesGoodFit: Multi-Resource Packing of Tasks with Dependencies
GoodFit: Multi-Resource Packing of Tasks with Dependencies
 
Storm@Twitter, SIGMOD 2014 paper
Storm@Twitter, SIGMOD 2014 paperStorm@Twitter, SIGMOD 2014 paper
Storm@Twitter, SIGMOD 2014 paper
 
Slide #1:Introduction to Apache Storm
Slide #1:Introduction to Apache StormSlide #1:Introduction to Apache Storm
Slide #1:Introduction to Apache Storm
 
OpenTSDB 2.0
OpenTSDB 2.0OpenTSDB 2.0
OpenTSDB 2.0
 
Spanner osdi2012
Spanner osdi2012Spanner osdi2012
Spanner osdi2012
 
Storm: The Real-Time Layer - GlueCon 2012
Storm: The Real-Time Layer  - GlueCon 2012Storm: The Real-Time Layer  - GlueCon 2012
Storm: The Real-Time Layer - GlueCon 2012
 

Viewers also liked

Webinar Google Analytics Real Time MA 22-11-11
Webinar Google Analytics Real Time MA 22-11-11Webinar Google Analytics Real Time MA 22-11-11
Webinar Google Analytics Real Time MA 22-11-11Watt
 
Real-Time Analytics with Solr: Presented by Yonik Seeley, Cloudera
Real-Time Analytics with Solr: Presented by Yonik Seeley, ClouderaReal-Time Analytics with Solr: Presented by Yonik Seeley, Cloudera
Real-Time Analytics with Solr: Presented by Yonik Seeley, ClouderaLucidworks
 
Search Analytics Component: Presented by Steven Bower, Bloomberg L.P.
Search Analytics Component: Presented by Steven Bower, Bloomberg L.P.Search Analytics Component: Presented by Steven Bower, Bloomberg L.P.
Search Analytics Component: Presented by Steven Bower, Bloomberg L.P.Lucidworks
 
A Cassandra + Solr + Spark Love Triangle Using DataStax Enterprise
A Cassandra + Solr + Spark Love Triangle Using DataStax EnterpriseA Cassandra + Solr + Spark Love Triangle Using DataStax Enterprise
A Cassandra + Solr + Spark Love Triangle Using DataStax EnterprisePatrick McFadin
 
How We Used Cassandra/Solr to Build Real-Time Analytics Platform
How We Used Cassandra/Solr to Build Real-Time Analytics PlatformHow We Used Cassandra/Solr to Build Real-Time Analytics Platform
How We Used Cassandra/Solr to Build Real-Time Analytics PlatformDataStax Academy
 
Getting Started with Real-time Analytics
Getting Started with Real-time AnalyticsGetting Started with Real-time Analytics
Getting Started with Real-time AnalyticsAmazon Web Services
 
Data Engineering with Solr and Spark
Data Engineering with Solr and SparkData Engineering with Solr and Spark
Data Engineering with Solr and SparkLucidworks
 
Building a real time big data analytics platform with solr
Building a real time big data analytics platform with solrBuilding a real time big data analytics platform with solr
Building a real time big data analytics platform with solrTrey Grainger
 
Building Intelligent Search Applications with Apache Solr and PHP5
Building Intelligent Search Applications with Apache Solr and PHP5Building Intelligent Search Applications with Apache Solr and PHP5
Building Intelligent Search Applications with Apache Solr and PHP5israelekpo
 
An Introduction to Distributed Search with Cassandra and Solr
An Introduction to Distributed Search with Cassandra and SolrAn Introduction to Distributed Search with Cassandra and Solr
An Introduction to Distributed Search with Cassandra and SolrDataStax Academy
 

Viewers also liked (10)

Webinar Google Analytics Real Time MA 22-11-11
Webinar Google Analytics Real Time MA 22-11-11Webinar Google Analytics Real Time MA 22-11-11
Webinar Google Analytics Real Time MA 22-11-11
 
Real-Time Analytics with Solr: Presented by Yonik Seeley, Cloudera
Real-Time Analytics with Solr: Presented by Yonik Seeley, ClouderaReal-Time Analytics with Solr: Presented by Yonik Seeley, Cloudera
Real-Time Analytics with Solr: Presented by Yonik Seeley, Cloudera
 
Search Analytics Component: Presented by Steven Bower, Bloomberg L.P.
Search Analytics Component: Presented by Steven Bower, Bloomberg L.P.Search Analytics Component: Presented by Steven Bower, Bloomberg L.P.
Search Analytics Component: Presented by Steven Bower, Bloomberg L.P.
 
A Cassandra + Solr + Spark Love Triangle Using DataStax Enterprise
A Cassandra + Solr + Spark Love Triangle Using DataStax EnterpriseA Cassandra + Solr + Spark Love Triangle Using DataStax Enterprise
A Cassandra + Solr + Spark Love Triangle Using DataStax Enterprise
 
How We Used Cassandra/Solr to Build Real-Time Analytics Platform
How We Used Cassandra/Solr to Build Real-Time Analytics PlatformHow We Used Cassandra/Solr to Build Real-Time Analytics Platform
How We Used Cassandra/Solr to Build Real-Time Analytics Platform
 
Getting Started with Real-time Analytics
Getting Started with Real-time AnalyticsGetting Started with Real-time Analytics
Getting Started with Real-time Analytics
 
Data Engineering with Solr and Spark
Data Engineering with Solr and SparkData Engineering with Solr and Spark
Data Engineering with Solr and Spark
 
Building a real time big data analytics platform with solr
Building a real time big data analytics platform with solrBuilding a real time big data analytics platform with solr
Building a real time big data analytics platform with solr
 
Building Intelligent Search Applications with Apache Solr and PHP5
Building Intelligent Search Applications with Apache Solr and PHP5Building Intelligent Search Applications with Apache Solr and PHP5
Building Intelligent Search Applications with Apache Solr and PHP5
 
An Introduction to Distributed Search with Cassandra and Solr
An Introduction to Distributed Search with Cassandra and SolrAn Introduction to Distributed Search with Cassandra and Solr
An Introduction to Distributed Search with Cassandra and Solr
 

Similar to Real-Time Big Data with Storm, Kafka and GigaSpaces

Real-Time Streaming with Apache Spark Streaming and Apache Storm
Real-Time Streaming with Apache Spark Streaming and Apache StormReal-Time Streaming with Apache Spark Streaming and Apache Storm
Real-Time Streaming with Apache Spark Streaming and Apache StormDavorin Vukelic
 
introduction to data processing using Hadoop and Pig
introduction to data processing using Hadoop and Pigintroduction to data processing using Hadoop and Pig
introduction to data processing using Hadoop and PigRicardo Varela
 
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...DataWorks Summit/Hadoop Summit
 
High Performance, Scalable MongoDB in a Bare Metal Cloud
High Performance, Scalable MongoDB in a Bare Metal CloudHigh Performance, Scalable MongoDB in a Bare Metal Cloud
High Performance, Scalable MongoDB in a Bare Metal CloudMongoDB
 
Distributed real time stream processing- why and how
Distributed real time stream processing- why and howDistributed real time stream processing- why and how
Distributed real time stream processing- why and howPetr Zapletal
 
Computing Outside The Box June 2009
Computing Outside The Box June 2009Computing Outside The Box June 2009
Computing Outside The Box June 2009Ian Foster
 
Natural Language Processing with CNTK and Apache Spark with Ali Zaidi
Natural Language Processing with CNTK and Apache Spark with Ali ZaidiNatural Language Processing with CNTK and Apache Spark with Ali Zaidi
Natural Language Processing with CNTK and Apache Spark with Ali ZaidiDatabricks
 
Open Security Operations Center - OpenSOC
Open Security Operations Center - OpenSOCOpen Security Operations Center - OpenSOC
Open Security Operations Center - OpenSOCSheetal Dolas
 
Hadoop Tutorial with @techmilind
Hadoop Tutorial with @techmilindHadoop Tutorial with @techmilind
Hadoop Tutorial with @techmilindEMC
 
Computing Outside The Box September 2009
Computing Outside The Box September 2009Computing Outside The Box September 2009
Computing Outside The Box September 2009Ian Foster
 
Software architecture for data applications
Software architecture for data applicationsSoftware architecture for data applications
Software architecture for data applicationsDing Li
 
Productionizing your Streaming Jobs
Productionizing your Streaming JobsProductionizing your Streaming Jobs
Productionizing your Streaming JobsDatabricks
 
Strata Singapore: Gearpump Real time DAG-Processing with Akka at Scale
Strata Singapore: GearpumpReal time DAG-Processing with Akka at ScaleStrata Singapore: GearpumpReal time DAG-Processing with Akka at Scale
Strata Singapore: Gearpump Real time DAG-Processing with Akka at ScaleSean Zhong
 
Flink Streaming Hadoop Summit San Jose
Flink Streaming Hadoop Summit San JoseFlink Streaming Hadoop Summit San Jose
Flink Streaming Hadoop Summit San JoseKostas Tzoumas
 
Distributed Real-Time Stream Processing: Why and How 2.0
Distributed Real-Time Stream Processing:  Why and How 2.0Distributed Real-Time Stream Processing:  Why and How 2.0
Distributed Real-Time Stream Processing: Why and How 2.0Petr Zapletal
 

Similar to Real-Time Big Data with Storm, Kafka and GigaSpaces (20)

Real-Time Streaming with Apache Spark Streaming and Apache Storm
Real-Time Streaming with Apache Spark Streaming and Apache StormReal-Time Streaming with Apache Spark Streaming and Apache Storm
Real-Time Streaming with Apache Spark Streaming and Apache Storm
 
The Future of Apache Storm
The Future of Apache StormThe Future of Apache Storm
The Future of Apache Storm
 
Yahoo compares Storm and Spark
Yahoo compares Storm and SparkYahoo compares Storm and Spark
Yahoo compares Storm and Spark
 
introduction to data processing using Hadoop and Pig
introduction to data processing using Hadoop and Pigintroduction to data processing using Hadoop and Pig
introduction to data processing using Hadoop and Pig
 
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
 
So you think you can stream.pptx
So you think you can stream.pptxSo you think you can stream.pptx
So you think you can stream.pptx
 
High Performance, Scalable MongoDB in a Bare Metal Cloud
High Performance, Scalable MongoDB in a Bare Metal CloudHigh Performance, Scalable MongoDB in a Bare Metal Cloud
High Performance, Scalable MongoDB in a Bare Metal Cloud
 
Distributed real time stream processing- why and how
Distributed real time stream processing- why and howDistributed real time stream processing- why and how
Distributed real time stream processing- why and how
 
Computing Outside The Box June 2009
Computing Outside The Box June 2009Computing Outside The Box June 2009
Computing Outside The Box June 2009
 
Natural Language Processing with CNTK and Apache Spark with Ali Zaidi
Natural Language Processing with CNTK and Apache Spark with Ali ZaidiNatural Language Processing with CNTK and Apache Spark with Ali Zaidi
Natural Language Processing with CNTK and Apache Spark with Ali Zaidi
 
Apache Spark
Apache SparkApache Spark
Apache Spark
 
Open Security Operations Center - OpenSOC
Open Security Operations Center - OpenSOCOpen Security Operations Center - OpenSOC
Open Security Operations Center - OpenSOC
 
Hadoop Tutorial with @techmilind
Hadoop Tutorial with @techmilindHadoop Tutorial with @techmilind
Hadoop Tutorial with @techmilind
 
Computing Outside The Box September 2009
Computing Outside The Box September 2009Computing Outside The Box September 2009
Computing Outside The Box September 2009
 
Software architecture for data applications
Software architecture for data applicationsSoftware architecture for data applications
Software architecture for data applications
 
Productionizing your Streaming Jobs
Productionizing your Streaming JobsProductionizing your Streaming Jobs
Productionizing your Streaming Jobs
 
Strata Singapore: Gearpump Real time DAG-Processing with Akka at Scale
Strata Singapore: GearpumpReal time DAG-Processing with Akka at ScaleStrata Singapore: GearpumpReal time DAG-Processing with Akka at Scale
Strata Singapore: Gearpump Real time DAG-Processing with Akka at Scale
 
Flink Streaming Hadoop Summit San Jose
Flink Streaming Hadoop Summit San JoseFlink Streaming Hadoop Summit San Jose
Flink Streaming Hadoop Summit San Jose
 
Cisco OpenSOC
Cisco OpenSOCCisco OpenSOC
Cisco OpenSOC
 
Distributed Real-Time Stream Processing: Why and How 2.0
Distributed Real-Time Stream Processing:  Why and How 2.0Distributed Real-Time Stream Processing:  Why and How 2.0
Distributed Real-Time Stream Processing: Why and How 2.0
 

Recently uploaded

Processing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptxProcessing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptxpranjaldaimarysona
 
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...Soham Mondal
 
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSAPPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSKurinjimalarL3
 
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Serviceranjana rawat
 
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130Suhani Kapoor
 
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).pptssuser5c9d4b1
 
Introduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxIntroduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxupamatechverse
 
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptxthe ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptxhumanexperienceaaa
 
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Dr.Costas Sachpazis
 
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCollege Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCall Girls in Nagpur High Profile
 
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVHARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVRajaP95
 
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINEMANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINESIVASHANKAR N
 
UNIT-III FMM. DIMENSIONAL ANALYSIS
UNIT-III FMM.        DIMENSIONAL ANALYSISUNIT-III FMM.        DIMENSIONAL ANALYSIS
UNIT-III FMM. DIMENSIONAL ANALYSISrknatarajan
 
Coefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxCoefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxAsutosh Ranjan
 
SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )Tsuyoshi Horigome
 
UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and workingUNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and workingrknatarajan
 

Recently uploaded (20)

Processing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptxProcessing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptx
 
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
 
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCRCall Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
 
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
 
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSAPPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
 
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
 
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
 
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
 
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINEDJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
 
Introduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxIntroduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptx
 
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptxthe ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
 
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
 
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCollege Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
 
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVHARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
 
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINEMANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
 
UNIT-III FMM. DIMENSIONAL ANALYSIS
UNIT-III FMM.        DIMENSIONAL ANALYSISUNIT-III FMM.        DIMENSIONAL ANALYSIS
UNIT-III FMM. DIMENSIONAL ANALYSIS
 
Coefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxCoefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptx
 
SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )
 
UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and workingUNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
 

Real-Time Big Data with Storm, Kafka and GigaSpaces

  • 1. Real-Time Big Data with Storm, Kafka and GigaSpaces. Building own Real-Time Google Analytics Oleksiy Dyagilev Lead Software Engineer Epam Systems
  • 2. Real-time • must guarantee response within strict time constraints • deadline must be met regardless of system load ABS (anti-lock brakes ) railway switching system chess playing program Near Real-time • time delay introduced by data processing or network transmission • "no significant delays" video streaming analytical applications
  • 3. Big Data • data sets so large and complex that it becomes difficult to process using traditional data processing applications • volume (terabytes, petabytes) • velocity (speed of data in and out) • variety (various data sources, structured and unstructured) IoT(Internet of Things) mobile devices sensor data (meteo, genomics, geo, bio, etc) social media Internet search user activity tracking software logs
  • 6. Kafka, not Franz A high-throughput distributed messaging system • fast, O(1) persistence • scalable • durable • distributed by design • originally developed by LinkedIn • written in Scala
  • 7. Kafka. Commit log. • ordered, immutable sequence of messages that is continually appended to • retention • read offset controlled by consumer • partitions distributed over cluster • replication (leader, followers) • messages load balanced by message key or in round-robin
  • 8. Kafka. Internals • disks are slow! .. oh wait ... • random access vs sequential matters a lot • 6*7200rpm SATA RAID-5 array. Random writes - 100k/sec. Sequential - 600MB/sec [1] • according to ACM Queue article in some cases seq. write of eight 15000 rpm SAS disks in RAID-5 can be faster than memory random access (mind artice date!)
  • 9. Kafka. Internals • OS pagecache, read-ahead, write-behind • flush after K seconds or N messages • no need to delete data(comparing to in-memory solutions) • no need to keep state what has been consumed (controlled by consumer, one integer per partition in ZK) • no GC penalties • no overhead for JVM objects • batching • end-to-end batch compression • linux sendfile() system call: eliminate context switches and memory copy (nio.FileChannel.transferTo())
  • 10. Zero copy each time data traverses the user-kernel boundary, it must be copied, which consumes CPU cycles and memory bandwidth. Benchmark shows that time reduced in more than 2x with zero copy. java.nio.channels.FileChannel.transferTo() available on linux traditional approach Zero copy
  • 11. Kafka. Ordering guarantees. • Messages sent by a producer to a particular topic partition will be appended in the order they are sent. • Messages delivered asynchronously to consumers, so may arrive out of order on different consumers • Kafka assigns partition to consumer to guarantee ordering within partition, so that partition consumed by 1 consumer in the group. • No global ordering across partitions, the only way is to have single partition and single consumer which doesn't scale.
  • 12. Kafka. Delivery semantics Producer: • synchronous • asynchronous • wait for replication or not To guarantee exactly-once: • include PK and deduplicate on consumer • or single writer per partition and check last message in case of network error Consumer: • at least once: read, process, commit position. • at most once: read, commit position, process. • exactly-once: need application level logic, keep offset together with data
  • 13. LinkedIn benchmarksEnvironment: • 6 machines: Kafka on 3, other 3 for Zookeeper and client • Intel Xeon 2.5 GHz processor with six cores • Six 7200 RPM SATA drives, JBOD - no RAID • 32GB of RAM • 1Gb Ethernet • 6 partitions, record 100 bytes Result: • 1 producer, no replica - 821,557 records/sec (78.3 MB/sec) • 1 producer, 3x async replica - 786,980 records/sec (75.1 MB/sec) • 1 producer, 3x sync replica - 421,823 records/sec (40.2 MB/sec) • 3 producers, 3x async replication - 2,024,032 records/sec (193.0 MB/sec) • 1 consumer - 940,521 records/sec (89.7 MB/sec) • 3 consumers - 2,615,968 records/sec (249.5 MB/sec) • 1 producer and 1 consumer - 795,064 records/sec (75.8 MB/sec)
  • 15. Apache Storm Free and open source distributed realtime computation system. Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing • scalable • fault-tolerant • guarantees data will be processed • originally written by Nathan Marz in Java/Clojure, then adopted by Twitter • now in Apache incubator • used by dozens of companies • active community
  • 16. Storm in a nutshell • Topology - computation graph • Tuple - unit of data, sequence of fields • Stream - unbounded sequence of tuples • Spout - input data source • Bolt - processing node • Stream grouping (field, shuffle, all, global) • DRPC - distributed RPC, ad-hoc queries • Trident - framework on top of Storm for stateful, incremental processing on top of persistence store
  • 17. Storm cluster. • Nimbus distributes code around cluster. SPOF. Stateless, fail-fast • Supervisor starts/stops workers. Stateless, fail-fast • Worker executes subset of topology • coordination between Nimbus and Supervisor is done through Zookeeper
  • 19. Word Counter. Topology class SplitSentence extends BaseBasicBolt { @Override public void execute(Tuple tuple, BasicOutputCollector collector) { String string = tuple.getString(0); for (String word : string.split(" ")) { collector.emit(new Values(word)); } } @Override public void declareOutputFields(OutputFieldsDeclarer declarer) { declarer.declare(new Fields("word")); } }
  • 20. Storm. Word counter. class WordCount extends BaseBasicBolt { Map<String, Integer> counts = new HashMap<>(); @Override public void execute(Tuple tuple, BasicOutputCollector collector) { String word = tuple.getString(0); Integer count = counts.get(word); if (count == null) count = 0; count++; counts.put(word, count); collector.emit(new Values(word, count)); } @Override public void declareOutputFields(OutputFieldsDeclarer declarer) { declarer.declare(new Fields("word", "count")); } }
  • 21. Storm. Word counter. public static void main(String[] args) throws Exception { TopologyBuilder builder = new TopologyBuilder(); builder.setSpout("spout", new RandomSentenceSpout(), 5); builder.setBolt("split", new SplitSentence(), 8).shuffleGrouping("spout"); builder.setBolt("count", new WordCount(), 12).fieldsGrouping("split", new Fields("word")); Config conf = new Config(); conf.setNumWorkers(3); StormSubmitter.submitTopologyWithProgressBar("word-counter", conf, builder.createTopology()); }
  • 22. Message processing guarantees • every tuple will be processed at least once. Use Trident for exactly-once • when tuple created it's given random 64 bit id • every tuple knows the ids of all the spout tuples for which it exists in their tuple trees(information copied from anchors when new tuple emitted in bolt) • when a tuple is acked, it sends a message to the appropriate acker tasks with information about how the tuple tree changed • mod hashing used to map a spout tuple id to an acker task • acker task stores a map from a spout tuple id to a pair of values (spout id, ack val). Ack val is 64 bit = XOR of all tuple ids, represents state of the tree • ack val = 0 means that tree is fully processed • at 10K acks per second, it will take 50,000,000 years until a mistake is made. Will cause data loss only if tuple fails
  • 23. Trident Trident is a high-level abstraction for doing stateful, incremental processing on top of persistence store • tuples are processed as small batches • exactly-once processing • “transactional” datastore persistence • functional API: joins, aggregations, grouping, functions, and filters • compiles into as efficient of a Storm topology as possible
  • 24. Word counter with Trident TridentState wordCounts = topology .newStream("spout1", spout).parallelismHint(16) .each(new Fields("sentence"), new Split(), new Fields("word")) .groupBy(new Fields("word")) .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count")).parallelismHint(16); topology.newDRPCStream("words", drpc) .each(new Fields("args"), new Split(), new Fields("word")) .groupBy(new Fields("word")) .stateQuery(wordCounts, new Fields("word"), new MapGet(), new Fields("count")) .each(new Fields("count"), new FilterNull()) .aggregate(new Fields("count"), new Sum(), new Fields("sum"));
  • 27. How Trident guarantees exactly-once semantics? • Each batch of tuples is given a unique id called the “transaction id” (txid). If the batch is replayed, it is given the exact same txid. • State updates are ordered among batches. That is, the state updates for batch 3 won’t be applied until the state updates for batch 2 have succeeded. Note: pipelining Consider 'transactional' spout: 1. Batches for a given txid are always the same. Replays of batches for a txid will exact same set of tuples as the first time that batch was emitted for that txid. 2. There’s no overlap between batches of tuples (tuples are in one batch or another, never multiple). 3. Every tuple is in a batch (no tuples are skipped)
  • 30. Realtime Google Analytics highly scalable equivalent of Realtime Google Analytics on top of Storm and GigaSpaces. Application can be deployed to cloud with one click using Cloudify Code available on github https://github.com/fe2s/xap-storm
  • 31. Live demo • web • xap • storm ui • cloudify
  • 32. High-level architecture { “sessionId”: “sessionid581239234”, “referral”: “https://www.google.com/#q=gigaspace”, “page”: “http://www.gigaspaces.com/about”, “ip”: “89.162.139.2” } PageView:
  • 38. Page view time series
  • 40. Thanks! Presentation and detailed blog post available at http://dyagilev.org Resources: • Kafka benchmarking • The Linux Page Cache and pdflush: Theory of Operation and Tuning for Write-Heavy Loads • Kafka documentation • The Lambda architecture: principles for architecting realtime Big Data systems • Efficient data transfer through zero copy • RabbitMQ Performance Measurements • GigaSpaces and Storm Integration