SlideShare a Scribd company logo
1 of 13
Siphon - Kafka as DataBus in Microsoft
Nitin Kumar (nitin.kumar@Microsoft.com)
Dev Manager, Microsoft
https://www.linkedin.com/in/nikuma
Agenda
• Scale: Kafka at Microsoft (Bing, Ads, Office)
• Use Case: NRT Customer facing reports
• Kafka based Streaming Solution
• Collector
• Consumer Restful APIs
• Monitoring: Canary/Audit Trail
• Production Experience
• Key Takeaways
Wednesday, March 16, 2016
Scale: Kafka at Microsoft (Ads, Bing, Office)
Kafka Brokers 1000+ across 5 Datacenters
Operating System Windows Server 2012 R2
Hardware Spec 12 Cores, 32 GB RAM, 4x2 TB HDD (JBOD), 10 GB Network
Incoming Events 1 million per sec, (90 Billion per day, 100 TB per day)
Outgoing Events 5 million per sec, (1 Trillion per day, 500 TB per day)
Kafka Topics/Partitions 50+/5000+
Kafka version 0.8.1.1 (3 way replication)
Wednesday, March 16, 2016
Problem
Wednesday, March 16, 2016
Serving System{Q}
{R}
Online Fraud
Detection
ML
Classification
Aggregation Reporting DB
Keyword
1.5 hours 2.5 hours
Advertiser
Feature
Extraction
300
GB/h
200+
Features
Stats
25 TB
Log
Collection
Sorting /
Partitioning
What is the click through rate of my ad, that launched at 5pm?
Goals / Design Considerations
Wednesday, March 16, 2016
Reduce latency from 4 hours to 15 minutes
99.8% Log completeness Guarantees
Check pointing & Failure recovery
Exactly Once Semantic
Highly Available, Scalable and rolling upgrade
Reusing Existing C# Libraries
Siphon DataBus
Solution
{Q}
{R}
Kafka Audit
ML
Classification
Aggregation Reporting
DB
Keyword
1-2 sec (Minimize latency) < 15 minutes
Advertiser
Feature
Extraction
100
MBPS
200+
Features
Stats
25 TB
Wednesday, March 16, 2016
Serving System
Online Fraud
Detection
Kafka as a distributed Queue
StreamScope as a distributed processing system
StreamScope
Siphon
Wednesday, March 16, 2016
Asia DC
Zookeeper Canary
Kafka
Collector
Agent
Services Data Pull (Agent)
Services Data Push
Device Proxy Services
Consumer
API (Push/
Pull)
Europe DC
Zookeeper Canary
Kafka
US DC
Zookeeper Canary
Kafka
Streaming
Batch
Audit Trail
Open Source
Microsoft Internal
Siphon
Collector – Data Ingestion (Producers)
• Http(s) Server
• Restful API with SSL support.
• Abstraction from Kafka
internals (Partition, Kafka version)
• Throttling, QPS Monitoring
• PII scrubbing
• Load balancing/failover
Device Proxy Services
Collector
Kafka Brokers
Broker
Broker
Broker
Broker
P0
P1
P2
P3
P4
P5
P6
P7
P8
P9
P10
P11
Collector
Collector
LoadBalancer
Services Data Push
Agent
Services Data Pull (Agent)
Wednesday, March 16, 2016
Open Source
Microsoft Internal
Siphon
Consumer API (Push/Pull)
• Restful Pull API – Simple consumer
• Config driven subscriptions for preconfigured sinks like (HDFS, Cosmos, ELK).
Wednesday, March 16, 2016
Config (ZK)
Executor
Kafka .NET
Library
Kafka
Supported destinations –
• Cosmos
• Elastic Search
• Kafka
• HDFS
High Level
Consumer
Monitoring using Canary, Audit Trail
Device Proxy Services
Collector
Kafka Brokers
Broker
Broker
Broker
Broker
P0
P1
P2
P3
P4
P5
P6
P7
P8
P9
P10
P11
Collector
Collector
LoadBalancer
Services Data Push
Agent
Services Data Pull (Agent)
Wednesday, March 16, 2016
Synthetic
message
Audit Trail
Production Experience
• System in production for 15 months
• End to End Advertiser report latency of 12+ minutes.
• Other use cases from Office, Bing.
• Integration with other streaming systems – Storm, Spark.
• Monitoring using ELK
Wednesday, March 16, 2016
Key Takeaways
• Scale out with Kafka (50K -> 1M -> multi-million Events Per sec)
• Ability to build tunable Auditing/Monitoring
• Producer/Consumer Restful API provides a nice abstraction
• Config driven Pub/Sub system
Wednesday, March 16, 2016
“We are Hiring.”
Thank You
Nitin Kumar (nitin.kumar@Microsoft.com)
https://www.linkedin.com/in/nikuma
Wednesday, March 16, 2016

More Related Content

What's hot

Javier Lopez_Mihail Vieru - Flink in Zalando's World of Microservices - Flink...
Javier Lopez_Mihail Vieru - Flink in Zalando's World of Microservices - Flink...Javier Lopez_Mihail Vieru - Flink in Zalando's World of Microservices - Flink...
Javier Lopez_Mihail Vieru - Flink in Zalando's World of Microservices - Flink...Flink Forward
 
Maximilian Michels - Flink and Beam
Maximilian Michels - Flink and BeamMaximilian Michels - Flink and Beam
Maximilian Michels - Flink and BeamFlink Forward
 
Hadoop summit - Scaling Uber’s Real-Time Infra for Trillion Events per Day
Hadoop summit - Scaling Uber’s Real-Time Infra for  Trillion Events per DayHadoop summit - Scaling Uber’s Real-Time Infra for  Trillion Events per Day
Hadoop summit - Scaling Uber’s Real-Time Infra for Trillion Events per DayAnkur Bansal
 
Simplifying Event Streaming: Tools for Location Transparency and Data Evoluti...
Simplifying Event Streaming: Tools for Location Transparency and Data Evoluti...Simplifying Event Streaming: Tools for Location Transparency and Data Evoluti...
Simplifying Event Streaming: Tools for Location Transparency and Data Evoluti...confluent
 
Kafka Summit SF 2017 - Keynote - Go Against the Flow: Databases and Stream Pr...
Kafka Summit SF 2017 - Keynote - Go Against the Flow: Databases and Stream Pr...Kafka Summit SF 2017 - Keynote - Go Against the Flow: Databases and Stream Pr...
Kafka Summit SF 2017 - Keynote - Go Against the Flow: Databases and Stream Pr...confluent
 
Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...
Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...
Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...HostedbyConfluent
 
Uber Real Time Data Analytics
Uber Real Time Data AnalyticsUber Real Time Data Analytics
Uber Real Time Data AnalyticsAnkur Bansal
 
Apache Kafka and KSQL in Action: Let's Build a Streaming Data Pipeline!
Apache Kafka and KSQL in Action: Let's Build a Streaming Data Pipeline!Apache Kafka and KSQL in Action: Let's Build a Streaming Data Pipeline!
Apache Kafka and KSQL in Action: Let's Build a Streaming Data Pipeline!confluent
 
Kafka Summit NYC 2017 - Every Message Counts: Kafka as a Foundation for Highl...
Kafka Summit NYC 2017 - Every Message Counts: Kafka as a Foundation for Highl...Kafka Summit NYC 2017 - Every Message Counts: Kafka as a Foundation for Highl...
Kafka Summit NYC 2017 - Every Message Counts: Kafka as a Foundation for Highl...confluent
 
Thomas Lamirault_Mohamed Amine Abdessemed -A brief history of time with Apac...
Thomas Lamirault_Mohamed Amine Abdessemed  -A brief history of time with Apac...Thomas Lamirault_Mohamed Amine Abdessemed  -A brief history of time with Apac...
Thomas Lamirault_Mohamed Amine Abdessemed -A brief history of time with Apac...Flink Forward
 
Kafka Summit NYC 2017 - Stream it Together: 3 Realities of Modern Programming
Kafka Summit NYC 2017 - Stream it Together: 3 Realities of Modern ProgrammingKafka Summit NYC 2017 - Stream it Together: 3 Realities of Modern Programming
Kafka Summit NYC 2017 - Stream it Together: 3 Realities of Modern Programmingconfluent
 
Putting the Micro into Microservices with Stateful Stream Processing
Putting the Micro into Microservices with Stateful Stream ProcessingPutting the Micro into Microservices with Stateful Stream Processing
Putting the Micro into Microservices with Stateful Stream Processingconfluent
 
Matching the Scale at Tinder with Kafka
Matching the Scale at Tinder with Kafka Matching the Scale at Tinder with Kafka
Matching the Scale at Tinder with Kafka confluent
 
Kafka Summit SF 2017 - Database Streaming at WePay
Kafka Summit SF 2017 - Database Streaming at WePayKafka Summit SF 2017 - Database Streaming at WePay
Kafka Summit SF 2017 - Database Streaming at WePayconfluent
 
Introducing Apache Kafka's Streams API - Kafka meetup Munich, Jan 25 2017
Introducing Apache Kafka's Streams API - Kafka meetup Munich, Jan 25 2017Introducing Apache Kafka's Streams API - Kafka meetup Munich, Jan 25 2017
Introducing Apache Kafka's Streams API - Kafka meetup Munich, Jan 25 2017Michael Noll
 
Kafka Summit SF 2017 - Query the Application, Not a Database: “Interactive Qu...
Kafka Summit SF 2017 - Query the Application, Not a Database: “Interactive Qu...Kafka Summit SF 2017 - Query the Application, Not a Database: “Interactive Qu...
Kafka Summit SF 2017 - Query the Application, Not a Database: “Interactive Qu...confluent
 
Kafka Summit SF 2017 - Fast Data in Supply Chain Planning
Kafka Summit SF 2017 - Fast Data in Supply Chain PlanningKafka Summit SF 2017 - Fast Data in Supply Chain Planning
Kafka Summit SF 2017 - Fast Data in Supply Chain Planningconfluent
 
You Must Construct Additional Pipelines: Pub-Sub on Kafka at Blizzard
You Must Construct Additional Pipelines: Pub-Sub on Kafka at Blizzard You Must Construct Additional Pipelines: Pub-Sub on Kafka at Blizzard
You Must Construct Additional Pipelines: Pub-Sub on Kafka at Blizzard confluent
 
ETL as a Platform: Pandora Plays Nicely Everywhere with Real-Time Data Pipelines
ETL as a Platform: Pandora Plays Nicely Everywhere with Real-Time Data PipelinesETL as a Platform: Pandora Plays Nicely Everywhere with Real-Time Data Pipelines
ETL as a Platform: Pandora Plays Nicely Everywhere with Real-Time Data Pipelinesconfluent
 
Taking a look under the hood of Apache Flink's relational APIs.
Taking a look under the hood of Apache Flink's relational APIs.Taking a look under the hood of Apache Flink's relational APIs.
Taking a look under the hood of Apache Flink's relational APIs.Fabian Hueske
 

What's hot (20)

Javier Lopez_Mihail Vieru - Flink in Zalando's World of Microservices - Flink...
Javier Lopez_Mihail Vieru - Flink in Zalando's World of Microservices - Flink...Javier Lopez_Mihail Vieru - Flink in Zalando's World of Microservices - Flink...
Javier Lopez_Mihail Vieru - Flink in Zalando's World of Microservices - Flink...
 
Maximilian Michels - Flink and Beam
Maximilian Michels - Flink and BeamMaximilian Michels - Flink and Beam
Maximilian Michels - Flink and Beam
 
Hadoop summit - Scaling Uber’s Real-Time Infra for Trillion Events per Day
Hadoop summit - Scaling Uber’s Real-Time Infra for  Trillion Events per DayHadoop summit - Scaling Uber’s Real-Time Infra for  Trillion Events per Day
Hadoop summit - Scaling Uber’s Real-Time Infra for Trillion Events per Day
 
Simplifying Event Streaming: Tools for Location Transparency and Data Evoluti...
Simplifying Event Streaming: Tools for Location Transparency and Data Evoluti...Simplifying Event Streaming: Tools for Location Transparency and Data Evoluti...
Simplifying Event Streaming: Tools for Location Transparency and Data Evoluti...
 
Kafka Summit SF 2017 - Keynote - Go Against the Flow: Databases and Stream Pr...
Kafka Summit SF 2017 - Keynote - Go Against the Flow: Databases and Stream Pr...Kafka Summit SF 2017 - Keynote - Go Against the Flow: Databases and Stream Pr...
Kafka Summit SF 2017 - Keynote - Go Against the Flow: Databases and Stream Pr...
 
Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...
Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...
Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...
 
Uber Real Time Data Analytics
Uber Real Time Data AnalyticsUber Real Time Data Analytics
Uber Real Time Data Analytics
 
Apache Kafka and KSQL in Action: Let's Build a Streaming Data Pipeline!
Apache Kafka and KSQL in Action: Let's Build a Streaming Data Pipeline!Apache Kafka and KSQL in Action: Let's Build a Streaming Data Pipeline!
Apache Kafka and KSQL in Action: Let's Build a Streaming Data Pipeline!
 
Kafka Summit NYC 2017 - Every Message Counts: Kafka as a Foundation for Highl...
Kafka Summit NYC 2017 - Every Message Counts: Kafka as a Foundation for Highl...Kafka Summit NYC 2017 - Every Message Counts: Kafka as a Foundation for Highl...
Kafka Summit NYC 2017 - Every Message Counts: Kafka as a Foundation for Highl...
 
Thomas Lamirault_Mohamed Amine Abdessemed -A brief history of time with Apac...
Thomas Lamirault_Mohamed Amine Abdessemed  -A brief history of time with Apac...Thomas Lamirault_Mohamed Amine Abdessemed  -A brief history of time with Apac...
Thomas Lamirault_Mohamed Amine Abdessemed -A brief history of time with Apac...
 
Kafka Summit NYC 2017 - Stream it Together: 3 Realities of Modern Programming
Kafka Summit NYC 2017 - Stream it Together: 3 Realities of Modern ProgrammingKafka Summit NYC 2017 - Stream it Together: 3 Realities of Modern Programming
Kafka Summit NYC 2017 - Stream it Together: 3 Realities of Modern Programming
 
Putting the Micro into Microservices with Stateful Stream Processing
Putting the Micro into Microservices with Stateful Stream ProcessingPutting the Micro into Microservices with Stateful Stream Processing
Putting the Micro into Microservices with Stateful Stream Processing
 
Matching the Scale at Tinder with Kafka
Matching the Scale at Tinder with Kafka Matching the Scale at Tinder with Kafka
Matching the Scale at Tinder with Kafka
 
Kafka Summit SF 2017 - Database Streaming at WePay
Kafka Summit SF 2017 - Database Streaming at WePayKafka Summit SF 2017 - Database Streaming at WePay
Kafka Summit SF 2017 - Database Streaming at WePay
 
Introducing Apache Kafka's Streams API - Kafka meetup Munich, Jan 25 2017
Introducing Apache Kafka's Streams API - Kafka meetup Munich, Jan 25 2017Introducing Apache Kafka's Streams API - Kafka meetup Munich, Jan 25 2017
Introducing Apache Kafka's Streams API - Kafka meetup Munich, Jan 25 2017
 
Kafka Summit SF 2017 - Query the Application, Not a Database: “Interactive Qu...
Kafka Summit SF 2017 - Query the Application, Not a Database: “Interactive Qu...Kafka Summit SF 2017 - Query the Application, Not a Database: “Interactive Qu...
Kafka Summit SF 2017 - Query the Application, Not a Database: “Interactive Qu...
 
Kafka Summit SF 2017 - Fast Data in Supply Chain Planning
Kafka Summit SF 2017 - Fast Data in Supply Chain PlanningKafka Summit SF 2017 - Fast Data in Supply Chain Planning
Kafka Summit SF 2017 - Fast Data in Supply Chain Planning
 
You Must Construct Additional Pipelines: Pub-Sub on Kafka at Blizzard
You Must Construct Additional Pipelines: Pub-Sub on Kafka at Blizzard You Must Construct Additional Pipelines: Pub-Sub on Kafka at Blizzard
You Must Construct Additional Pipelines: Pub-Sub on Kafka at Blizzard
 
ETL as a Platform: Pandora Plays Nicely Everywhere with Real-Time Data Pipelines
ETL as a Platform: Pandora Plays Nicely Everywhere with Real-Time Data PipelinesETL as a Platform: Pandora Plays Nicely Everywhere with Real-Time Data Pipelines
ETL as a Platform: Pandora Plays Nicely Everywhere with Real-Time Data Pipelines
 
Taking a look under the hood of Apache Flink's relational APIs.
Taking a look under the hood of Apache Flink's relational APIs.Taking a look under the hood of Apache Flink's relational APIs.
Taking a look under the hood of Apache Flink's relational APIs.
 

Viewers also liked

Introduction to Kafka
Introduction to KafkaIntroduction to Kafka
Introduction to KafkaDucas Francis
 
Introduction to apache kafka
Introduction to apache kafkaIntroduction to apache kafka
Introduction to apache kafkaSamuel Kerrien
 
Stream Processing with Kafka in Uber, Danny Yuan
Stream Processing with Kafka in Uber, Danny Yuan Stream Processing with Kafka in Uber, Danny Yuan
Stream Processing with Kafka in Uber, Danny Yuan confluent
 
Kafka + Uber- The World’s Realtime Transit Infrastructure, Aaron Schildkrout
Kafka + Uber- The World’s Realtime Transit Infrastructure, Aaron SchildkroutKafka + Uber- The World’s Realtime Transit Infrastructure, Aaron Schildkrout
Kafka + Uber- The World’s Realtime Transit Infrastructure, Aaron Schildkroutconfluent
 
Reducing Microservice Complexity with Kafka and Reactive Streams
Reducing Microservice Complexity with Kafka and Reactive StreamsReducing Microservice Complexity with Kafka and Reactive Streams
Reducing Microservice Complexity with Kafka and Reactive Streamsjimriecken
 
Introduction to Apache Kafka
Introduction to Apache KafkaIntroduction to Apache Kafka
Introduction to Apache KafkaJeff Holoman
 
Microservices in the Apache Kafka Ecosystem
Microservices in the Apache Kafka EcosystemMicroservices in the Apache Kafka Ecosystem
Microservices in the Apache Kafka Ecosystemconfluent
 

Viewers also liked (7)

Introduction to Kafka
Introduction to KafkaIntroduction to Kafka
Introduction to Kafka
 
Introduction to apache kafka
Introduction to apache kafkaIntroduction to apache kafka
Introduction to apache kafka
 
Stream Processing with Kafka in Uber, Danny Yuan
Stream Processing with Kafka in Uber, Danny Yuan Stream Processing with Kafka in Uber, Danny Yuan
Stream Processing with Kafka in Uber, Danny Yuan
 
Kafka + Uber- The World’s Realtime Transit Infrastructure, Aaron Schildkrout
Kafka + Uber- The World’s Realtime Transit Infrastructure, Aaron SchildkroutKafka + Uber- The World’s Realtime Transit Infrastructure, Aaron Schildkrout
Kafka + Uber- The World’s Realtime Transit Infrastructure, Aaron Schildkrout
 
Reducing Microservice Complexity with Kafka and Reactive Streams
Reducing Microservice Complexity with Kafka and Reactive StreamsReducing Microservice Complexity with Kafka and Reactive Streams
Reducing Microservice Complexity with Kafka and Reactive Streams
 
Introduction to Apache Kafka
Introduction to Apache KafkaIntroduction to Apache Kafka
Introduction to Apache Kafka
 
Microservices in the Apache Kafka Ecosystem
Microservices in the Apache Kafka EcosystemMicroservices in the Apache Kafka Ecosystem
Microservices in the Apache Kafka Ecosystem
 

Similar to Seattle kafka meetup nov 2015 published siphon

Data Streaming with Apache Kafka & MongoDB - EMEA
Data Streaming with Apache Kafka & MongoDB - EMEAData Streaming with Apache Kafka & MongoDB - EMEA
Data Streaming with Apache Kafka & MongoDB - EMEAAndrew Morgan
 
Webinar: Data Streaming with Apache Kafka & MongoDB
Webinar: Data Streaming with Apache Kafka & MongoDBWebinar: Data Streaming with Apache Kafka & MongoDB
Webinar: Data Streaming with Apache Kafka & MongoDBMongoDB
 
BDX 2016- Monal daxini @ Netflix
BDX 2016-  Monal daxini  @ NetflixBDX 2016-  Monal daxini  @ Netflix
BDX 2016- Monal daxini @ NetflixIdo Shilon
 
Data Streaming with Apache Kafka & MongoDB
Data Streaming with Apache Kafka & MongoDBData Streaming with Apache Kafka & MongoDB
Data Streaming with Apache Kafka & MongoDBconfluent
 
Trivadis TechEvent 2016 Apache Kafka - Scalable Massage Processing and more! ...
Trivadis TechEvent 2016 Apache Kafka - Scalable Massage Processing and more! ...Trivadis TechEvent 2016 Apache Kafka - Scalable Massage Processing and more! ...
Trivadis TechEvent 2016 Apache Kafka - Scalable Massage Processing and more! ...Trivadis
 
Introduction to apache kafka, confluent and why they matter
Introduction to apache kafka, confluent and why they matterIntroduction to apache kafka, confluent and why they matter
Introduction to apache kafka, confluent and why they matterPaolo Castagna
 
Being Ready for Apache Kafka - Apache: Big Data Europe 2015
Being Ready for Apache Kafka - Apache: Big Data Europe 2015Being Ready for Apache Kafka - Apache: Big Data Europe 2015
Being Ready for Apache Kafka - Apache: Big Data Europe 2015Michael Noll
 
Netflix keystone streaming data pipeline @scale in the cloud-dbtb-2016
Netflix keystone   streaming data pipeline @scale in the cloud-dbtb-2016Netflix keystone   streaming data pipeline @scale in the cloud-dbtb-2016
Netflix keystone streaming data pipeline @scale in the cloud-dbtb-2016Monal Daxini
 
Introducing Confluent Cloud: Apache Kafka as a Service
Introducing Confluent Cloud: Apache Kafka as a Service Introducing Confluent Cloud: Apache Kafka as a Service
Introducing Confluent Cloud: Apache Kafka as a Service confluent
 
Apache kafka
Apache kafkaApache kafka
Apache kafkaMvkZ
 
Apache kafka
Apache kafkaApache kafka
Apache kafkaMvkZ
 
Apache kafka
Apache kafkaApache kafka
Apache kafkaMvkZ
 
Netflix Keystone—Cloud scale event processing pipeline
Netflix Keystone—Cloud scale event processing pipelineNetflix Keystone—Cloud scale event processing pipeline
Netflix Keystone—Cloud scale event processing pipelineMonal Daxini
 
Cloud lunch and learn real-time streaming in azure
Cloud lunch and learn real-time streaming in azureCloud lunch and learn real-time streaming in azure
Cloud lunch and learn real-time streaming in azureTimothy Spann
 
Apache Kafka - Scalable Message-Processing and more !
Apache Kafka - Scalable Message-Processing and more !Apache Kafka - Scalable Message-Processing and more !
Apache Kafka - Scalable Message-Processing and more !Guido Schmutz
 
F_1330_Narkhede_Kafka .pptx
F_1330_Narkhede_Kafka .pptxF_1330_Narkhede_Kafka .pptx
F_1330_Narkhede_Kafka .pptxNIMITJAIN71
 

Similar to Seattle kafka meetup nov 2015 published siphon (20)

Data Streaming with Apache Kafka & MongoDB - EMEA
Data Streaming with Apache Kafka & MongoDB - EMEAData Streaming with Apache Kafka & MongoDB - EMEA
Data Streaming with Apache Kafka & MongoDB - EMEA
 
Webinar: Data Streaming with Apache Kafka & MongoDB
Webinar: Data Streaming with Apache Kafka & MongoDBWebinar: Data Streaming with Apache Kafka & MongoDB
Webinar: Data Streaming with Apache Kafka & MongoDB
 
BDX 2016- Monal daxini @ Netflix
BDX 2016-  Monal daxini  @ NetflixBDX 2016-  Monal daxini  @ Netflix
BDX 2016- Monal daxini @ Netflix
 
Data Streaming with Apache Kafka & MongoDB
Data Streaming with Apache Kafka & MongoDBData Streaming with Apache Kafka & MongoDB
Data Streaming with Apache Kafka & MongoDB
 
Trivadis TechEvent 2016 Apache Kafka - Scalable Massage Processing and more! ...
Trivadis TechEvent 2016 Apache Kafka - Scalable Massage Processing and more! ...Trivadis TechEvent 2016 Apache Kafka - Scalable Massage Processing and more! ...
Trivadis TechEvent 2016 Apache Kafka - Scalable Massage Processing and more! ...
 
Introduction to apache kafka, confluent and why they matter
Introduction to apache kafka, confluent and why they matterIntroduction to apache kafka, confluent and why they matter
Introduction to apache kafka, confluent and why they matter
 
Being Ready for Apache Kafka - Apache: Big Data Europe 2015
Being Ready for Apache Kafka - Apache: Big Data Europe 2015Being Ready for Apache Kafka - Apache: Big Data Europe 2015
Being Ready for Apache Kafka - Apache: Big Data Europe 2015
 
Netflix keystone streaming data pipeline @scale in the cloud-dbtb-2016
Netflix keystone   streaming data pipeline @scale in the cloud-dbtb-2016Netflix keystone   streaming data pipeline @scale in the cloud-dbtb-2016
Netflix keystone streaming data pipeline @scale in the cloud-dbtb-2016
 
Introducing Confluent Cloud: Apache Kafka as a Service
Introducing Confluent Cloud: Apache Kafka as a Service Introducing Confluent Cloud: Apache Kafka as a Service
Introducing Confluent Cloud: Apache Kafka as a Service
 
Apache kafka
Apache kafkaApache kafka
Apache kafka
 
Apache kafka
Apache kafkaApache kafka
Apache kafka
 
Apache kafka
Apache kafkaApache kafka
Apache kafka
 
Apache kafka
Apache kafkaApache kafka
Apache kafka
 
Apache kafka
Apache kafkaApache kafka
Apache kafka
 
Apache kafka
Apache kafkaApache kafka
Apache kafka
 
Apache kafka
Apache kafkaApache kafka
Apache kafka
 
Netflix Keystone—Cloud scale event processing pipeline
Netflix Keystone—Cloud scale event processing pipelineNetflix Keystone—Cloud scale event processing pipeline
Netflix Keystone—Cloud scale event processing pipeline
 
Cloud lunch and learn real-time streaming in azure
Cloud lunch and learn real-time streaming in azureCloud lunch and learn real-time streaming in azure
Cloud lunch and learn real-time streaming in azure
 
Apache Kafka - Scalable Message-Processing and more !
Apache Kafka - Scalable Message-Processing and more !Apache Kafka - Scalable Message-Processing and more !
Apache Kafka - Scalable Message-Processing and more !
 
F_1330_Narkhede_Kafka .pptx
F_1330_Narkhede_Kafka .pptxF_1330_Narkhede_Kafka .pptx
F_1330_Narkhede_Kafka .pptx
 

More from Nitin Kumar

Deep learning with kafka
Deep learning with kafkaDeep learning with kafka
Deep learning with kafkaNitin Kumar
 
2019 04 seattle_meetup___kafka_machine_learning___kai_waehner
2019 04 seattle_meetup___kafka_machine_learning___kai_waehner2019 04 seattle_meetup___kafka_machine_learning___kai_waehner
2019 04 seattle_meetup___kafka_machine_learning___kai_waehnerNitin Kumar
 
Kafka meetup seattle 2019 mirus reliable, high performance replication for ap...
Kafka meetup seattle 2019 mirus reliable, high performance replication for ap...Kafka meetup seattle 2019 mirus reliable, high performance replication for ap...
Kafka meetup seattle 2019 mirus reliable, high performance replication for ap...Nitin Kumar
 
Processing trillions of events per day with apache
Processing trillions of events per day with apacheProcessing trillions of events per day with apache
Processing trillions of events per day with apacheNitin Kumar
 
Ren cao kafka connect
Ren cao   kafka connectRen cao   kafka connect
Ren cao kafka connectNitin Kumar
 
Insta clustr seattle kafka meetup presentation bb
Insta clustr seattle kafka meetup presentation   bbInsta clustr seattle kafka meetup presentation   bb
Insta clustr seattle kafka meetup presentation bbNitin Kumar
 
EventHub for kafka ecosystems kafka meetup
EventHub for kafka ecosystems   kafka meetupEventHub for kafka ecosystems   kafka meetup
EventHub for kafka ecosystems kafka meetupNitin Kumar
 
Microsoft challenges of a multi tenant kafka service
Microsoft challenges of a multi tenant kafka serviceMicrosoft challenges of a multi tenant kafka service
Microsoft challenges of a multi tenant kafka serviceNitin Kumar
 
Net flix kafka seattle meetup
Net flix kafka seattle meetupNet flix kafka seattle meetup
Net flix kafka seattle meetupNitin Kumar
 
Brandon obrien streaming_data
Brandon obrien streaming_dataBrandon obrien streaming_data
Brandon obrien streaming_dataNitin Kumar
 
Confluent kafka meetupseattle jan2017
Confluent kafka meetupseattle jan2017Confluent kafka meetupseattle jan2017
Confluent kafka meetupseattle jan2017Nitin Kumar
 
Microsoft kafka load imbalance
Microsoft   kafka load imbalanceMicrosoft   kafka load imbalance
Microsoft kafka load imbalanceNitin Kumar
 
Map r seattle streams meetup oct 2016
Map r seattle streams meetup   oct 2016Map r seattle streams meetup   oct 2016
Map r seattle streams meetup oct 2016Nitin Kumar
 
Linked in multi tier, multi-tenant, multi-problem kafka
Linked in multi tier, multi-tenant, multi-problem kafkaLinked in multi tier, multi-tenant, multi-problem kafka
Linked in multi tier, multi-tenant, multi-problem kafkaNitin Kumar
 

More from Nitin Kumar (16)

Deep learning with kafka
Deep learning with kafkaDeep learning with kafka
Deep learning with kafka
 
2019 04 seattle_meetup___kafka_machine_learning___kai_waehner
2019 04 seattle_meetup___kafka_machine_learning___kai_waehner2019 04 seattle_meetup___kafka_machine_learning___kai_waehner
2019 04 seattle_meetup___kafka_machine_learning___kai_waehner
 
Kafka meetup seattle 2019 mirus reliable, high performance replication for ap...
Kafka meetup seattle 2019 mirus reliable, high performance replication for ap...Kafka meetup seattle 2019 mirus reliable, high performance replication for ap...
Kafka meetup seattle 2019 mirus reliable, high performance replication for ap...
 
Processing trillions of events per day with apache
Processing trillions of events per day with apacheProcessing trillions of events per day with apache
Processing trillions of events per day with apache
 
Ren cao kafka connect
Ren cao   kafka connectRen cao   kafka connect
Ren cao kafka connect
 
Insta clustr seattle kafka meetup presentation bb
Insta clustr seattle kafka meetup presentation   bbInsta clustr seattle kafka meetup presentation   bb
Insta clustr seattle kafka meetup presentation bb
 
EventHub for kafka ecosystems kafka meetup
EventHub for kafka ecosystems   kafka meetupEventHub for kafka ecosystems   kafka meetup
EventHub for kafka ecosystems kafka meetup
 
Kafka eos
Kafka eosKafka eos
Kafka eos
 
Microsoft challenges of a multi tenant kafka service
Microsoft challenges of a multi tenant kafka serviceMicrosoft challenges of a multi tenant kafka service
Microsoft challenges of a multi tenant kafka service
 
Net flix kafka seattle meetup
Net flix kafka seattle meetupNet flix kafka seattle meetup
Net flix kafka seattle meetup
 
Avvo fkafka
Avvo fkafkaAvvo fkafka
Avvo fkafka
 
Brandon obrien streaming_data
Brandon obrien streaming_dataBrandon obrien streaming_data
Brandon obrien streaming_data
 
Confluent kafka meetupseattle jan2017
Confluent kafka meetupseattle jan2017Confluent kafka meetupseattle jan2017
Confluent kafka meetupseattle jan2017
 
Microsoft kafka load imbalance
Microsoft   kafka load imbalanceMicrosoft   kafka load imbalance
Microsoft kafka load imbalance
 
Map r seattle streams meetup oct 2016
Map r seattle streams meetup   oct 2016Map r seattle streams meetup   oct 2016
Map r seattle streams meetup oct 2016
 
Linked in multi tier, multi-tenant, multi-problem kafka
Linked in multi tier, multi-tenant, multi-problem kafkaLinked in multi tier, multi-tenant, multi-problem kafka
Linked in multi tier, multi-tenant, multi-problem kafka
 

Recently uploaded

Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 

Recently uploaded (20)

Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 

Seattle kafka meetup nov 2015 published siphon

  • 1. Siphon - Kafka as DataBus in Microsoft Nitin Kumar (nitin.kumar@Microsoft.com) Dev Manager, Microsoft https://www.linkedin.com/in/nikuma
  • 2. Agenda • Scale: Kafka at Microsoft (Bing, Ads, Office) • Use Case: NRT Customer facing reports • Kafka based Streaming Solution • Collector • Consumer Restful APIs • Monitoring: Canary/Audit Trail • Production Experience • Key Takeaways Wednesday, March 16, 2016
  • 3. Scale: Kafka at Microsoft (Ads, Bing, Office) Kafka Brokers 1000+ across 5 Datacenters Operating System Windows Server 2012 R2 Hardware Spec 12 Cores, 32 GB RAM, 4x2 TB HDD (JBOD), 10 GB Network Incoming Events 1 million per sec, (90 Billion per day, 100 TB per day) Outgoing Events 5 million per sec, (1 Trillion per day, 500 TB per day) Kafka Topics/Partitions 50+/5000+ Kafka version 0.8.1.1 (3 way replication) Wednesday, March 16, 2016
  • 4. Problem Wednesday, March 16, 2016 Serving System{Q} {R} Online Fraud Detection ML Classification Aggregation Reporting DB Keyword 1.5 hours 2.5 hours Advertiser Feature Extraction 300 GB/h 200+ Features Stats 25 TB Log Collection Sorting / Partitioning What is the click through rate of my ad, that launched at 5pm?
  • 5. Goals / Design Considerations Wednesday, March 16, 2016 Reduce latency from 4 hours to 15 minutes 99.8% Log completeness Guarantees Check pointing & Failure recovery Exactly Once Semantic Highly Available, Scalable and rolling upgrade Reusing Existing C# Libraries
  • 6. Siphon DataBus Solution {Q} {R} Kafka Audit ML Classification Aggregation Reporting DB Keyword 1-2 sec (Minimize latency) < 15 minutes Advertiser Feature Extraction 100 MBPS 200+ Features Stats 25 TB Wednesday, March 16, 2016 Serving System Online Fraud Detection Kafka as a distributed Queue StreamScope as a distributed processing system StreamScope
  • 7. Siphon Wednesday, March 16, 2016 Asia DC Zookeeper Canary Kafka Collector Agent Services Data Pull (Agent) Services Data Push Device Proxy Services Consumer API (Push/ Pull) Europe DC Zookeeper Canary Kafka US DC Zookeeper Canary Kafka Streaming Batch Audit Trail Open Source Microsoft Internal Siphon
  • 8. Collector – Data Ingestion (Producers) • Http(s) Server • Restful API with SSL support. • Abstraction from Kafka internals (Partition, Kafka version) • Throttling, QPS Monitoring • PII scrubbing • Load balancing/failover Device Proxy Services Collector Kafka Brokers Broker Broker Broker Broker P0 P1 P2 P3 P4 P5 P6 P7 P8 P9 P10 P11 Collector Collector LoadBalancer Services Data Push Agent Services Data Pull (Agent) Wednesday, March 16, 2016 Open Source Microsoft Internal Siphon
  • 9. Consumer API (Push/Pull) • Restful Pull API – Simple consumer • Config driven subscriptions for preconfigured sinks like (HDFS, Cosmos, ELK). Wednesday, March 16, 2016 Config (ZK) Executor Kafka .NET Library Kafka Supported destinations – • Cosmos • Elastic Search • Kafka • HDFS
  • 10. High Level Consumer Monitoring using Canary, Audit Trail Device Proxy Services Collector Kafka Brokers Broker Broker Broker Broker P0 P1 P2 P3 P4 P5 P6 P7 P8 P9 P10 P11 Collector Collector LoadBalancer Services Data Push Agent Services Data Pull (Agent) Wednesday, March 16, 2016 Synthetic message Audit Trail
  • 11. Production Experience • System in production for 15 months • End to End Advertiser report latency of 12+ minutes. • Other use cases from Office, Bing. • Integration with other streaming systems – Storm, Spark. • Monitoring using ELK Wednesday, March 16, 2016
  • 12. Key Takeaways • Scale out with Kafka (50K -> 1M -> multi-million Events Per sec) • Ability to build tunable Auditing/Monitoring • Producer/Consumer Restful API provides a nice abstraction • Config driven Pub/Sub system Wednesday, March 16, 2016
  • 13. “We are Hiring.” Thank You Nitin Kumar (nitin.kumar@Microsoft.com) https://www.linkedin.com/in/nikuma Wednesday, March 16, 2016

Editor's Notes

  1. Client Agent generates a unique BatchId for every batch and appends it to the Extended Header. Client Agent sends a “produced” audit message (BatchId, DateTime, Number of Records) to the audit system. Each consumer, upon receiving the batch, de-serialize the header to extract BatchId and sends a “consumed” audit signal to the Audit system. Audit system compares produced vs consumed audits every 5 minutes and raise alerts.