Alan Ngai, CTO/Co-Founder, OpsClarity
OpsClarity is a performance monitoring solution for stream processing applications. In addition to deep component monitoring, it uses data science to proactively identify anomalies across the entire data pipeline and correlates issues across the data and application tiers to surface common concerns that impact the business. OpsClarity automatically discovers the entire application and data topology, and leverages the rich metadata and network dependency context captured through that topology to provide deep analysis and fast, correlated troubleshooting. This talk will additionally cover integration with Apache Apex.
2. Businesses are Turning to Data-First Applications
AD Network – Real-time bidding
DDoS Attack Prevention
Fraud Detection
Internet of Things
Financial Services
Real-time Personalization
3. Data-First Application: Many Moving Parts!
Diagram: data source → message broker → stream processor → data sink → applications (the data pipeline), running on elastic infrastructure, with business logic as microservices code.
7. Characteristics of Data Pipelines
• Heterogeneous Components
• Highly Complex
• Highly Inter-dependent
8. Characteristics of Data Pipelines
• Heterogeneous Components
• Highly Interdependent
• Highly Complex
• Painful to Monitor and Debug
9. Put Data In One Place (don’t rely on this)
Kafka Web Console Spark UI Marvel (Elasticsearch)
Ambari (Hadoop) Ganglia Nagios
10. Organize Your Concerns Horizontally
• Throughput: stuff per unit of time
• Latency: how long it takes to process stuff
• Error Rate: how frequently bad stuff happens
• Buffered: how much stuff is piled up
• Data Loss: how much stuff is being lost
• Duplication: how much stuff is being duplicated
Matters for all stages in a pipeline!
Matters for all business use cases too!
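As a minimal sketch of how these horizontal concerns might be computed for a single pipeline stage (the `Event` fields and `horizontal_concerns` helper here are hypothetical illustrations, not OpsClarity's API; data loss and duplication additionally require comparing counts across stages, so they are noted but not computed):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Event:
    enqueued_at: float                    # when the event entered this stage (seconds)
    processed_at: Optional[float] = None  # None means it is still buffered
    ok: bool = True                       # False if processing failed

def horizontal_concerns(events, window_seconds):
    """Summarize one pipeline stage over a time window.

    Data loss / duplication are cross-stage concerns: compare the count
    emitted by the upstream stage with the count observed here.
    """
    done = [e for e in events if e.processed_at is not None]
    throughput = len(done) / window_seconds                       # stuff per unit of time
    latency = (sum(e.processed_at - e.enqueued_at for e in done)
               / len(done)) if done else 0.0                      # time to process stuff
    error_rate = (sum(1 for e in done if not e.ok)
                  / len(done)) if done else 0.0                   # how often bad stuff happens
    buffered = len(events) - len(done)                            # stuff piled up
    return {"throughput": throughput, "latency": latency,
            "error_rate": error_rate, "buffered": buffered}
```

Computing the same few numbers at every stage is what lets you compare the heterogeneous components of a pipeline on an equal footing.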
11. Organize Your Concerns Horizontally
• Throughput
• Latency
• Error Rate
• Buffered
• Data Loss
• Duplication
13. …And Also Vertically
Data Health: throughput, latency, errors?
Dependency Health: Are Kafka and ZooKeeper healthy?
Service Health: Is the Storm Master healthy? Are there adequate resources in the cluster?
Application Health: Are my application KPIs within normal range?
Job/Topology Health: Is my job well distributed in the cluster? Are job counters normal?
Node Service Health: Are all jobs running on this node normal?
Node System Health: Are key system metrics (CPU, memory, network, disk I/O) normal?
15. What We Talked About
• Data-First Applications Are Becoming a Thing
• Monitoring Data-First Applications is Hard!
• Get Your Metrics In One Place
• Organize Your Data Horizontally and Vertically
Intro
CTO/Cofounder of OpsClarity
At OpsClarity, we've built an intelligent monitoring platform for stream processing applications
Today we’ll be talking about lessons we’ve learned monitoring and troubleshooting real-time pipelines
Talk time: 1:30
- First things first: let's talk about data first applications
- Any application architecture whose primary purpose is to extract value out of data
Business domains where data is key to solving their associated problems
What does an application that solves problems in this space look like?
Let’s break down the pieces
key part: Real time Data pipeline
In fact, what makes this even more difficult to monitor is that you often have multiple applications running on top of the same pipeline infrastructure
For example, user interaction report, price optimizer, fraud detection, there's often an overlap of the input data
Elastic infra
microservices:
account management
other web applications
That + Business services + infrastructure = data first app
Talk time: 2:00
Total: 5:00
Topo Discovery: discover and build the service cluster topology in real time
Anomaly Detection: automatically baseline metrics and discover anomalies
Health Aggregation: combine port checks, HTTP checks, and metric anomaly detection into host health, and then into service cluster health
Talk time: 0:30
Total: 5:30
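A toy illustration of the "automatically baseline" idea from the notes above: a rolling mean/standard-deviation band with a deviation threshold. This is a generic baselining technique sketched for the talk, not OpsClarity's actual anomaly detection algorithm; the window and threshold values are arbitrary assumptions.

```python
from collections import deque
import math

class Baseline:
    """Rolling mean/std baseline; flags points far outside the learned band."""

    def __init__(self, window=60, threshold=3.0, min_history=10):
        self.samples = deque(maxlen=window)  # recent metric values
        self.threshold = threshold           # how many stds count as anomalous
        self.min_history = min_history       # don't judge until we've seen enough

    def observe(self, value):
        """Record `value`; return True if it is anomalous vs. recent history."""
        anomalous = False
        if len(self.samples) >= self.min_history:
            mean = sum(self.samples) / len(self.samples)
            var = sum((s - mean) ** 2 for s in self.samples) / len(self.samples)
            std = math.sqrt(var)
            # Guard against a zero-variance history with a tiny floor.
            anomalous = abs(value - mean) > self.threshold * max(std, 1e-9)
        self.samples.append(value)
        return anomalous
```

Running one such baseline per metric, per component, is what turns raw pipeline metrics into the health signals that the tiers above aggregate.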
Any data pipeline that you build today will include a number of these components, as well as some that are not included here
**audience interaction**: Whose companies are using some of these technologies?
ex: 0mq, flume, flink, datatorrent, heron, solr , etc.
Talk time: 1:30
Total: 7:00
Scenario: device data -> api gateway -> message broker -> data analysis -> data store <- node <- analytics reports
**audience interaction**: What if you notice wonky data in the reports you’re seeing? Where could the problem be?
- time consuming process to debug
- chances are you have different people responsible for different parts of the system, which makes it that much harder to troubleshoot
Talk time: 0:30
Total: 9:30
Get all your data in one place
Organize your data so that you can look at cross component concerns
Organize your data so that you can explore data hierarchically (data, dependency, app, etc.)
Talk time: 1:30
Total: 12:00
Throughput: clicks per second, docs per second, network bytes in/out
Latency: processing time, queued time
Error Rates: exception count, 500 errors
regardless of what business problem you're solving, as long as you have a data pipeline, you care about these concerns
these concerns also matter at every stage of your pipeline
Talk time: 1:00
Total: 13:00
Talk through throughput metrics for each component
Point out that noise from other metrics (cpu, load, mem, etc.) is filtered out
**audience interaction**: Where do you think the problem is?
That’s right, play!
We can do the same for latency, error rate, etc.
Talk time: 0:30
Total: 13:30
Remember this slide?
In order to make sense of the metrics from different sub-components, we need to organize them into tiers
Talk time: 1:30
Total: 15:00