Alan Ngai, CTO/Co-Founder, OpsClarity
OpsClarity is a performance monitoring solution for stream processing applications. In addition to deep component monitoring, it uses data science to proactively identify anomalies across the entire data pipeline and correlates issues across the data and application tiers to surface common concerns that impact the business. OpsClarity automatically discovers the entire application and data topology, and leverages the rich metadata and network dependency context captured through that topology to provide deep analysis and fast, correlated troubleshooting. This talk will additionally cover integration with Apache Apex.
2. Businesses are Turning to Data-First Applications
AD Network – Real-time bidding
DDoS Attack Prevention
Fraud Detection
Internet of Things
Financial Services
Real-time Personalization
3. Data-First Application: Many Moving Parts!
Diagram: data source → message broker → stream processor → data sink → applications (the data pipeline), running on elastic infrastructure, with business logic as microservices code.
7. Characteristics of Data Pipelines
• Heterogeneous Components
• Highly Complex
• Highly Inter-dependent
8. Characteristics of Data Pipelines
• Heterogeneous Components
• Highly Interdependent
• Highly Complex
• Painful to Monitor and Debug
9. Put Data In One Place (don’t rely on this)
Kafka Web Console Spark UI Marvel (Elasticsearch)
Ambari (Hadoop) Ganglia Nagios
10. Organize Your Concerns Horizontally
• Throughput: stuff per unit of time
• Latency: how long it takes to process stuff
• Error Rate: how frequently bad stuff happens
• Buffered: how much stuff is piled up
• Data Loss: how much stuff is being lost
• Duplication: how much stuff is being duplicated
Matters for all stages in a pipeline!
Matters for all business use cases too!
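As a minimal sketch of how these horizontal concerns might be computed for a single pipeline stage (the `Event` fields and `horizontal_concerns` helper here are hypothetical illustrations, not OpsClarity's API; data loss and duplication additionally require comparing counts across stages, so they are noted but not computed):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Event:
    enqueued_at: float                    # when the event entered this stage (seconds)
    processed_at: Optional[float] = None  # None means it is still buffered
    ok: bool = True                       # False if processing failed

def horizontal_concerns(events, window_seconds):
    """Summarize one pipeline stage over a time window.

    Data loss / duplication are cross-stage concerns: compare the count
    emitted by the upstream stage with the count observed here.
    """
    done = [e for e in events if e.processed_at is not None]
    throughput = len(done) / window_seconds                       # stuff per unit of time
    latency = (sum(e.processed_at - e.enqueued_at for e in done)
               / len(done)) if done else 0.0                      # time to process stuff
    error_rate = (sum(1 for e in done if not e.ok)
                  / len(done)) if done else 0.0                   # how often bad stuff happens
    buffered = len(events) - len(done)                            # stuff piled up
    return {"throughput": throughput, "latency": latency,
            "error_rate": error_rate, "buffered": buffered}
```

Computing the same few numbers at every stage is what lets you compare the heterogeneous components of a pipeline on an equal footing.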
11. Organize Your Concerns Horizontally
• Throughput
• Latency
• Error Rate
• Buffered
• Data Loss
• Duplication
13. …And Also Vertically
Data Health: throughput, latency, errors?
Dependency Health: Are Kafka and ZooKeeper healthy?
Service Health: Is the Storm Master healthy? Are there adequate resources in the cluster?
Application Health: Are my application KPIs within normal range?
Job/Topology Health: Is my job well distributed in the cluster? Are job counters normal?
Node Service Health: Are all jobs running on this node normal?
Node System Health: Are key system metrics (CPU, memory, network, disk I/O) normal?
15. What We Talked About
• Data-First Applications Are Becoming a Thing
• Monitoring Data-First Applications is Hard!
• Get Your Metrics In One Place
• Organize Your Data Horizontally and Vertically
Intro
CTO/Cofounder of OpsClarity
At OpsClarity, we've built an intelligent monitoring platform for stream processing applications
Today we’ll be talking about lessons we’ve learned monitoring and troubleshooting real-time pipelines
Talk time: 1:30
- First things first: let's talk about data first applications
- Any application architecture whose primary purpose is to extract value out of data
Business domains where data is key to solving their associated problems
What does an application that solves problems in this space look like?
Let’s break down the pieces
key part: Real time Data pipeline
In fact, what makes this even more difficult to monitor is that you often have multiple applications running on top of the same pipeline infrastructure
For example, user interaction report, price optimizer, fraud detection, there's often an overlap of the input data
Elastic infra
microservices:
account management
other web applications
That + Business services + infrastructure = data first app
Talk time: 2:00
Total: 5:00
Topo Discovery: discover and build the service cluster topology in real time
Anomaly Detection: automatically baseline metrics and discover anomalies
Health Aggregation: combine port checks, HTTP checks, and metric anomaly detection into host health, and then into service cluster health
Talk time: 0:30
Total: 5:30
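A toy illustration of the "automatically baseline" idea from the notes above: a rolling mean/standard-deviation band with a deviation threshold. This is a generic baselining technique sketched for the talk, not OpsClarity's actual anomaly detection algorithm; the window and threshold values are arbitrary assumptions.

```python
from collections import deque
import math

class Baseline:
    """Rolling mean/std baseline; flags points far outside the learned band."""

    def __init__(self, window=60, threshold=3.0, min_history=10):
        self.samples = deque(maxlen=window)  # recent metric values
        self.threshold = threshold           # how many stds count as anomalous
        self.min_history = min_history       # don't judge until we've seen enough

    def observe(self, value):
        """Record `value`; return True if it is anomalous vs. recent history."""
        anomalous = False
        if len(self.samples) >= self.min_history:
            mean = sum(self.samples) / len(self.samples)
            var = sum((s - mean) ** 2 for s in self.samples) / len(self.samples)
            std = math.sqrt(var)
            # Guard against a zero-variance history with a tiny floor.
            anomalous = abs(value - mean) > self.threshold * max(std, 1e-9)
        self.samples.append(value)
        return anomalous
```

Running one such baseline per metric, per component, is what turns raw pipeline metrics into the health signals that the tiers above aggregate.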
Any data pipeline that you build today will include a number of these components, as well as some that are not included here
**audience interaction**: Whose companies are using some of these technologies?
ex: 0mq, flume, flink, datatorrent, heron, solr , etc.
Talk time: 1:30
Total: 7:00
Scenario: device data -> api gateway -> message broker -> data analysis -> data store <- node <- analytics reports
**audience interaction**: What if you notice wonky data in the reports you’re seeing? Where could the problem be?
- time consuming process to debug
- chances are you have different people responsible for different parts of the system, which makes it that much harder to troubleshoot
Talk time: 0:30
Total: 9:30
Get all your data in one place
Organize your data so that you can look at cross component concerns
Organize your data so that you can explore data hierarchically (data, dependency, app, etc.)
Talk time: 1:30
Total: 12:00
Throughput: clicks per second, docs per second, network bytes in/out
Latency: processing time, queued time
Error Rates: exception count, 500 errors
regardless of what business problem you're solving, as long as you have a data pipeline, you care about these concerns
these concerns also matter at every stage of your pipeline
Talk time: 1:00
Total: 13:00
Talk through throughput metrics for each component
Point out that noise from other metrics (cpu, load, mem, etc.) is filtered out
**audience interaction**: Where do you think the problem is?
That’s right, play!
We can do the same for latency, error rate, etc.
Talk time: 0:30
Total: 13:30
Remember this slide?
In order to make sense of the metrics from different sub-components, we need to organize them into tiers
Talk time: 1:30
Total: 15:00