Netflix
Performance Meetup
Global Client Performance
Fast Metrics
3G in Kazakhstan
● Global Internet:
● faster (better networking)
● slower (broader reach, congestion)
● Don't wait for it, measure it and deal
● Working app > Feature rich app
Making the Internet fast
is slow.
We need to know what the Internet looks like,
without averages, seeing the full distribution.
● Sampling
○ Missed data
○ Rare events
○ Problems aren’t equal in population
● Averages
○ Can't see the distribution
○ Outliers heavily distort: ∞, 0, negatives, errors
Logging Anti-Patterns
Instead, use the client as a map-reducer and send up aggregated data, less often.
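One way to picture the client acting as a map-reducer is sketched below in Python. The bucket edges, the class name `ClientHistogram` and the payload shape are illustrative assumptions, not Netflix's client code. The point is only that many raw samples collapse into a few counters before anything is uploaded, and that errors and nonsensical values are counted separately rather than averaged in.

```python
import bisect
import json

# Hypothetical bucket upper edges in milliseconds (an assumption for illustration).
BUCKET_EDGES_MS = [50, 100, 200, 400, 800, 1600, 3200]


class ClientHistogram:
    """Aggregates latency samples on the client into fixed buckets."""

    def __init__(self, edges=BUCKET_EDGES_MS):
        self.edges = list(edges)
        # One extra bucket catches everything above the last edge.
        self.counts = [0] * (len(self.edges) + 1)
        self.errors = 0

    def record(self, latency_ms):
        # Missing, negative and NaN values are counted as errors so they
        # cannot distort the distribution (NaN is the only value != itself).
        if latency_ms is None or latency_ms < 0 or latency_ms != latency_ms:
            self.errors += 1
            return
        self.counts[bisect.bisect_left(self.edges, latency_ms)] += 1

    def flush(self):
        """Returns a compact JSON payload and resets the counters."""
        payload = json.dumps({"edges": self.edges, "counts": self.counts,
                              "errors": self.errors})
        self.counts = [0] * (len(self.edges) + 1)
        self.errors = 0
        return payload


hist = ClientHistogram()
for sample in [30, 75, 75, 120, 950, 5000, -1, None]:
    hist.record(sample)
print(hist.flush())
```

Eight samples become one small payload here; in practice the flush interval would trade freshness against upload volume.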
Sizing up the Internet.
Infinite (free) compute power!
● Calculate the inverse empirical cumulative distribution function by math.
Get median, 95th, etc.
○ ...or just use R, which is free and already knows how to do it
> library(HistogramTools)
> iecdf <- HistToEcdf(histogram, method='linear', inverse=TRUE)
> iecdf(0.5)
[1] 0.7975309 # median
> iecdf(0.95)
[1] 4.65 # 95th percentile
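For readers without R at hand, the "by math" route can be sketched in Python. This is a minimal version under one stated assumption: samples are spread uniformly inside each bucket, so the ECDF is piecewise linear between bucket edges (comparable in spirit to `method='linear'` above). The function name and the sample histogram are made up for illustration.

```python
from itertools import accumulate


def inverse_ecdf(breaks, counts):
    """Builds q -> value from histogram breaks (len n+1) and counts (len n)."""
    total = sum(counts)
    if total == 0:
        raise ValueError("empty histogram")
    # Cumulative fraction of samples at each break, starting at 0.0.
    cum = [0.0] + [c / total for c in accumulate(counts)]

    def quantile(q):
        if not 0.0 <= q <= 1.0:
            raise ValueError("q must be in [0, 1]")
        for i in range(len(counts)):
            lo, hi = cum[i], cum[i + 1]
            # Skip empty buckets; interpolate inside the first one reaching q.
            if counts[i] > 0 and q <= hi:
                frac = (q - lo) / (hi - lo)
                return breaks[i] + frac * (breaks[i + 1] - breaks[i])
        return breaks[-1]

    return quantile


# Made-up histogram: bucket edges in seconds, sample counts per bucket.
breaks = [0.0, 0.5, 1.0, 2.0, 5.0]
counts = [20, 40, 30, 10]
iecdf = inverse_ecdf(breaks, counts)
print(iecdf(0.5), iecdf(0.95))
```

The accuracy of any percentile read this way is limited by the bucket width around it, which is a reason to choose finer buckets where the interesting percentiles tend to fall.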
Data > Opinions.
Better than debating opinions.
Architecture is hard. Make it cheap to experiment where your users really are.
"There's no way that the client makes that many requests."
"No one really minds the spinner."
"Why should we spend time on that instead of COOLFEATURE?"
"We live in a 50ms world!"
We built Daedalus
US
Elsewhere
Fast
Slow
DNS Time
● Visual → numerical: need the IECDF for percentiles
○ ƒ(0.50) = 50th (median)
○ ƒ(0.95) = 95th
● Cluster to get pretty colors, i.e. to group similar experiences (k-means, hierarchical, etc.)
Interpret the data
● Go there!
● Abstract analysis - hard
● Feeling reality is much simpler than looking at graphs. Build!
Practical Teleportation.
Make a Reality Lab.
Don't guess.
Developing a model based on production data, without missing the distribution of samples (network, render, responsiveness), will lead to better software.
Global reach doesn't need to be scary. @gcirino42 http://blogofsomeguy.com
Icarus
Martin Spier
@spiermar
Performance Engineering @ Netflix
Problem & Motivation
● Real-user performance monitoring solution
● More insight into the App performance (as perceived by real users)
● Too many variables to trust synthetic tests and labs
● Prioritize work around App performance
● Track App improvement progress over time
● Detect issues, internal and external
Device Diversity
● Netflix runs on all sorts of devices
● Smart TVs, Gaming Consoles, Mobile Phones, Cable TV boxes, ...
● Consistently evaluate performance
What are we monitoring?
● User Actions (or things users do in the App)
● App Startup
● User Navigation
● Playing a Title
● Internal App metrics
What are we measuring?
● When does the timer start and stop?
● Time-to-Interactive (TTI)
○ Interactive, even if some items were not fully loaded and rendered
● Time-to-Render (TTR)
○ Everything above the fold (visible without scrolling) is rendered
● Play Delay
● Meaningful for what we are monitoring
High-dimensional Data
● Complex device categorization
● Geo regions, subregions, countries
● Highly granular network classifications
● High volume of A/B tests
● Different facets of the same user action
○ Cold, suspended and backgrounded App startups
○ Target view/page on App startup
Data Sketches
● Data structures that approximately resemble a much larger data set
● Preserve essential features!
● Significantly smaller!
● Faster to operate on!
t-Digest
● t-Digest data structure
● Rank-based statistics (such as quantiles)
● Parallel friendly (can be merged!)
● Very fast!
● Really accurate!
https://github.com/tdunning/t-digest
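The merge property is what makes a sketch practical across devices and time windows. The Python below is a deliberately simplified centroid summary, not Dunning's t-digest algorithm (which sizes centroids by quantile so that the tails stay accurate). The class name, the `max_centroids` parameter and the closest-pair compression rule are assumptions made for illustration only.

```python
class CentroidSketch:
    """Toy mergeable quantile sketch: a bounded list of [mean, weight] centroids."""

    def __init__(self, max_centroids=50):
        self.max_centroids = max_centroids
        self.centroids = []  # kept sorted by mean

    def add(self, value, weight=1):
        self.centroids.append([float(value), weight])
        self.centroids.sort()
        self._compress()

    def merge(self, other):
        """Folds another sketch into this one."""
        self.centroids = sorted(self.centroids + [c[:] for c in other.centroids])
        self._compress()

    def _compress(self):
        # Repeatedly merge the two adjacent centroids that are closest together.
        while len(self.centroids) > self.max_centroids:
            gaps = [self.centroids[i + 1][0] - self.centroids[i][0]
                    for i in range(len(self.centroids) - 1)]
            i = gaps.index(min(gaps))
            (m1, w1), (m2, w2) = self.centroids[i], self.centroids[i + 1]
            self.centroids[i:i + 2] = [[(m1 * w1 + m2 * w2) / (w1 + w2), w1 + w2]]

    def quantile(self, q):
        """Approximate q-quantile: mean of the centroid holding the q-th weight."""
        total = sum(w for _, w in self.centroids)
        if total == 0:
            raise ValueError("empty sketch")
        target = q * total
        seen = 0
        for mean, weight in self.centroids:
            seen += weight
            if seen >= target:
                return mean
        return self.centroids[-1][0]


# Two sketches built independently (say, on two hosts), then merged.
a, b = CentroidSketch(), CentroidSketch()
for v in range(0, 500):
    a.add(v)
for v in range(500, 1000):
    b.add(v)
a.merge(b)
print(len(a.centroids), a.quantile(0.5))
```

The merged sketch holds at most 50 centroids for 1,000 samples, and its total weight is preserved, so quantiles of the combined population can be read without revisiting raw data.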
+ t-Digest sketches
iOS Median Comparison, Break by Country
iOS Median Comparison, Break by Country + iPhone 6S Plus
CDFs by UI Version
Warm Startup Rate
A/B Cell Comparison
Anomaly Detection
Going Forward
● Resource utilization metrics
● Device profiling
○ Instrumenting client code
● Explore other visualizations
○ Frequency heat maps
● Connection between perceived performance, acquisition and retention
@spiermar
Netflix
Autoscaling for experts
Vadim
● Mid-tier stateless services are ~2/3 of the total
● Savings - 30% of mid-tier footprint (roughly 30K instances)
○ Higher savings if we break it down by region
○ Even higher savings on services that scale well
Savings!
Why we autoscale - philosophical reasons
Why we autoscale - pragmatic reasons
● Encoding
● Precompute
● Failover
● Red/black pushes
● Curing cancer**
● And more...
** Hack-day project
Should you autoscale?
Benefits
● On-demand capacity: direct $$ savings
● RI capacity: re-purposing spare capacity
However, for each server group, beware of
● Uneven distribution of traffic
● Sticky traffic
● Bursty traffic
● Small ASG sizes (<10)
Autoscaling impacts availability - true or false?
* If done correctly
Under-provisioning, however, can impact availability
● Autoscaling is not a problem
● The real problem is not knowing performance characteristics of the service
AWS autoscaling mechanics
Aggregated metric feed → CloudWatch alarm → Notification → ASG scaling policy
Tunables
● Metric
● Threshold
● # of eval periods
● Scaling amount
● Warmup time
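How these tunables interact can be sketched as a small simulation. This is a simplified model written for illustration, not the CloudWatch or Auto Scaling API: the function name, the per-period metric list and the fixed `scaling_amount` are assumptions, and warmup is modeled as a plain number of periods during which no further scaling happens.

```python
def simulate_scale_up(metric_per_period, threshold, eval_periods,
                      scaling_amount, warmup_periods, start_instances):
    """Returns the instance count after each evaluation period."""
    instances = start_instances
    breaches = 0      # consecutive periods above the threshold
    warmup_left = 0   # periods before new capacity counts as ready
    history = []
    for value in metric_per_period:
        if warmup_left > 0:
            # New instances are still warming up; do not scale again yet.
            warmup_left -= 1
            breaches = 0
        else:
            breaches = breaches + 1 if value > threshold else 0
            if breaches >= eval_periods:
                instances += scaling_amount
                warmup_left = warmup_periods
                breaches = 0
        history.append(instances)
    return history


# Requests per second per instance, one value per evaluation period (made up).
rps = [80, 120, 130, 125, 140, 150, 90, 130, 135, 140]
print(simulate_scale_up(rps, threshold=100, eval_periods=3,
                        scaling_amount=2, warmup_periods=2, start_instances=10))
```

Raising `eval_periods` delays the first reaction, while raising `scaling_amount` makes each reaction larger; the common mistakes later in the deck are different ways of getting that balance wrong.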
What metric to scale on?
Throughput
Pros
● Tracks a direct measure of work
● Linear scaling
● Predictable
Cons
● Thresholds tend to drift over time
● Prone to changes in request mixture
Resource utilization
Pros
● Requires less adjustment over time
Cons
● Less predictable
● More oscillation / jitter
Autoscaling on multiple metrics
Proceed with caution
● Harder to reason about scaling behavior
● Different metrics might contradict each other, causing oscillation
Typical Netflix configuration:
● Scale-up policy on throughput
● Scale-down policy on throughput
● Emergency scale-up policy on CPU, aka “the hammer rule”
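The three-policy layout can be pictured as a small decision table. The Python below is a sketch only: every threshold and adjustment is a hypothetical number, the metric names are invented, and evaluating policies in list order is a simplification of this example rather than a description of how AWS resolves policies that fire together.

```python
from dataclasses import dataclass


@dataclass
class Policy:
    name: str
    metric: str       # "rps_per_instance" or "cpu_percent" (hypothetical names)
    threshold: float
    direction: str    # "above" or "below"
    adjustment: int   # instances to add (positive) or remove (negative)


# All numbers are hypothetical; real values come from load testing the service.
POLICIES = [
    Policy("hammer", "cpu_percent", 80.0, "above", +10),
    Policy("scale-up", "rps_per_instance", 100.0, "above", +3),
    Policy("scale-down", "rps_per_instance", 60.0, "below", -1),
]


def decide(metrics):
    """Returns (policy name, adjustment) for the first policy that fires."""
    for p in POLICIES:
        value = metrics[p.metric]
        fired = value > p.threshold if p.direction == "above" else value < p.threshold
        if fired:
            return p.name, p.adjustment
    return "steady", 0


print(decide({"rps_per_instance": 110, "cpu_percent": 55}))
print(decide({"rps_per_instance": 70, "cpu_percent": 92}))
print(decide({"rps_per_instance": 40, "cpu_percent": 20}))
```

Only the emergency rule looks at CPU, so normal scaling stays easy to reason about, and the hammer covers the case where throughput stops being a good proxy for load.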
Well-behaved autoscaling
Common mistakes - “no rush” scaling
Problem: scaling amounts too small, cooldown too long
Effect: scaling lags behind the traffic flow. Not enough capacity at peak, capacity wasted in trough
Remedy: increase scaling amounts, migrate to step policies
Common mistakes - twitchy scaling
Problem: scale-up policy is too aggressive
Effect: unnecessary capacity churn
Remedy: reduce scale-up amount, increase the # of eval periods
Common mistakes - should I stay or should I go
Problem: -up and -down thresholds are too close to each other
Effect: constant capacity oscillation
Remedy: move -up and -down thresholds farther apart
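The oscillation is easy to reproduce with a toy model. The Python below assumes a constant total load spread evenly over the group and one-instance steps; the function name and all numbers are made up for illustration.

```python
def count_scaling_actions(total_rps, instances, up_threshold, down_threshold,
                          periods=20):
    """Counts capacity changes under a constant load (simplified model)."""
    actions = 0
    for _ in range(periods):
        per_instance = total_rps / instances
        if per_instance > up_threshold:
            instances += 1
            actions += 1
        elif per_instance < down_threshold and instances > 1:
            instances -= 1
            actions += 1
    return actions


# Constant 1000 requests/s across 10 instances; thresholds are made-up numbers.
close = count_scaling_actions(1000, 10, up_threshold=98, down_threshold=92)
apart = count_scaling_actions(1000, 10, up_threshold=98, down_threshold=70)
print(close, apart)
```

With the thresholds close together, adding one instance drops the per-instance load below the scale-down threshold, so the group flips between 10 and 11 instances every period. With the thresholds farther apart it scales once and settles.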
AWS target tracking - your best bet!
● Think of it as a step policy with auto-steps
● You can also think of it as a thermostat
● Accounts for the rate of change in monitored metric
● Pick a metric, set the target value and warmup time - that’s it!
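The thermostat analogy can be illustrated with a proportional rule: the further the metric is from its target, the larger the step. The Python below is a simplified model assumed for illustration, not the actual AWS target tracking implementation; the function name and the numbers are hypothetical.

```python
import math


def target_tracking_desired(current_instances, metric_value, target_value):
    """Capacity that would bring the per-instance metric back to the target."""
    if target_value <= 0:
        raise ValueError("target_value must be positive")
    return max(1, math.ceil(current_instances * metric_value / target_value))


# Target: 100 requests/s per instance (hypothetical), starting from 20 instances.
for observed in (100, 130, 250, 60):
    print(observed, target_tracking_desired(20, observed, 100))
```

A small overshoot produces a small adjustment and a large one produces a large adjustment, which is the "auto-steps" behavior without hand-tuned step boundaries.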
Step Target-tracking
Netflix
PMCs on the Cloud
Brendan
90% CPU utilization:
Busy | Waiting (“idle”)
Reality:
Busy | Waiting (“stalled”) | Waiting (“idle”)
# perf stat -a -- sleep 10
Performance counter stats for 'system wide':
80018.188438 task-clock (msec) # 8.000 CPUs utilized (100.00%)
7,562 context-switches # 0.095 K/sec (100.00%)
1,157 cpu-migrations # 0.014 K/sec (100.00%)
109,734 page-faults # 0.001 M/sec
<not supported> cycles
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
<not supported> instructions
<not supported> branches
<not supported> branch-misses
10.001715965 seconds time elapsed
Performance Monitoring Counters (PMCs) in most clouds
# perf stat -a -- sleep 10
Performance counter stats for 'system wide':
641320.173626 task-clock (msec) # 64.122 CPUs utilized [100.00%]
1,047,222 context-switches # 0.002 M/sec [100.00%]
83,420 cpu-migrations # 0.130 K/sec [100.00%]
38,905 page-faults # 0.061 K/sec
655,419,788,755 cycles # 1.022 GHz [75.02%]
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
536,830,399,277 instructions # 0.82 insns per cycle [75.02%]
97,103,651,128 branches # 151.412 M/sec [75.02%]
1,230,478,597 branch-misses # 1.27% of all branches [74.99%]
10.001622154 seconds time elapsed
AWS EC2 m4.16xl
Interpreting IPC & Actionable Items
IPC: Instructions Per Cycle (inverse of CPI)
● IPC < 1.0: likely memory stalled
○ Data usage and layout to improve CPU caching, memory locality.
○ Choose larger CPU caches, faster memory busses and interconnects.
● IPC > 1.0: likely instruction bound
○ Reduce code execution, eliminate unnecessary work, cache operations, improve algorithm order. Can analyze using CPU flame graphs.
○ Faster CPUs.
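As a worked example, the rule of thumb can be applied to the m4.16xl `perf stat` counts shown earlier. The Python below is a sketch; the function names are made up, and the 1.0 dividing line is the slide's heuristic rather than a hard limit.

```python
def ipc(instructions, cycles):
    """Instructions per cycle; the inverse of cycles per instruction (CPI)."""
    if cycles <= 0:
        raise ValueError("cycles must be positive")
    return instructions / cycles


def interpret(ipc_value):
    """Applies the slide's rule of thumb; 1.0 is a rough dividing line."""
    if ipc_value < 1.0:
        return "likely memory stalled"
    if ipc_value > 1.0:
        return "likely instruction bound"
    return "inconclusive"


# Counts taken from the m4.16xl perf stat output earlier in the deck.
value = ipc(instructions=536_830_399_277, cycles=655_419_788_755)
print(f"IPC = {value:.2f}: {interpret(value)}")
```

The result agrees with the "0.82 insns per cycle" that `perf` itself reports for that run.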
Intel Architectural PMCs
Event Name                    UMask   Event Select   Example Event Mask Mnemonic
UnHalted Core Cycles          00H     3CH            CPU_CLK_UNHALTED.THREAD_P
Instruction Retired           00H     C0H            INST_RETIRED.ANY_P
UnHalted Reference Cycles     01H     3CH            CPU_CLK_THREAD_UNHALTED.REF_XCLK
LLC Reference                 4FH     2EH            LONGEST_LAT_CACHE.REFERENCE
LLC Misses                    41H     2EH            LONGEST_LAT_CACHE.MISS
Branch Instruction Retired    00H     C4H            BR_INST_RETIRED.ALL_BRANCHES
Branch Misses Retired         00H     C5H            BR_MISP_RETIRED.ALL_BRANCHES
Now available in AWS EC2 on full dedicated hosts (e.g., m4.16xl, …)
# pmcarch 1
CYCLES INSTRUCTIONS IPC BR_RETIRED BR_MISPRED BMR% LLCREF LLCMISS LLC%
90755342002 64236243785 0.71 11760496978 174052359 1.48 1542464817 360223840 76.65
75815614312 59253317973 0.78 10665897008 158100874 1.48 1361315177 286800304 78.93
65164313496 53307631673 0.82 9538082731 137444723 1.44 1272163733 268851404 78.87
90820303023 70649824946 0.78 12672090735 181324730 1.43 1685112288 343977678 79.59
76341787799 50830491037 0.67 10542795714 143936677 1.37 1204703117 279162683 76.83
[...]
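The ratio columns can be recomputed from the raw counters in the same row. The formulas in the Python below are inferred from the numbers in the output above, not taken from the tool's source: in particular, LLC% only matches when read as a hit ratio (1 minus misses over references). The function name is made up.

```python
def derived_columns(cycles, instructions, br_retired, br_mispred,
                    llc_ref, llc_miss):
    """Recomputes the ratio columns from the raw counters of one pmcarch row."""
    return {
        "IPC": round(instructions / cycles, 2),
        "BMR%": round(100.0 * br_mispred / br_retired, 2),
        # Matches the output above only as a hit ratio, not a miss ratio.
        "LLC%": round(100.0 * (1 - llc_miss / llc_ref), 2),
    }


# First data row of the pmcarch output above.
row = derived_columns(90755342002, 64236243785, 11760496978, 174052359,
                      1542464817, 360223840)
print(row)
```

Reading the row this way: IPC is well under 1.0 while roughly a quarter of last-level cache references miss, which fits the "likely memory stalled" interpretation.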
tiptop - [root]
Tasks: 96 total, 3 displayed screen 0: default
PID [ %CPU] %SYS P Mcycle Minstr IPC %MISS %BMIS %BUS COMMAND
3897 35.3 28.5 4 274.06 178.23 0.65 0.06 0.00 0.0 java
1319+ 5.5 2.6 6 87.32 125.55 1.44 0.34 0.26 0.0 nm-applet
900 0.9 0.0 6 25.91 55.55 2.14 0.12 0.21 0.0 dbus-daemo
https://github.com/brendangregg/pmc-cloud-tools
A Study of Urban Area Plan for Pabna MunicipalityA Study of Urban Area Plan for Pabna Municipality
A Study of Urban Area Plan for Pabna Municipality
 
Tamil Call Girls Bhayandar WhatsApp +91-9930687706, Best Service
Tamil Call Girls Bhayandar WhatsApp +91-9930687706, Best ServiceTamil Call Girls Bhayandar WhatsApp +91-9930687706, Best Service
Tamil Call Girls Bhayandar WhatsApp +91-9930687706, Best Service
 
PE 459 LECTURE 2- natural gas basic concepts and properties
PE 459 LECTURE 2- natural gas basic concepts and propertiesPE 459 LECTURE 2- natural gas basic concepts and properties
PE 459 LECTURE 2- natural gas basic concepts and properties
 
Introduction to Data Visualization,Matplotlib.pdf
Introduction to Data Visualization,Matplotlib.pdfIntroduction to Data Visualization,Matplotlib.pdf
Introduction to Data Visualization,Matplotlib.pdf
 
S1S2 B.Arch MGU - HOA1&2 Module 3 -Temple Architecture of Kerala.pptx
S1S2 B.Arch MGU - HOA1&2 Module 3 -Temple Architecture of Kerala.pptxS1S2 B.Arch MGU - HOA1&2 Module 3 -Temple Architecture of Kerala.pptx
S1S2 B.Arch MGU - HOA1&2 Module 3 -Temple Architecture of Kerala.pptx
 
Call Girls in South Ex (delhi) call me [🔝9953056974🔝] escort service 24X7
Call Girls in South Ex (delhi) call me [🔝9953056974🔝] escort service 24X7Call Girls in South Ex (delhi) call me [🔝9953056974🔝] escort service 24X7
Call Girls in South Ex (delhi) call me [🔝9953056974🔝] escort service 24X7
 
Orlando’s Arnold Palmer Hospital Layout Strategy-1.pptx
Orlando’s Arnold Palmer Hospital Layout Strategy-1.pptxOrlando’s Arnold Palmer Hospital Layout Strategy-1.pptx
Orlando’s Arnold Palmer Hospital Layout Strategy-1.pptx
 

Netflix SRE perf meetup_slides

  • 4. Making the Internet fast is slow.
      ● Global Internet:
        ● faster (better networking)
        ● slower (broader reach, congestion)
      ● Don't wait for it, measure it and deal
      ● Working app > Feature rich app
  • 5. We need to know what the Internet looks like, without averages, seeing the full distribution.
  • 6. Logging Anti-Patterns
      ● Sampling
        ○ Missed data
        ○ Rare events
        ○ Problems aren't equal in Population
      ● Averages
        ○ Can't see the distribution
        ○ Outliers heavily distort: ∞, 0, negatives, errors
      Instead, use the client as a map-reducer and send up aggregated data, less often.
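A minimal sketch of the "client as a map-reducer" idea, assuming fixed histogram buckets shared by every client. The bucket bounds, the `ClientAggregator` class, and the payload shape are all hypothetical; the slides do not describe the actual wire format.

```python
import math
from collections import Counter

# Hypothetical bucket upper bounds in seconds; one extra overflow bucket follows.
BUCKET_BOUNDS = [0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0]


def bucket_index(seconds):
    """Return the index of the first bucket whose upper bound covers the sample."""
    for i, bound in enumerate(BUCKET_BOUNDS):
        if seconds <= bound:
            return i
    return len(BUCKET_BOUNDS)  # overflow bucket


class ClientAggregator:
    """Accumulates timings locally and emits one compact payload per flush."""

    def __init__(self):
        self.counts = Counter()
        self.errors = 0

    def record(self, seconds):
        # Missing, infinite, NaN, or negative timings are counted separately
        # instead of being folded into the distribution.
        if seconds is None or not math.isfinite(seconds) or seconds < 0:
            self.errors += 1
            return
        self.counts[bucket_index(seconds)] += 1

    def flush(self):
        """Return the aggregated payload and reset local state."""
        payload = {
            "counts": [self.counts.get(i, 0) for i in range(len(BUCKET_BOUNDS) + 1)],
            "errors": self.errors,
        }
        self.counts.clear()
        self.errors = 0
        return payload


def merge(payloads):
    """Server side: histograms with identical buckets merge by element-wise addition."""
    total = [0] * (len(BUCKET_BOUNDS) + 1)
    errors = 0
    for p in payloads:
        total = [a + b for a, b in zip(total, p["counts"])]
        errors += p["errors"]
    return {"counts": total, "errors": errors}
```

Because every client uses the same buckets, no sample is dropped and the server still sees the full distribution, at the cost of resolution inside each bucket.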
  • 7. Sizing up the Internet.
  • 9.
  • 10. ● Calculate the inverse empirical cumulative distribution function by math. Get median, 95th, etc.
        ○ ...or just use R, which is free and knows how to do it already:
          > library(HistogramTools)
          > iecdf <- HistToEcdf(histogram, method='linear', inverse=TRUE)
          > iecdf(0.5)
          [1] 0.7975309   # median
          > iecdf(0.95)
          [1] 4.65        # 95th percentile
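For readers who want to see the "by math" route, here is a minimal Python sketch of an inverse ECDF built from histogram breaks and counts with linear interpolation inside each bucket. It is not a port of HistogramTools; the function name and argument layout are assumptions, and R's interpolation details may differ.

```python
def inverse_ecdf(breaks, counts):
    """Build a function q -> value from histogram breaks (n+1) and counts (n)."""
    if len(breaks) != len(counts) + 1:
        raise ValueError("breaks must have one more entry than counts")
    total = sum(counts)
    if total <= 0:
        raise ValueError("histogram is empty")

    # Cumulative fraction of samples at each break: 0.0 at the first, 1.0 at the last.
    cum = [0.0]
    running = 0
    for c in counts:
        running += c
        cum.append(running / total)

    def iecdf(q):
        if not 0.0 <= q <= 1.0:
            raise ValueError("q must be in [0, 1]")
        for i, c in enumerate(counts):
            if c > 0 and q <= cum[i + 1]:
                # Interpolate linearly inside the bucket that contains q.
                frac = (q - cum[i]) / (cum[i + 1] - cum[i])
                return breaks[i] + frac * (breaks[i + 1] - breaks[i])
        return breaks[-1]

    return iecdf
```

With breaks `[0, 1, 2, 4]` and counts `[50, 30, 20]`, half of the samples fall at or below 1, so `inverse_ecdf(...)(0.5)` lands on the upper edge of the first bucket.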
  • 11.
  • 12.
  • 14. Better than debating opinions.
      Architecture is hard. Make it cheap to experiment where your users really are.
      "There's no way that the client makes that many requests."
      "No one really minds the spinner."
      "Why should we spend time on that instead of COOLFEATURE?"
      "We live in a 50ms world!"
  • 16. Interpret the data
      ● Visual → Numerical, need the IECDF for Percentiles
        ○ ƒ(0.50) = 50th (median)
        ○ ƒ(0.95) = 95th
      ● Cluster similar experiences to get pretty colors. (k-means, hierarchical, etc.)
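A minimal sketch of the clustering step, assuming each region has already been reduced to one number (for example its median DNS time). It runs plain k-means (Lloyd's algorithm) on scalars; the deterministic seeding and the function name are assumptions made to keep the example reproducible, not something the slides specify.

```python
def kmeans_1d(values, k, iterations=50):
    """Lloyd's algorithm on scalars with deterministic, evenly spread starting centers."""
    if not 1 <= k <= len(values):
        raise ValueError("k must be between 1 and len(values)")
    ordered = sorted(values)
    # Seed centers from evenly spaced positions in the sorted data (an assumption;
    # random or k-means++ seeding is more common in practice).
    centers = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]

    labels = [0] * len(values)
    for _ in range(iterations):
        # Assignment step: each value joins its nearest center.
        labels = [min(range(k), key=lambda j: abs(v - centers[j])) for v in values]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            new_centers.append(sum(members) / len(members) if members else centers[j])
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers
```

Called with medians such as `[0.05, 0.06, 0.07, 0.9, 1.1, 1.0]` and `k=2`, the fast and slow regions end up in separate groups, which is what the colors on the map would encode.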
  • 17.
  • 18.
  • 19.
  • 20.
  • 21. Practical Teleportation.
      ● Go there!
      ● Abstract analysis - hard
      ● Feeling reality is much simpler than looking at graphs.
      Build!
  • 23.
  • 24. Don't guess.
      Developing a model based on production data, without missing the distribution of samples (network, render, responsiveness), will lead to better software.
      Global reach doesn't need to be scary.
      @gcirino42 http://blogofsomeguy.com
  • 26.
  • 27. Problem & Motivation
      ● Real-user performance monitoring solution
      ● More insight into the App performance (as perceived by real users)
      ● Too many variables to trust synthetic tests and labs
      ● Prioritize work around App performance
      ● Track App improvement progress over time
      ● Detect issues, internal and external
  • 28. Device Diversity
      ● Netflix runs on all sorts of devices
      ● Smart TVs, Gaming Consoles, Mobile Phones, Cable TV boxes, ...
      ● Consistently evaluate performance
  • 29.
  • 30. What are we monitoring?
      ● User Actions (or things users do in the App)
      ● App Startup
      ● User Navigation
      ● Playing a Title
      ● Internal App metrics
  • 31. What are we measuring?
      ● When does the timer start and stop?
      ● Time-to-Interactive (TTI)
        ○ Interactive, even if some items were not fully loaded and rendered
      ● Time-to-Render (TTR)
        ○ Everything above the fold (visible without scrolling) is rendered
      ● Play Delay
      ● Meaningful for what we are monitoring
  • 32. High-dimensional Data
      ● Complex device categorization
      ● Geo regions, subregions, countries
      ● Highly granular network classifications
      ● High volume of A/B tests
      ● Different facets of the same user action
        ○ Cold, suspended and backgrounded App startups
        ○ Target view/page on App startup
  • 33.
  • 34.
  • 35.
  • 36. Data Sketches
      ● Data structures that approximately resemble a much larger data set
      ● Preserve essential features!
      ● Significantly smaller!
      ● Faster to operate on!
  • 37. t-Digest
      ● t-Digest data structure
      ● Rank-based statistics (such as quantiles)
      ● Parallel friendly (can be merged!)
      ● Very fast!
      ● Really accurate!
      https://github.com/tdunning/t-digest
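To illustrate why "can be merged" matters, here is a toy mergeable quantile sketch. It is deliberately not a t-digest: it swaps in a much simpler log-bucketed histogram, and the class name and `relative_error` parameter are hypothetical. The point is only that per-device or per-shard sketches can be combined without revisiting the raw samples.

```python
import math
from collections import Counter


class LogBucketSketch:
    """Toy mergeable quantile sketch: counts per geometric bucket (not a t-digest)."""

    def __init__(self, relative_error=0.01):
        # Buckets grow by a constant factor so the relative error stays bounded.
        self.gamma = (1 + relative_error) / (1 - relative_error)
        self.counts = Counter()
        self.n = 0

    def add(self, value):
        if value <= 0:
            raise ValueError("this toy sketch only accepts positive values")
        self.counts[math.ceil(math.log(value, self.gamma))] += 1
        self.n += 1

    def merge(self, other):
        """Fold another sketch into this one; the order of merging does not matter."""
        if other.gamma != self.gamma:
            raise ValueError("sketches must share the same bucket layout")
        self.counts.update(other.counts)  # Counter.update adds counts
        self.n += other.n

    def quantile(self, q):
        if self.n == 0:
            raise ValueError("sketch is empty")
        rank = q * (self.n - 1)
        seen = 0
        for key in sorted(self.counts):
            seen += self.counts[key]
            if seen > rank:
                # Representative value for the bucket (gamma^(key-1), gamma^key].
                return 2 * self.gamma ** key / (self.gamma + 1)
        return 2 * self.gamma ** max(self.counts) / (self.gamma + 1)
```

Two sketches built on disjoint halves of a data set merge into exactly the sketch that a single pass over all the data would have produced, so quantiles can be computed per country, per device, or globally from the same stored pieces.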
  • 39.
  • 40. iOS Median Comparison, Break by Country
  • 41. iOS Median Comparison, Break by Country + iPhone 6S Plus
  • 42. CDFs by UI Version
  • 46. Going Forward
      ● Resource utilization metrics
      ● Device profiling
        ○ Instrumenting client code
      ● Explore other visualizations
        ○ Frequency heat maps
      ● Connection between perceived performance, acquisition and retention
      @spiermar
  • 48. Savings!
      ● Mid-tier stateless services are ~2/3 of the total
      ● Savings - 30% of mid-tier footprint (roughly 30K instances)
        ○ Higher savings if we break it down by region
        ○ Even higher savings on services that scale well
  • 49. Why we autoscale - philosophical reasons
  • 50. Why we autoscale - pragmatic reasons
      ● Encoding
      ● Precompute
      ● Failover
      ● Red/black pushes
      ● Curing cancer**
      ● And more...
      ** Hack-day project
  • 51. Should you autoscale?
      Benefits
      ● On-demand capacity: direct $$ savings
      ● RI capacity: re-purposing spare capacity
      However, for each server group, beware of
      ● Uneven distribution of traffic
      ● Sticky traffic
      ● Bursty traffic
      ● Small ASG sizes (<10)
  • 52. Autoscaling impacts availability - true or false?
      * If done correctly
      Under-provisioning, however, can impact availability
      ● Autoscaling is not a problem
      ● The real problem is not knowing performance characteristics of the service
  • 53. AWS autoscaling mechanics
      Diagram labels: CloudWatch alarm, ASG scaling policy, Aggregated metric feed, Notification
      Tunables
      ● Metric
      ● Threshold
      ● # of eval periods
      ● Scaling amount
      ● Warmup time
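A rough sketch of how the tunables on this slide interact, under simplifying assumptions: the alarm fires only after `eval_periods` consecutive samples above `threshold`, and warmup is modelled as a pause during which no further scaling happens. The function names are hypothetical and this is not the CloudWatch or ASG API.

```python
def alarm_breaches(samples, threshold, eval_periods):
    """Yield True for each period in which the alarm would be firing.

    The alarm fires only after `eval_periods` consecutive samples exceed
    `threshold`, which filters out short spikes.
    """
    consecutive = 0
    for value in samples:
        consecutive = consecutive + 1 if value > threshold else 0
        yield consecutive >= eval_periods


def scale_up(current, samples, threshold, eval_periods, scaling_amount, warmup_periods):
    """Apply a simple scale-up policy; returns the capacity after each period."""
    capacity = current
    waiting = 0
    history = []
    for firing in alarm_breaches(samples, threshold, eval_periods):
        if waiting > 0:
            waiting -= 1  # new instances are still warming up (assumed to block scaling)
        elif firing:
            capacity += scaling_amount
            waiting = warmup_periods
        history.append(capacity)
    return history
```

Raising `eval_periods` makes the policy less twitchy, while a larger `scaling_amount` lets it catch up faster once it does fire.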
  • 54. What metric to scale on? (Throughput vs. Resource utilization)
      Pros
      ● Tracks a direct measure of work
      ● Linear scaling
      ● Predictable
      ● Requires less adjustment over time
      Cons
      ● Thresholds tend to drift over time
      ● Prone to changes in request mixture
      ● Less predictable
      ● More oscillation / jitter
  • 55. Autoscaling on multiple metrics
      Proceed with caution
      ● Harder to reason about scaling behavior
      ● Different metrics might contradict each other, causing oscillation
      Typical Netflix configuration:
      ● Scale-up policy on throughput
      ● Scale-down policy on throughput
      ● Emergency scale-up policy on CPU, aka "the hammer rule"
  • 57. Common mistakes - "no rush" scaling
      Problem: scaling amounts too small, cooldown too long
      Effect: scaling lags behind the traffic flow. Not enough capacity at peak, capacity wasted in trough
      Remedy: increase scaling amounts, migrate to step policies
  • 58. Common mistakes - twitchy scaling
      Problem: scale-up policy is too aggressive
      Effect: unnecessary capacity churn
      Remedy: reduce scale-up amount, increase the # of eval periods
  • 59. Common mistakes - should I stay or should I go
      Problem: -up and -down thresholds are too close to each other
      Effect: constant capacity oscillation
      Remedy: move -up and -down thresholds farther apart
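The "should I stay or should I go" case can be reproduced with a few lines of arithmetic. The sketch below is a toy model, assuming constant total traffic spread evenly over the group and a fixed step size; the function name and all numbers are illustrative.

```python
def simulate(total_rps, instances, up, down, step, ticks):
    """Count scaling actions under constant traffic for a given threshold pair."""
    actions = 0
    for _ in range(ticks):
        per_instance = total_rps / instances
        if per_instance > up:
            instances += step
            actions += 1
        elif per_instance < down and instances > step:
            instances -= step
            actions += 1
    return actions, instances


# Thresholds 95/90: scaling up by 2 drops load to ~83 RPS per instance, which is
# already below the scale-down threshold, so the group flaps on every tick.
print(simulate(1000, 10, up=95, down=90, step=2, ticks=20))

# Thresholds 95/70: one scale-up, then the load sits between the thresholds.
print(simulate(1000, 10, up=95, down=70, step=2, ticks=20))
```

The gap between the two thresholds has to be wider than the change in per-instance load that a single scaling step causes.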
  • 60. AWS target tracking - your best bet!
      ● Think of it as a step policy with auto-steps
      ● You can also think of it as a thermostat
      ● Accounts for the rate of change in monitored metric
      ● Pick a metric, set the target value and warmup time - that's it!
      Chart labels: Step, Target-tracking
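The thermostat analogy can be sketched as simple proportional control. This is an assumption for illustration only: the slides do not give AWS's algorithm, and unlike the real policy this version ignores the rate of change of the metric. The function name and the `min_size`/`max_size` defaults are hypothetical.

```python
import math


def target_tracking_capacity(current_capacity, metric_value, target_value,
                             min_size=1, max_size=1000):
    """Proportional estimate of the capacity that would bring the metric to target.

    Assumes the per-instance metric falls in proportion as instances are added,
    which is only reasonable for evenly balanced, stateless services.
    """
    if target_value <= 0:
        raise ValueError("target_value must be positive")
    desired = math.ceil(current_capacity * metric_value / target_value)
    return max(min_size, min(max_size, desired))
```

For example, 10 instances running at 150 RPS each against a target of 100 RPS would suggest 15 instances, which is the "auto-step" a hand-written step policy would otherwise have to enumerate.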
  • 61. Netflix PMCs on the Cloud Brendan
  • 64. Performance Monitoring Counters (PMCs) in most clouds

      # perf stat -a -- sleep 10

       Performance counter stats for 'system wide':

            80018.188438 task-clock (msec)         #    8.000 CPUs utilized   (100.00%)
                   7,562 context-switches          #    0.095 K/sec           (100.00%)
                   1,157 cpu-migrations            #    0.014 K/sec           (100.00%)
                 109,734 page-faults               #    0.001 M/sec
         <not supported> cycles
         <not supported> stalled-cycles-frontend
         <not supported> stalled-cycles-backend
         <not supported> instructions
         <not supported> branches
         <not supported> branch-misses

            10.001715965 seconds time elapsed
  • 65. AWS EC2 m4.16xl

      # perf stat -a -- sleep 10

       Performance counter stats for 'system wide':

           641320.173626 task-clock (msec)         #   64.122 CPUs utilized       [100.00%]
               1,047,222 context-switches          #    0.002 M/sec               [100.00%]
                  83,420 cpu-migrations            #    0.130 K/sec               [100.00%]
                  38,905 page-faults               #    0.061 K/sec
         655,419,788,755 cycles                    #    1.022 GHz                 [75.02%]
         <not supported> stalled-cycles-frontend
         <not supported> stalled-cycles-backend
         536,830,399,277 instructions              #    0.82  insns per cycle     [75.02%]
          97,103,651,128 branches                  #  151.412 M/sec               [75.02%]
           1,230,478,597 branch-misses             #    1.27% of all branches     [74.99%]

            10.001622154 seconds time elapsed
  • 66. Interpreting IPC & Actionable Items
      IPC: Instructions Per Cycle (inverse of CPI)
      ● IPC < 1.0: likely memory stalled
        ○ Data usage and layout to improve CPU caching, memory locality.
        ○ Choose larger CPU caches, faster memory buses and interconnects.
      ● IPC > 1.0: likely instruction bound
        ○ Reduce code execution, eliminate unnecessary work, cache operations, improve algorithm order. Can analyze using CPU flame graphs.
        ○ Faster CPUs.
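As a worked example, the rule of thumb can be applied to the instruction and cycle counts from the m4.16xl `perf stat` sample earlier in this deck. The helper names are hypothetical; the 1.0 cut-off is the slide's heuristic, not a hard limit.

```python
def ipc(instructions, cycles):
    """Instructions per cycle; the reciprocal is CPI."""
    if cycles <= 0:
        raise ValueError("cycles must be positive")
    return instructions / cycles


def interpret(ipc_value):
    """Apply the slide's rule of thumb around an IPC of 1.0."""
    if ipc_value < 1.0:
        return "likely memory stalled"
    if ipc_value > 1.0:
        return "likely instruction bound"
    return "inconclusive"


# Counter values from the m4.16xl `perf stat` sample in this deck.
value = ipc(instructions=536_830_399_277, cycles=655_419_788_755)
print(f"IPC = {value:.2f} -> {interpret(value)}")
# prints: IPC = 0.82 -> likely memory stalled
```

The result matches the "0.82 insns per cycle" that `perf stat` reports for the same run.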
  • 67. Intel Architectural PMCs

      Event Name                   Umask  Event Select  Example Event Mask Mnemonic
      UnHalted Core Cycles         00H    3CH           CPU_CLK_UNHALTED.THREAD_P
      Instruction Retired          00H    C0H           INST_RETIRED.ANY_P
      UnHalted Reference Cycles    01H    3CH           CPU_CLK_THREAD_UNHALTED.REF_XCLK
      LLC Reference                4FH    2EH           LONGEST_LAT_CACHE.REFERENCE
      LLC Misses                   41H    2EH           LONGEST_LAT_CACHE.MISS
      Branch Instruction Retired   00H    C4H           BR_INST_RETIRED.ALL_BRANCHES
      Branch Misses Retired        00H    C5H           BR_MISP_RETIRED.ALL_BRANCHES

      Now available in AWS EC2 on full dedicated hosts (eg, m4.16xl, …)
  • 68. # pmcarch 1
      CYCLES       INSTRUCTIONS  IPC   BR_RETIRED   BR_MISPRED  BMR%  LLCREF      LLCMISS    LLC%
      90755342002  64236243785   0.71  11760496978  174052359   1.48  1542464817  360223840  76.65
      75815614312  59253317973   0.78  10665897008  158100874   1.48  1361315177  286800304  78.93
      65164313496  53307631673   0.82  9538082731   137444723   1.44  1272163733  268851404  78.87
      90820303023  70649824946   0.78  12672090735  181324730   1.43  1685112288  343977678  79.59
      76341787799  50830491037   0.67  10542795714  143936677   1.37  1204703117  279162683  76.83
      [...]

      tiptop -                  [root]
      Tasks:  96 total,   3 displayed                screen  0: default

        PID [ %CPU] %SYS    P   Mcycle   Minstr   IPC  %MISS  %BMIS  %BUS COMMAND
       3897   35.3  28.5    4   274.06   178.23  0.65   0.06   0.00   0.0 java
      1319+    5.5   2.6    6    87.32   125.55  1.44   0.34   0.26   0.0 nm-applet
        900    0.9   0.0    6    25.91    55.55  2.14   0.12   0.21   0.0 dbus-daemo

      https://github.com/brendangregg/pmc-cloud-tools
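The derived columns in the `pmcarch` output can be recomputed from the raw counters. The formulas below are inferred from the sample numbers on this slide (they reproduce the first row), not taken from the tool's source, and the function name is hypothetical.

```python
def derived_columns(cycles, instructions, br_retired, br_mispred, llc_ref, llc_miss):
    """Recompute IPC, BMR% and LLC% from the raw counters in one output row."""
    return {
        "IPC": round(instructions / cycles, 2),
        # Branch misprediction ratio, as a percentage of retired branches.
        "BMR%": round(100.0 * br_mispred / br_retired, 2),
        # Last-level cache hit ratio: share of references that did not miss.
        "LLC%": round(100.0 * (1 - llc_miss / llc_ref), 2),
    }


# First row of the pmcarch sample output.
row = derived_columns(
    cycles=90755342002,
    instructions=64236243785,
    br_retired=11760496978,
    br_mispred=174052359,
    llc_ref=1542464817,
    llc_miss=360223840,
)
print(row)
```

Note that LLC% reads as a hit ratio here, so a higher value is better, while BMR% is a miss ratio, so a lower value is better.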