SlideShare a Scribd company logo
1 of 92
Download to read offline
Java Mixed-Mode
Flame Graphs
Brendan Gregg Senior Performance Architect
Oct	
  2015	
  
Understanding Java CPU usage
quickly and completely
Quickly
•  Via SSH and open source tools (covered in this talk)
•  Or using Netflix Vector GUI (also open source):
1.  Observe high CPU usage
2.  Generate a flame graph
Java Mixed-Mode Flame Graph
via Linux perf_events:
Completely
Java JVM
Kernel
GC
Messy House Fallacy
•  Don't overlook system code: kernel, libraries, etc.
Fallacy:	
  my	
  code	
  is	
  a	
  mess,	
  I	
  bet	
  yours	
  is	
  
immaculate,	
  therefore	
  the	
  bug	
  must	
  be	
  mine	
  
	
  
Reality:	
  everyone's	
  code	
  is	
  terrible	
  and	
  buggy	
  
Context
•  Over 60 million subscribers
–  Just launched in Spain!
•  AWS EC2 Linux cloud
•  FreeBSD CDN
•  Awesome place to work
Cloud
•  Tens of thousands of AWS EC2 instances
•  Mostly running Java applications (Oracle JVM)
Linux	
  (usually	
  Ubuntu)	
  
Java	
  (JDK	
  8)	
  
Tomcat	
  GC	
  and	
  
thread	
  
dump	
  
logging	
  
hystrix,	
  metrics	
  (Servo),	
  
health	
  check	
  
OpMonal	
  Apache,	
  
memcached,	
  Node.js,	
  
…	
  
Atlas,	
  S3	
  log	
  rotaMon,	
  
sar,	
  Trace,	
  perf,	
  stap,	
  
perf-­‐tools	
  
Vector,	
  pcp	
  
ApplicaMon	
  war	
  files,	
  
plaYorm,	
  base	
  servlet	
  
Why we need CPU profiling
•  Improving performance
–  Identify tuning targets
–  Incident response
–  Non-regression testing
–  Software evaluations
–  CPU workload
characterization
•  Cost savings
–  ASGs often scale on load
average (CPU), so CPU
usage is proportional to cost
Instance	
  
Instance	
  
Instance	
  
Scaling	
  Policy	
  
loadavg,	
  latency,	
  …	
  
	
  
CloudWatch,	
  Servo	
  
Auto	
  Scaling	
  
Group	
  
The Problem with Profilers
Java Profilers
Java
GC
Kernel,
libraries,
JVM
Java Profilers
•  Visibility
–  Java method execution
–  Object usage
–  GC logs
–  Custom Java context
•  Typical problems:
–  Sampling often happens at safety/yield points (skew)
–  Method tracing has massive observer effect
–  Misidentifies RUNNING as on-CPU (e.g., epoll)
–  Doesn't include or profile GC or JVM CPU time
–  Tree views not quick (proportional) to comprehend
•  Inaccurate (skewed) and incomplete profiles
System Profilers
Java Kernel
TCP/IP
GC
Idle
thread
Time
Locks epoll
JVM
System Profilers
•  Visibility
–  JVM (C++)
–  GC (C++)
–  libraries (C)
–  kernel (C)
•  Typical problems (x86):
–  Stacks missing for Java
–  Symbols missing for Java methods
•  Other architectures (e.g., SPARC) have fared better
•  Profile everything except Java
Workaround
•  Capture both Java and system profiles, and examine
side by side
•  An improvement, but Java context is often crucial for
interpreting system profiles
Java System
Java Mixed-Mode Flame Graph
Solution
Java JVM
Kernel
GC
Solution
•  Fix system profiling
–  Only way to see it all
•  Visibility is everything:
–  Java methods
–  JVM (C++)
–  GC (C++)
–  libraries (C)
–  kernel (C)
•  Minor Problems:
–  0-3% CPU overhead to enable frame pointers (usually <1%).
–  Symbol dumps can consume a burst of CPU
•  Complete and accurate (asynchronous) profiling
Java
JVM
Kernel
GC
Simple Production Example
1.  Poor performance,
and one CPU at 100%
2.  perf_events flame
graph shows JVM
stuck compiling
Another System Example
Exception handling consuming CPU
DEMO
FlameGraph_tomcat01.svg
Exonerating The System
•  From last week:
-  Frequent thread creation/
destruction assumed to be
consuming CPU resources.
Recode application?
-  A flame graph quantified this
CPU time: near zero
-  Time mostly other Java methods
Profiling GC
GC internals, visualized:
CPU Profiling
CPU Profiling
A
B
block interrupt
on-CPU off-CPU
A
B
A A
B
A
syscall
time
•  Record stacks at a timed interval: simple and effective
–  Pros: Low (deterministic) overhead
–  Cons: Coarse accuracy, but usually sufficient
stack
samples: A
Stack Traces
•  A code path snapshot. e.g., from jstack(1):
$ jstack 1819
[…]
"main" prio=10 tid=0x00007ff304009000
nid=0x7361 runnable [0x00007ff30d4f9000]
java.lang.Thread.State: RUNNABLE
at Func_abc.func_c(Func_abc.java:6)
at Func_abc.func_b(Func_abc.java:16)
at Func_abc.func_a(Func_abc.java:23)
at Func_abc.main(Func_abc.java:27)
running
parent
g.parent
g.g.paren
running
codepath
start
System Profilers
•  Linux
–  perf_events (aka "perf")
•  Oracle Solaris
–  DTrace
•  OS X
–  Instruments
•  Windows
–  XPerf
•  And many others…
Linux perf_events
•  Standard Linux profiler
–  Provides the perf command (multi-tool)
–  Usually pkg added by linux-tools-common, etc.
•  Features:
–  Timer-based sampling
–  Hardware events
–  Tracepoints
–  Dynamic tracing
•  Can sample stacks of (almost) everything on CPU
–  Can miss hard interrupt ISRs, but these should be near-zero. They can
be measured if needed (I wrote my own tools)
perf record Profiling
•  Stack profiling on all CPUs at 99 Hertz, then dump:
# perf record -F 99 -ag -- sleep 30
[ perf record: Woken up 9 times to write data ]
[ perf record: Captured and wrote 2.745 MB perf.data (~119930 samples) ]
# perf script
[…]
bash 13204 cpu-clock:
459c4c dequote_string (/root/bash-4.3/bash)
465c80 glob_expand_word_list (/root/bash-4.3/bash)
466569 expand_word_list_internal (/root/bash-4.3/bash)
465a13 expand_words (/root/bash-4.3/bash)
43bbf7 execute_simple_command (/root/bash-4.3/bash)
435f16 execute_command_internal (/root/bash-4.3/bash)
435580 execute_command (/root/bash-4.3/bash)
43a771 execute_while_or_until (/root/bash-4.3/bash)
43a636 execute_while_command (/root/bash-4.3/bash)
436129 execute_command_internal (/root/bash-4.3/bash)
435580 execute_command (/root/bash-4.3/bash)
420cd5 reader_loop (/root/bash-4.3/bash)
41ea58 main (/root/bash-4.3/bash)
7ff2294edec5 __libc_start_main (/lib/x86_64-linux-gnu/libc-2.19.so)
[… ~47,000 lines truncated …]
one
stack
sample
perf report Summary
•  Generates a call tree and combines samples:
# perf report -n -stdio
[…]
# Overhead Samples Command Shared Object Symbol
# ........ ............ ....... ................. .............................
#
20.42% 605 bash [kernel.kallsyms] [k] xen_hypercall_xen_version
|
--- xen_hypercall_xen_version
check_events
|
|--44.13%-- syscall_trace_enter
| tracesys
| |
| |--35.58%-- __GI___libc_fcntl
| | |
| | |--65.26%-- do_redirection_internal
| | | do_redirections
| | | execute_builtin_or_function
| | | execute_simple_command
[… ~13,000 lines truncated …]
call tree
summary
Flame Graphs
perf report Verbosity
•  Despite summarizing, output is still verbose
# perf report -n -stdio
[…]
# Overhead Samples Command Shared Object Symbol
# ........ ............ ....... ................. .............................
#
20.42% 605 bash [kernel.kallsyms] [k] xen_hypercall_xen_version
|
--- xen_hypercall_xen_version
check_events
|
|--44.13%-- syscall_trace_enter
| tracesys
| |
| |--35.58%-- __GI___libc_fcntl
| | |
| | |--65.26%-- do_redirection_internal
| | | do_redirections
| | | execute_builtin_or_function
| | | execute_simple_command
[… ~13,000 lines truncated …]
Full perf report Output
… as a Flame Graph
Flame Graphs
•  Flame Graphs:
–  x-axis: alphabetical stack sort, to maximize merging
–  y-axis: stack depth
–  color: random (default), or a dimension
•  Currently made from Perl + SVG + JavaScript
–  Multiple d3 versions are being developed
•  Easy to get working
–  http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html
–  Above commands are Linux; see URL for other OSes
git clone --depth 1 https://github.com/brendangregg/FlameGraph
cd FlameGraph
perf record -F 99 -a –g -- sleep 30
perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > perf.svg
Linux perf_events Workflow
perf stat perf record
perf report perf script
count events capture stacks
text UI dump profile
stackcollapse-perf.pl
flamegraph.pl
perf.data	
  
flame graph
visualization
perf list
list events
Typical
Workflow
Flame Graph Interpretation
a()
b() h()
c()
d()
e() f()
g()
i()
Flame Graph Interpretation (1/3)
Top edge shows who is running on-CPU,
and how much (width)
a()
b() h()
c()
d()
e() f()
g()
i()
Flame Graph Interpretation (2/3)
Top-down shows ancestry
e.g., from g():
h()
d()
e()
i()
a()
b()
c()
f()
g()
Flame Graph Interpretation (3/3)
a()
b() h()
c()
d()
e() f()
g()
i()
Widths are proportional to presence in samples
e.g., comparing b() to h() (incl. children)
Flame Graph Colors
•  Randomized by default
•  Can be used as a dimension. e.g.:
–  Mixed-mode flame graphs
–  Differential flame graphs
–  Search
Mixed-Mode Flame Graphs
•  Hues:
–  green == Java
–  red == system
–  yellow == C++
•  Intensity randomized
to differentiate frames
–  Or hashed based on
function name
Java JVM
Kernel
Mixed-Mode
Differential Flame Graphs
•  Hues:
–  red == more samples
–  blue == less samples
•  Intensity shows the
degree of difference
•  Used for comparing
two profiles
•  Also used for showing
other metrics: e.g., CPI
Differential
more less
Flame Graph Search
•  Color: magenta to show matched frames
search
button
Flame Charts
•  Flame charts: x-axis is time
•  Flame graphs: x-axis is population (maximize merging)
•  Final note: these are useful, but are not flame graphs
Stack Tracing
System Profiling Java on x86
•  For example,
using Linux perf
•  The stacks are
1 or 2 levels
deep, and have
junk values
# perf record –F 99 –a –g – sleep 30
# perf script
[…]
java 4579 cpu-clock:
ffffffff8172adff tracesys ([kernel.kallsyms])
7f4183bad7ce pthread_cond_timedwait@@GLIBC_2…
java 4579 cpu-clock:
7f417908c10b [unknown] (/tmp/perf-4458.map)
java 4579 cpu-clock:
7f4179101c97 [unknown] (/tmp/perf-4458.map)
java 4579 cpu-clock:
7f41792fc65f [unknown] (/tmp/perf-4458.map)
a2d53351ff7da603 [unknown] ([unknown])
java 4579 cpu-clock:
7f4179349aec [unknown] (/tmp/perf-4458.map)
java 4579 cpu-clock:
7f4179101d0f [unknown] (/tmp/perf-4458.map)
[…]
… as a Flame Graph
Broken Java stacks
(missing frame pointer)
Why Stacks are Broken
•  On x86 (x86_64), hotspot uses
the frame pointer register (RBP)
as general purpose
•  This "compiler optimization"
breaks (simple) stack walking
•  Once upon a time, x86 had fewer
registers, and this made much more sense
•  gcc provides -fno-omit-frame-pointer to avoid
doing this, but the JVM had no such option…
Fixing Stack Walking
Possibilities:
A.  Fix frame pointer-based stack walking (the default)
–  Pros: simple, supported by many tools
–  Cons: might cost a little extra CPU
B.  Use a custom walker (likely needing kernel support)
–  Pros: full stack walking (incl. inlining) & arguments
–  Cons: custom kernel code, can cost more CPU when in use
C.  Try libunwind and DWARF
–  Even feasible with JIT?
Our current preference is (A)
Hacking OpenJDK (1/2)
•  As a proof of concept, I hacked hotspot to support an
x86_64 frame pointer
--- openjdk8clean/hotspot/src/cpu/x86/vm/x86_64.ad 2014-03-04 …
+++ openjdk8/hotspot/src/cpu/x86/vm/x86_64.ad 2014-11-08 …
@@ -166,10 +166,9 @@
// 3) reg_class stack_slots( /* one chunk of stack-based "registers" */ )
//
-// Class for all pointer registers (including RSP)
+// Class for all pointer registers (including RSP, excluding RBP)
reg_class any_reg(RAX, RAX_H,
RDX, RDX_H,
- RBP, RBP_H,
RDI, RDI_H,
RSI, RSI_H,
RCX, RCX_H,
[...]
Remove RBP from
register pools
Hacking OpenJDK (2/2)
•  We used this patched version successfully for some limited
(and urgent) performance analysis
--- openjdk8clean/hotspot/src/cpu/x86/vm/macroAssembler_x86.cpp 2014-03-04…
+++ openjdk8/hotspot/src/cpu/x86/vm/macroAssembler_x86.cpp 2014-11-07 …
@@ -5236,6 +5236,7 @@
// We always push rbp, so that on return to interpreter rbp, will be
// restored correctly and we can correct the stack.
push(rbp);
+ mov(rbp, rsp);
// Remove word for ebp
framesize -= wordSize;
--- openjdk8clean/hotspot/src/cpu/x86/vm/c1_MacroAssembler_x86.cpp …
+++ openjdk8/hotspot/src/cpu/x86/vm/c1_MacroAssembler_x86.cpp …
[...]
Fix x86_64 function
prologues
-XX:+PreserveFramePointer
•  We shared our patch publicly
–  See "A hotspot patch for stack profiling (frame pointer)" on the
hotspot complier dev mailing list
–  It became JDK-8068945 for JDK 9 and JDK-8072465 for JDK 8,
and the -XX:+PreserveFramePointer option
•  Zoltán Majó (Oracle) took this on, rewrote it, and it is now:
–  In JDK 9
–  In JDK 8 update 60 build 19
–  Thanks to Zoltán, Oracle, and the other hotspot engineers for
helping get this done!
•  It might cost 0 – 3% CPU, depending on workload
Broken Java Stacks (before)
•  Check with "perf
script" to see stack
samples
•  These are 1 or 2
levels deep (junk
values)
# perf script
[…]
java 4579 cpu-clock:
ffffffff8172adff tracesys ([kernel.kallsyms])
7f4183bad7ce pthread_cond_timedwait@@GLIBC_2…
java 4579 cpu-clock:
7f417908c10b [unknown] (/tmp/perf-4458.map)
java 4579 cpu-clock:
7f4179101c97 [unknown] (/tmp/perf-4458.map)
java 4579 cpu-clock:
7f41792fc65f [unknown] (/tmp/perf-4458.map)
a2d53351ff7da603 [unknown] ([unknown])
java 4579 cpu-clock:
7f4179349aec [unknown] (/tmp/perf-4458.map)
java 4579 cpu-clock:
7f4179101d0f [unknown] (/tmp/perf-4458.map)
java 4579 cpu-clock:
7f417908c194 [unknown] (/tmp/perf-4458.map)
[…]
Fixed Java Stacks
•  With -XX:
+PreserveFramePointer
stacks are full, and
go all the way to
start_thread()
•  This is what the
CPUs are really
running: inlined
frames are not
present
# perf script
[…]
java 8131 cpu-clock:
7fff76f2dce1 [unknown] ([vdso])
7fd3173f7a93 os::javaTimeMillis() (/usr/lib/jvm…
7fd301861e46 [unknown] (/tmp/perf-8131.map)
7fd30184def8 [unknown] (/tmp/perf-8131.map)
7fd30174f544 [unknown] (/tmp/perf-8131.map)
7fd30175d3a8 [unknown] (/tmp/perf-8131.map)
7fd30166d51c [unknown] (/tmp/perf-8131.map)
7fd301750f34 [unknown] (/tmp/perf-8131.map)
7fd3016c2280 [unknown] (/tmp/perf-8131.map)
7fd301b02ec0 [unknown] (/tmp/perf-8131.map)
7fd3016f9888 [unknown] (/tmp/perf-8131.map)
7fd3016ece04 [unknown] (/tmp/perf-8131.map)
7fd30177783c [unknown] (/tmp/perf-8131.map)
7fd301600aa8 [unknown] (/tmp/perf-8131.map)
7fd301a4484c [unknown] (/tmp/perf-8131.map)
7fd3010072e0 [unknown] (/tmp/perf-8131.map)
7fd301007325 [unknown] (/tmp/perf-8131.map)
7fd301007325 [unknown] (/tmp/perf-8131.map)
7fd3010004e7 [unknown] (/tmp/perf-8131.map)
7fd3171df76a JavaCalls::call_helper(JavaValue*,…
7fd3171dce44 JavaCalls::call_virtual(JavaValue*…
7fd3171dd43a JavaCalls::call_virtual(JavaValue*…
7fd31721b6ce thread_entry(JavaThread*, Thread*)…
7fd3175389e0 JavaThread::thread_main_inner() (/…
7fd317538cb2 JavaThread::run() (/usr/lib/jvm/nf…
7fd3173f6f52 java_start(Thread*) (/usr/lib/jvm/…
7fd317a7e182 start_thread (/lib/x86_64-linux-gn…
Fixed Stacks Flame Graph
Java stacks
(but no symbols)
Stacks & Inlining
•  Frames may be missing (inlined)
•  Disabling inlining:
–  -XX:-Inline
–  Many more Java frames
–  Can be 80% slower!
•  May not be necessary
–  Inlined flame graphs often make
enough sense
–  Or tune -XX:MaxInlineSize and
-XX:InlineSmallCode a little to reveal more frames
•  Can even improve performance!
•  perf-map-agent (next) has experimental un-inline support
No inlining
Symbols
Missing Symbols
12.06% 62 sed sed [.] re_search_internal
|
--- re_search_internal
|
|--96.78%-- re_search_stub
| rpl_re_search
| match_regex
| do_subst
| execute_program
| process_files
| main
| __libc_start_main
71.79% 334 sed sed [.] 0x000000000001afc1
|
|--11.65%-- 0x40a447
| 0x40659a
| 0x408dd8
| 0x408ed1
| 0x402689
| 0x7fa1cd08aec5
broken
not broken
•  Missing symbols may show up as hex; e.g., Linux perf:
Fixing Symbols
•  For JIT'd code, Linux perf already looks for an
externally provided symbol file: /tmp/perf-PID.map, and
warns if it doesn't exist
•  This file can be created by a Java agent
# perf script
Failed to open /tmp/perf-8131.map, continuing without symbols
[…]
java 8131 cpu-clock:
7fff76f2dce1 [unknown] ([vdso])
7fd3173f7a93 os::javaTimeMillis() (/usr/lib/jvm…
7fd301861e46 [unknown] (/tmp/perf-8131.map)
[…]
Java Symbols for perf
•  perf-map-agent
–  https://github.com/jrudolph/perf-map-agent
–  Agent attaches and writes the /tmp file on demand (previous
versions attached on Java start, wrote continually)
–  Thanks Johannes Rudolph!
•  Use of a /tmp symbol file
–  Pros: simple, can be low overhead (snapshot on demand)
–  Cons: stale symbols
•  Using a symbol logger with perf instead
–  Patch by Stephane Eranian currently being discussed on
lkml; see "perf: add support for profiling jitted code"
Java Mixed-Mode Flame Graph
Stacks & Symbols
Java JVM
Kernel
GC
Stacks & Symbols (zoom)
Instructions
Instructions
1.  Check Java version
2.  Install perf-map-agent
3.  Set -XX:+PreserveFramePointer
4.  Profile Java
5.  Dump symbols
6.  Generate Mixed-Mode Flame Graph
Note these are unsupported: use at your own risk
Reference: http://techblog.netflix.com/2015/07/java-in-flames.html
1. Check Java Version
•  Need JDK8u60 or better
–  for -XX:+PreserveFramePointer
•  Upgrade Java if necessary
$ java -version
java version "1.8.0_60"
Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode)
2. Install perf-map-agent
•  Check https://github.com/jrudolph/perf-map-agent for the
latest instructions. e.g.:
$ sudo bash
# apt-get install -y cmake
# export JAVA_HOME=/usr/lib/jvm/java-8-oracle
# cd /usr/lib/jvm
# git clone --depth=1 https://github.com/jrudolph/perf-map-agent
# cd perf-map-agent
# cmake .
# make
3. Set -XX:+PreserveFramePointer
•  Needs to be set on Java startup
•  Check it is enabled (on Linux):
$ ps wwp `pgrep –n java` | grep PreserveFramePointer
4. Profile Java
•  Using Linux perf_events to profile all processes, at 99
Hertz, for 30 seconds (as root):
•  Just profile one PID (broken on some older kernels):
•  These create a perf.data file
# perf record -F 99 -a -g -- sleep 30
# perf record -F 99 -p PID -g -- sleep 30
5. Dump Symbols
•  See perf-map-agent docs for updated usage
•  e.g., as the same user as java:
•  perf-map-agent contains helper scripts. I wrote my own:
–  https://github.com/brendangregg/Misc/blob/master/java/jmaps
•  Dump symbols quickly after perf record to minimize stale
symbols. How I do it:
$ cd /usr/lib/jvm/perf-map-agent/out
$ java -cp attach-main.jar:$JAVA_HOME/lib/tools.jar 
net.virtualvoid.perf.AttachOnce PID
# perf record -F 99 -a -g -- sleep 30; jmaps
6. Generate a Mixed-Mode Flame Graph
•  Using my FlameGraph software:
–  perf script reads perf.data with /tmp/*.map
–  out.stacks01 is an intermediate file; can be handy to keep
•  Finally open flame01.svg in a browser
•  Check for newer flame graph implementations (e.g., d3)
# perf script > out.stacks01
# git clone --depth=1 https://github.com/brendangregg/FlameGraph
# cat out.stacks01 | ./FlameGraph/stackcollapse-perf.pl | 
./FlameGraph/flamegraph.pl --color=java --hash > flame01.svg
Automation
Netflix Vector
Netflix Vector
Near real-time,
per-second metrics
Flame Graphs
Select
Metrics
Select Instance
Netflix Vector
•  Open source, on-demand, instance analysis tool
–  https://github.com/netflix/vector
•  Shows various real-time metrics
•  Flame graph support currently in development
–  Automating previous steps
–  Using it internally already
–  Also developing a new d3 front end
DEMO
d3-flame-graph
Advanced Analysis
Linux perf_events Coverage
… all possible with Java stacks
Advanced Flame Graphs
•  Examples:
–  Page faults
–  Context switches
–  Disk I/O requests
–  TCP events
–  CPU cache misses
–  CPI
•  Any event issued in synchronous Java context
Synchronous Java Context
•  Java thread still on-CPU, and event is directly triggered
•  Examples:
–  Disk I/O requests issued directly by Java à yes
•  direct reads, sync writes, page faults
–  Disk I/O completion interrupts à no*
–  Disk I/O requests triggered async, e.g., readahead à no*
* can be made yes by tracing and associating context
Page Faults
•  Show what triggered main memory (resident) to grow:
•  "fault" as (physical) main memory is allocated on-
demand, when a virtual page is first populated
•  Low overhead tool to solve some types of memory leak
# perf record -e page-faults -p PID -g -- sleep 120
RES column in top(1) grows
because
Page Fault Flame Graph
GC
Java code
epoll
Context Switches
•  Show why Java blocked and stopped running on-CPU:
•  Identifies locks, I/O, sleeps
–  If code path shouldn't block and looks random, it's an involuntary context switch. I
could filter these, but you should have solved them beforehand (CPU load).
•  e.g., was used to understand framework differences:
# perf record -e context-switches -p PID -g -- sleep 5
vs
rxNetty Tomcat
Context Switch Flame Graph (1/2)
rxNetty
epoll futex
Context Switch Flame Graph (2/2)
Tomcat sys_poll
futex
Disk I/O Requests
•  Shows who issued disk I/O (sync reads & writes):
•  e.g.: page faults in GC? This JVM has swapped out!:
# perf record -e block:block_rq_insert -a -g -- sleep 60
GC
TCP Events
•  TCP transmit, using dynamic tracing:
•  Note: can be high overhead for high packet rates
–  For the current perf trace, dump, post-process cycle
•  Can also trace TCP connect & accept (lower overhead)
•  TCP receive is async
–  Could trace via socket read
# perf probe tcp_sendmsg
# perf record -e probe:tcp_sendmsg -a -g -- sleep 1; jmaps
# perf script -f comm,pid,tid,cpu,time,event,ip,sym,dso,trace > out.stacks
# perf probe --del tcp_sendmsg
TCP Send Flame Graph
kernel
Java
JVM
Only one code-path
taken in this example
ab (client process)
CPU Cache Misses
•  In this example, sampling via Last Level Cache loads:
•  -c is the count (samples
once per count)
•  Use other CPU counters to
sample hits, misses, stalls
# perf record -e LLC-loads -c 10000 -a -g -- sleep 5; jmaps
# perf script -f comm,pid,tid,cpu,time,event,ip,sym,dso > out.stacks
One Last Example
•  Back to a
mixed-mode
CPU flame graph
•  What else can we
show with color?
CPI Flame Graph
•  Cycles Per
Instruction!
–  red == instruction
heavy
–  blue == cycle
heavy (likely mem
stall cycles)
zoomed:
Links & References
•  Flame Graphs
–  http://www.brendangregg.com/flamegraphs.html
–  http://techblog.netflix.com/2015/07/java-in-flames.html
–  http://techblog.netflix.com/2014/11/nodejs-in-flames.html
–  http://www.brendangregg.com/blog/2014-11-09/differential-flame-graphs.html
•  Linux perf_events
–  https://perf.wiki.kernel.org/index.php/Main_Page
–  http://www.brendangregg.com/perf.html
–  http://www.brendangregg.com/blog/2015-02-27/linux-profiling-at-netflix.html
•  Netflix Vector
–  https://github.com/netflix/vector
–  http://techblog.netflix.com/2015/04/introducing-vector-netflixs-on-host.html
•  JDK tickets
–  JDK8: https://bugs.openjdk.java.net/browse/JDK-8072465
–  JDK9: https://bugs.openjdk.java.net/browse/JDK-8068945
•  hprof: http://www.brendangregg.com/blog/2014-06-09/java-cpu-sampling-using-hprof.html
Thanks
•  Questions?
•  http://techblog.netflix.com
•  http://slideshare.net/brendangregg
•  http://www.brendangregg.com
•  bgregg@netflix.com
•  @brendangregg
Oct	
  2015	
  

More Related Content

What's hot

Container Performance Analysis
Container Performance AnalysisContainer Performance Analysis
Container Performance AnalysisBrendan Gregg
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudNoritaka Sekiyama
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsDataWorks Summit
 
Velocity 2015 linux perf tools
Velocity 2015 linux perf toolsVelocity 2015 linux perf tools
Velocity 2015 linux perf toolsBrendan Gregg
 
JVM JIT compilation overview by Vladimir Ivanov
JVM JIT compilation overview by Vladimir IvanovJVM JIT compilation overview by Vladimir Ivanov
JVM JIT compilation overview by Vladimir IvanovZeroTurnaround
 
WALT vs PELT : Redux - SFO17-307
WALT vs PELT : Redux  - SFO17-307WALT vs PELT : Redux  - SFO17-307
WALT vs PELT : Redux - SFO17-307Linaro
 
OSNoise Tracer: Who Is Stealing My CPU Time?
OSNoise Tracer: Who Is Stealing My CPU Time?OSNoise Tracer: Who Is Stealing My CPU Time?
OSNoise Tracer: Who Is Stealing My CPU Time?ScyllaDB
 
Linux BPF Superpowers
Linux BPF SuperpowersLinux BPF Superpowers
Linux BPF SuperpowersBrendan Gregg
 
HBase Application Performance Improvement
HBase Application Performance ImprovementHBase Application Performance Improvement
HBase Application Performance ImprovementBiju Nair
 
Top 5 Mistakes When Writing Spark Applications by Mark Grover and Ted Malaska
Top 5 Mistakes When Writing Spark Applications by Mark Grover and Ted MalaskaTop 5 Mistakes When Writing Spark Applications by Mark Grover and Ted Malaska
Top 5 Mistakes When Writing Spark Applications by Mark Grover and Ted MalaskaSpark Summit
 
Hadoop Meetup Jan 2019 - Router-Based Federation and Storage Tiering
Hadoop Meetup Jan 2019 - Router-Based Federation and Storage TieringHadoop Meetup Jan 2019 - Router-Based Federation and Storage Tiering
Hadoop Meetup Jan 2019 - Router-Based Federation and Storage TieringErik Krogen
 
HKG15-505: Power Management interactions with OP-TEE and Trusted Firmware
HKG15-505: Power Management interactions with OP-TEE and Trusted FirmwareHKG15-505: Power Management interactions with OP-TEE and Trusted Firmware
HKG15-505: Power Management interactions with OP-TEE and Trusted FirmwareLinaro
 
Cloudera Impala Internals
Cloudera Impala InternalsCloudera Impala Internals
Cloudera Impala InternalsDavid Groozman
 
LISA2019 Linux Systems Performance
LISA2019 Linux Systems PerformanceLISA2019 Linux Systems Performance
LISA2019 Linux Systems PerformanceBrendan Gregg
 
Physical Memory Management.pdf
Physical Memory Management.pdfPhysical Memory Management.pdf
Physical Memory Management.pdfAdrian Huang
 
The Linux Block Layer - Built for Fast Storage
The Linux Block Layer - Built for Fast StorageThe Linux Block Layer - Built for Fast Storage
The Linux Block Layer - Built for Fast StorageKernel TLV
 
Optimizing Kubernetes Resource Requests/Limits for Cost-Efficiency and Latenc...
Optimizing Kubernetes Resource Requests/Limits for Cost-Efficiency and Latenc...Optimizing Kubernetes Resource Requests/Limits for Cost-Efficiency and Latenc...
Optimizing Kubernetes Resource Requests/Limits for Cost-Efficiency and Latenc...Henning Jacobs
 
Batch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergBatch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergFlink Forward
 

What's hot (20)

Container Performance Analysis
Container Performance AnalysisContainer Performance Analysis
Container Performance Analysis
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
HBase Low Latency
HBase Low LatencyHBase Low Latency
HBase Low Latency
 
Velocity 2015 linux perf tools
Velocity 2015 linux perf toolsVelocity 2015 linux perf tools
Velocity 2015 linux perf tools
 
JVM JIT compilation overview by Vladimir Ivanov
JVM JIT compilation overview by Vladimir IvanovJVM JIT compilation overview by Vladimir Ivanov
JVM JIT compilation overview by Vladimir Ivanov
 
WALT vs PELT : Redux - SFO17-307
WALT vs PELT : Redux  - SFO17-307WALT vs PELT : Redux  - SFO17-307
WALT vs PELT : Redux - SFO17-307
 
OSNoise Tracer: Who Is Stealing My CPU Time?
OSNoise Tracer: Who Is Stealing My CPU Time?OSNoise Tracer: Who Is Stealing My CPU Time?
OSNoise Tracer: Who Is Stealing My CPU Time?
 
Linux BPF Superpowers
Linux BPF SuperpowersLinux BPF Superpowers
Linux BPF Superpowers
 
HBase Application Performance Improvement
HBase Application Performance ImprovementHBase Application Performance Improvement
HBase Application Performance Improvement
 
Top 5 Mistakes When Writing Spark Applications by Mark Grover and Ted Malaska
Top 5 Mistakes When Writing Spark Applications by Mark Grover and Ted MalaskaTop 5 Mistakes When Writing Spark Applications by Mark Grover and Ted Malaska
Top 5 Mistakes When Writing Spark Applications by Mark Grover and Ted Malaska
 
Hadoop Meetup Jan 2019 - Router-Based Federation and Storage Tiering
Hadoop Meetup Jan 2019 - Router-Based Federation and Storage TieringHadoop Meetup Jan 2019 - Router-Based Federation and Storage Tiering
Hadoop Meetup Jan 2019 - Router-Based Federation and Storage Tiering
 
HKG15-505: Power Management interactions with OP-TEE and Trusted Firmware
HKG15-505: Power Management interactions with OP-TEE and Trusted FirmwareHKG15-505: Power Management interactions with OP-TEE and Trusted Firmware
HKG15-505: Power Management interactions with OP-TEE and Trusted Firmware
 
Cloudera Impala Internals
Cloudera Impala InternalsCloudera Impala Internals
Cloudera Impala Internals
 
LISA2019 Linux Systems Performance
LISA2019 Linux Systems PerformanceLISA2019 Linux Systems Performance
LISA2019 Linux Systems Performance
 
Physical Memory Management.pdf
Physical Memory Management.pdfPhysical Memory Management.pdf
Physical Memory Management.pdf
 
Introduction to SLURM
Introduction to SLURMIntroduction to SLURM
Introduction to SLURM
 
The Linux Block Layer - Built for Fast Storage
The Linux Block Layer - Built for Fast StorageThe Linux Block Layer - Built for Fast Storage
The Linux Block Layer - Built for Fast Storage
 
Optimizing Kubernetes Resource Requests/Limits for Cost-Efficiency and Latenc...
Optimizing Kubernetes Resource Requests/Limits for Cost-Efficiency and Latenc...Optimizing Kubernetes Resource Requests/Limits for Cost-Efficiency and Latenc...
Optimizing Kubernetes Resource Requests/Limits for Cost-Efficiency and Latenc...
 
Batch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergBatch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & Iceberg
 

Similar to Java Mixed-Mode Flame Graphs Reveal CPU Usage

Linux Profiling at Netflix
Linux Profiling at NetflixLinux Profiling at Netflix
Linux Profiling at NetflixBrendan Gregg
 
USENIX ATC 2017: Visualizing Performance with Flame Graphs
USENIX ATC 2017: Visualizing Performance with Flame GraphsUSENIX ATC 2017: Visualizing Performance with Flame Graphs
USENIX ATC 2017: Visualizing Performance with Flame GraphsBrendan Gregg
 
Flame Graphs for MySQL DBAs - FOSDEM 2022 MySQL Devroom
Flame Graphs for MySQL DBAs - FOSDEM 2022 MySQL DevroomFlame Graphs for MySQL DBAs - FOSDEM 2022 MySQL Devroom
Flame Graphs for MySQL DBAs - FOSDEM 2022 MySQL DevroomValeriy Kravchuk
 
Debugging node in prod
Debugging node in prodDebugging node in prod
Debugging node in prodYunong Xiao
 
Performance Profiling in Rust
Performance Profiling in RustPerformance Profiling in Rust
Performance Profiling in RustInfluxData
 
Profiling your Applications using the Linux Perf Tools
Profiling your Applications using the Linux Perf ToolsProfiling your Applications using the Linux Perf Tools
Profiling your Applications using the Linux Perf ToolsemBO_Conference
 
Systemtap
SystemtapSystemtap
SystemtapFeng Yu
 
Quantifying Container Runtime Performance: OSCON 2017 Open Container Day
Quantifying Container Runtime Performance: OSCON 2017 Open Container DayQuantifying Container Runtime Performance: OSCON 2017 Open Container Day
Quantifying Container Runtime Performance: OSCON 2017 Open Container DayPhil Estes
 
Driver Debugging Basics
Driver Debugging BasicsDriver Debugging Basics
Driver Debugging BasicsBala Subra
 
Week1 Electronic System-level ESL Design and SystemC Begin
Week1 Electronic System-level ESL Design and SystemC BeginWeek1 Electronic System-level ESL Design and SystemC Begin
Week1 Electronic System-level ESL Design and SystemC Begin敬倫 林
 
Apache Submarine: Unified Machine Learning Platform
Apache Submarine: Unified Machine Learning PlatformApache Submarine: Unified Machine Learning Platform
Apache Submarine: Unified Machine Learning PlatformWangda Tan
 
Kernel Recipes 2017 - Using Linux perf at Netflix - Brendan Gregg
Kernel Recipes 2017 - Using Linux perf at Netflix - Brendan GreggKernel Recipes 2017 - Using Linux perf at Netflix - Brendan Gregg
Kernel Recipes 2017 - Using Linux perf at Netflix - Brendan GreggAnne Nicolas
 
What’s Slowing Down Your Kafka Pipeline? With Ruizhe Cheng and Pete Stevenson...
What’s Slowing Down Your Kafka Pipeline? With Ruizhe Cheng and Pete Stevenson...What’s Slowing Down Your Kafka Pipeline? With Ruizhe Cheng and Pete Stevenson...
What’s Slowing Down Your Kafka Pipeline? With Ruizhe Cheng and Pete Stevenson...HostedbyConfluent
 
Cray XT Porting, Scaling, and Optimization Best Practices
Cray XT Porting, Scaling, and Optimization Best PracticesCray XT Porting, Scaling, and Optimization Best Practices
Cray XT Porting, Scaling, and Optimization Best PracticesJeff Larkin
 
Performance Analysis Tools for Linux Kernel
Performance Analysis Tools for Linux KernelPerformance Analysis Tools for Linux Kernel
Performance Analysis Tools for Linux Kernellcplcp1
 
Direct Code Execution - LinuxCon Japan 2014
Direct Code Execution - LinuxCon Japan 2014Direct Code Execution - LinuxCon Japan 2014
Direct Code Execution - LinuxCon Japan 2014Hajime Tazaki
 
ContainerDays Boston 2015: "CoreOS: Building the Layers of the Scalable Clust...
ContainerDays Boston 2015: "CoreOS: Building the Layers of the Scalable Clust...ContainerDays Boston 2015: "CoreOS: Building the Layers of the Scalable Clust...
ContainerDays Boston 2015: "CoreOS: Building the Layers of the Scalable Clust...DynamicInfraDays
 
Virtual platform
Virtual platformVirtual platform
Virtual platformsean chen
 
[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf
[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf
[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdfSteve Caron
 
The New Systems Performance
The New Systems PerformanceThe New Systems Performance
The New Systems PerformanceBrendan Gregg
 

Similar to Java Mixed-Mode Flame Graphs Reveal CPU Usage (20)

Linux Profiling at Netflix
Linux Profiling at NetflixLinux Profiling at Netflix
Linux Profiling at Netflix
 
USENIX ATC 2017: Visualizing Performance with Flame Graphs
USENIX ATC 2017: Visualizing Performance with Flame GraphsUSENIX ATC 2017: Visualizing Performance with Flame Graphs
USENIX ATC 2017: Visualizing Performance with Flame Graphs
 
Flame Graphs for MySQL DBAs - FOSDEM 2022 MySQL Devroom
Flame Graphs for MySQL DBAs - FOSDEM 2022 MySQL DevroomFlame Graphs for MySQL DBAs - FOSDEM 2022 MySQL Devroom
Flame Graphs for MySQL DBAs - FOSDEM 2022 MySQL Devroom
 
Debugging node in prod
Debugging node in prodDebugging node in prod
Debugging node in prod
 
Performance Profiling in Rust
Performance Profiling in RustPerformance Profiling in Rust
Performance Profiling in Rust
 
Profiling your Applications using the Linux Perf Tools
Profiling your Applications using the Linux Perf ToolsProfiling your Applications using the Linux Perf Tools
Profiling your Applications using the Linux Perf Tools
 
Systemtap
SystemtapSystemtap
Systemtap
 
Quantifying Container Runtime Performance: OSCON 2017 Open Container Day
Quantifying Container Runtime Performance: OSCON 2017 Open Container DayQuantifying Container Runtime Performance: OSCON 2017 Open Container Day
Quantifying Container Runtime Performance: OSCON 2017 Open Container Day
 
Driver Debugging Basics
Driver Debugging BasicsDriver Debugging Basics
Driver Debugging Basics
 
Week1 Electronic System-level ESL Design and SystemC Begin
Week1 Electronic System-level ESL Design and SystemC BeginWeek1 Electronic System-level ESL Design and SystemC Begin
Week1 Electronic System-level ESL Design and SystemC Begin
 
Apache Submarine: Unified Machine Learning Platform
Apache Submarine: Unified Machine Learning PlatformApache Submarine: Unified Machine Learning Platform
Apache Submarine: Unified Machine Learning Platform
 
Kernel Recipes 2017 - Using Linux perf at Netflix - Brendan Gregg
Kernel Recipes 2017 - Using Linux perf at Netflix - Brendan GreggKernel Recipes 2017 - Using Linux perf at Netflix - Brendan Gregg
Kernel Recipes 2017 - Using Linux perf at Netflix - Brendan Gregg
 
What’s Slowing Down Your Kafka Pipeline? With Ruizhe Cheng and Pete Stevenson...
What’s Slowing Down Your Kafka Pipeline? With Ruizhe Cheng and Pete Stevenson...What’s Slowing Down Your Kafka Pipeline? With Ruizhe Cheng and Pete Stevenson...
What’s Slowing Down Your Kafka Pipeline? With Ruizhe Cheng and Pete Stevenson...
 
Cray XT Porting, Scaling, and Optimization Best Practices
Cray XT Porting, Scaling, and Optimization Best PracticesCray XT Porting, Scaling, and Optimization Best Practices
Cray XT Porting, Scaling, and Optimization Best Practices
 
Performance Analysis Tools for Linux Kernel
Performance Analysis Tools for Linux KernelPerformance Analysis Tools for Linux Kernel
Performance Analysis Tools for Linux Kernel
 
Direct Code Execution - LinuxCon Japan 2014
Direct Code Execution - LinuxCon Japan 2014Direct Code Execution - LinuxCon Japan 2014
Direct Code Execution - LinuxCon Japan 2014
 
ContainerDays Boston 2015: "CoreOS: Building the Layers of the Scalable Clust...
ContainerDays Boston 2015: "CoreOS: Building the Layers of the Scalable Clust...ContainerDays Boston 2015: "CoreOS: Building the Layers of the Scalable Clust...
ContainerDays Boston 2015: "CoreOS: Building the Layers of the Scalable Clust...
 
Virtual platform
Virtual platformVirtual platform
Virtual platform
 
[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf
[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf
[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf
 
The New Systems Performance
The New Systems PerformanceThe New Systems Performance
The New Systems Performance
 

More from Brendan Gregg

YOW2021 Computing Performance
YOW2021 Computing PerformanceYOW2021 Computing Performance
YOW2021 Computing PerformanceBrendan Gregg
 
IntelON 2021 Processor Benchmarking
IntelON 2021 Processor BenchmarkingIntelON 2021 Processor Benchmarking
IntelON 2021 Processor BenchmarkingBrendan Gregg
 
Performance Wins with eBPF: Getting Started (2021)
Performance Wins with eBPF: Getting Started (2021)Performance Wins with eBPF: Getting Started (2021)
Performance Wins with eBPF: Getting Started (2021)Brendan Gregg
 
Computing Performance: On the Horizon (2021)
Computing Performance: On the Horizon (2021)Computing Performance: On the Horizon (2021)
Computing Performance: On the Horizon (2021)Brendan Gregg
 
BPF Internals (eBPF)
BPF Internals (eBPF)BPF Internals (eBPF)
BPF Internals (eBPF)Brendan Gregg
 
Performance Wins with BPF: Getting Started
Performance Wins with BPF: Getting StartedPerformance Wins with BPF: Getting Started
Performance Wins with BPF: Getting StartedBrendan Gregg
 
YOW2020 Linux Systems Performance
YOW2020 Linux Systems PerformanceYOW2020 Linux Systems Performance
YOW2020 Linux Systems PerformanceBrendan Gregg
 
re:Invent 2019 BPF Performance Analysis at Netflix
re:Invent 2019 BPF Performance Analysis at Netflixre:Invent 2019 BPF Performance Analysis at Netflix
re:Invent 2019 BPF Performance Analysis at NetflixBrendan Gregg
 
UM2019 Extended BPF: A New Type of Software
UM2019 Extended BPF: A New Type of SoftwareUM2019 Extended BPF: A New Type of Software
UM2019 Extended BPF: A New Type of SoftwareBrendan Gregg
 
LPC2019 BPF Tracing Tools
LPC2019 BPF Tracing ToolsLPC2019 BPF Tracing Tools
LPC2019 BPF Tracing ToolsBrendan Gregg
 
LSFMM 2019 BPF Observability
LSFMM 2019 BPF ObservabilityLSFMM 2019 BPF Observability
LSFMM 2019 BPF ObservabilityBrendan Gregg
 
YOW2018 CTO Summit: Working at netflix
YOW2018 CTO Summit: Working at netflixYOW2018 CTO Summit: Working at netflix
YOW2018 CTO Summit: Working at netflixBrendan Gregg
 
eBPF Perf Tools 2019
eBPF Perf Tools 2019eBPF Perf Tools 2019
eBPF Perf Tools 2019Brendan Gregg
 
YOW2018 Cloud Performance Root Cause Analysis at Netflix
YOW2018 Cloud Performance Root Cause Analysis at NetflixYOW2018 Cloud Performance Root Cause Analysis at Netflix
YOW2018 Cloud Performance Root Cause Analysis at NetflixBrendan Gregg
 
NetConf 2018 BPF Observability
NetConf 2018 BPF ObservabilityNetConf 2018 BPF Observability
NetConf 2018 BPF ObservabilityBrendan Gregg
 
ATO Linux Performance 2018
ATO Linux Performance 2018ATO Linux Performance 2018
ATO Linux Performance 2018Brendan Gregg
 
Linux Performance 2018 (PerconaLive keynote)
Linux Performance 2018 (PerconaLive keynote)Linux Performance 2018 (PerconaLive keynote)
Linux Performance 2018 (PerconaLive keynote)Brendan Gregg
 
How Netflix Tunes EC2 Instances for Performance
How Netflix Tunes EC2 Instances for PerformanceHow Netflix Tunes EC2 Instances for Performance
How Netflix Tunes EC2 Instances for PerformanceBrendan Gregg
 

More from Brendan Gregg (20)

YOW2021 Computing Performance
YOW2021 Computing PerformanceYOW2021 Computing Performance
YOW2021 Computing Performance
 
IntelON 2021 Processor Benchmarking
IntelON 2021 Processor BenchmarkingIntelON 2021 Processor Benchmarking
IntelON 2021 Processor Benchmarking
 
Performance Wins with eBPF: Getting Started (2021)
Performance Wins with eBPF: Getting Started (2021)Performance Wins with eBPF: Getting Started (2021)
Performance Wins with eBPF: Getting Started (2021)
 
Computing Performance: On the Horizon (2021)
Computing Performance: On the Horizon (2021)Computing Performance: On the Horizon (2021)
Computing Performance: On the Horizon (2021)
 
BPF Internals (eBPF)
BPF Internals (eBPF)BPF Internals (eBPF)
BPF Internals (eBPF)
 
Performance Wins with BPF: Getting Started
Performance Wins with BPF: Getting StartedPerformance Wins with BPF: Getting Started
Performance Wins with BPF: Getting Started
 
YOW2020 Linux Systems Performance
YOW2020 Linux Systems PerformanceYOW2020 Linux Systems Performance
YOW2020 Linux Systems Performance
 
re:Invent 2019 BPF Performance Analysis at Netflix
re:Invent 2019 BPF Performance Analysis at Netflixre:Invent 2019 BPF Performance Analysis at Netflix
re:Invent 2019 BPF Performance Analysis at Netflix
 
UM2019 Extended BPF: A New Type of Software
UM2019 Extended BPF: A New Type of SoftwareUM2019 Extended BPF: A New Type of Software
UM2019 Extended BPF: A New Type of Software
 
LPC2019 BPF Tracing Tools
LPC2019 BPF Tracing ToolsLPC2019 BPF Tracing Tools
LPC2019 BPF Tracing Tools
 
LSFMM 2019 BPF Observability
LSFMM 2019 BPF ObservabilityLSFMM 2019 BPF Observability
LSFMM 2019 BPF Observability
 
YOW2018 CTO Summit: Working at netflix
YOW2018 CTO Summit: Working at netflixYOW2018 CTO Summit: Working at netflix
YOW2018 CTO Summit: Working at netflix
 
eBPF Perf Tools 2019
eBPF Perf Tools 2019eBPF Perf Tools 2019
eBPF Perf Tools 2019
 
YOW2018 Cloud Performance Root Cause Analysis at Netflix
YOW2018 Cloud Performance Root Cause Analysis at NetflixYOW2018 Cloud Performance Root Cause Analysis at Netflix
YOW2018 Cloud Performance Root Cause Analysis at Netflix
 
BPF Tools 2017
BPF Tools 2017BPF Tools 2017
BPF Tools 2017
 
NetConf 2018 BPF Observability
NetConf 2018 BPF ObservabilityNetConf 2018 BPF Observability
NetConf 2018 BPF Observability
 
FlameScope 2018
FlameScope 2018FlameScope 2018
FlameScope 2018
 
ATO Linux Performance 2018
ATO Linux Performance 2018ATO Linux Performance 2018
ATO Linux Performance 2018
 
Linux Performance 2018 (PerconaLive keynote)
Linux Performance 2018 (PerconaLive keynote)Linux Performance 2018 (PerconaLive keynote)
Linux Performance 2018 (PerconaLive keynote)
 
How Netflix Tunes EC2 Instances for Performance
How Netflix Tunes EC2 Instances for PerformanceHow Netflix Tunes EC2 Instances for Performance
How Netflix Tunes EC2 Instances for Performance
 

Recently uploaded

How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterMydbops
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...itnewsafrica
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxGenerative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxfnnc6jmgwh
 
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesMuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesManik S Magar
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...itnewsafrica
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsRavi Sanghani
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality AssuranceInflectra
 
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS:  6 Ways to Automate Your Data IntegrationBridging Between CAD & GIS:  6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integrationmarketing932765
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityIES VE
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationKnoldus Inc.
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Nikki Chapple
 

Recently uploaded (20)

How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL Router
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxGenerative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
 
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesMuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS:  6 Ways to Automate Your Data IntegrationBridging Between CAD & GIS:  6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog Presentation
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
 

Java Mixed-Mode Flame Graphs Reveal CPU Usage

  • 1. Java Mixed-Mode Flame Graphs Brendan Gregg Senior Performance Architect Oct  2015  
  • 2. Understanding Java CPU usage quickly and completely
  • 3. Quickly •  Via SSH and open source tools (covered in this talk) •  Or using Netflix Vector GUI (also open source): 1.  Observe high CPU usage 2.  Generate a flame graph
  • 4. Java Mixed-Mode Flame Graph via Linux perf_events: Completely Java JVM Kernel GC
  • 5. Messy House Fallacy •  Don't overlook system code: kernel, libraries, etc. Fallacy:  my  code  is  a  mess,  I  bet  yours  is   immaculate,  therefore  the  bug  must  be  mine     Reality:  everyone's  code  is  terrible  and  buggy  
  • 7. •  Over 60 million subscribers –  Just launched in Spain! •  AWS EC2 Linux cloud •  FreeBSD CDN •  Awesome place to work
  • 8. Cloud •  Tens of thousands of AWS EC2 instances •  Mostly running Java applications (Oracle JVM) Linux  (usually  Ubuntu)   Java  (JDK  8)   Tomcat  GC  and   thread   dump   logging   hystrix,  metrics  (Servo),   health  check   OpMonal  Apache,   memcached,  Node.js,   …   Atlas,  S3  log  rotaMon,   sar,  Trace,  perf,  stap,   perf-­‐tools   Vector,  pcp   ApplicaMon  war  files,   plaYorm,  base  servlet  
  • 9. Why we need CPU profiling •  Improving performance –  Identify tuning targets –  Incident response –  Non-regression testing –  Software evaluations –  CPU workload characterization •  Cost savings –  ASGs often scale on load average (CPU), so CPU usage is proportional to cost Instance   Instance   Instance   Scaling  Policy   loadavg,  latency,  …     CloudWatch,  Servo   Auto  Scaling   Group  
  • 10. The Problem with Profilers
  • 12. Java Profilers •  Visibility –  Java method execution –  Object usage –  GC logs –  Custom Java context •  Typical problems: –  Sampling often happens at safety/yield points (skew) –  Method tracing has massive observer effect –  Misidentifies RUNNING as on-CPU (e.g., epoll) –  Doesn't include or profile GC or JVM CPU time –  Tree views not quick (proportional) to comprehend •  Inaccurate (skewed) and incomplete profiles
  • 14. System Profilers •  Visibility –  JVM (C++) –  GC (C++) –  libraries (C) –  kernel (C) •  Typical problems (x86): –  Stacks missing for Java –  Symbols missing for Java methods •  Other architectures (e.g., SPARC) have fared better •  Profile everything except Java
  • 15. Workaround •  Capture both Java and system profiles, and examine side by side •  An improvement, but Java context is often crucial for interpreting system profiles Java System
  • 16. Java Mixed-Mode Flame Graph Solution Java JVM Kernel GC
  • 17. Solution •  Fix system profiling –  Only way to see it all •  Visibility is everything: –  Java methods –  JVM (C++) –  GC (C++) –  libraries (C) –  kernel (C) •  Minor Problems: –  0-3% CPU overhead to enable frame pointers (usually <1%). –  Symbol dumps can consume a burst of CPU •  Complete and accurate (asynchronous) profiling Java JVM Kernel GC
  • 18. Simple Production Example 1.  Poor performance, and one CPU at 100% 2.  perf_events flame graph shows JVM stuck compiling
  • 19. Another System Example Exception handling consuming CPU
  • 21. Exonerating The System •  From last week: -  Frequent thread creation/ destruction assumed to be consuming CPU resources. Recode application? -  A flame graph quantified this CPU time: near zero -  Time mostly other Java methods
  • 24. CPU Profiling A B block interrupt on-CPU off-CPU A B A A B A syscall time •  Record stacks at a timed interval: simple and effective –  Pros: Low (deterministic) overhead –  Cons: Coarse accuracy, but usually sufficient stack samples: A
  • 25. Stack Traces •  A code path snapshot. e.g., from jstack(1): $ jstack 1819 […] "main" prio=10 tid=0x00007ff304009000 nid=0x7361 runnable [0x00007ff30d4f9000] java.lang.Thread.State: RUNNABLE at Func_abc.func_c(Func_abc.java:6) at Func_abc.func_b(Func_abc.java:16) at Func_abc.func_a(Func_abc.java:23) at Func_abc.main(Func_abc.java:27) running parent g.parent g.g.paren running codepath start
  • 26. System Profilers •  Linux –  perf_events (aka "perf") •  Oracle Solaris –  DTrace •  OS X –  Instruments •  Windows –  XPerf •  And many others…
  • 27. Linux perf_events •  Standard Linux profiler –  Provides the perf command (multi-tool) –  Usually pkg added by linux-tools-common, etc. •  Features: –  Timer-based sampling –  Hardware events –  Tracepoints –  Dynamic tracing •  Can sample stacks of (almost) everything on CPU –  Can miss hard interrupt ISRs, but these should be near-zero. They can be measured if needed (I wrote my own tools)
  • 28. perf record Profiling •  Stack profiling on all CPUs at 99 Hertz, then dump: # perf record -F 99 -ag -- sleep 30 [ perf record: Woken up 9 times to write data ] [ perf record: Captured and wrote 2.745 MB perf.data (~119930 samples) ] # perf script […] bash 13204 cpu-clock: 459c4c dequote_string (/root/bash-4.3/bash) 465c80 glob_expand_word_list (/root/bash-4.3/bash) 466569 expand_word_list_internal (/root/bash-4.3/bash) 465a13 expand_words (/root/bash-4.3/bash) 43bbf7 execute_simple_command (/root/bash-4.3/bash) 435f16 execute_command_internal (/root/bash-4.3/bash) 435580 execute_command (/root/bash-4.3/bash) 43a771 execute_while_or_until (/root/bash-4.3/bash) 43a636 execute_while_command (/root/bash-4.3/bash) 436129 execute_command_internal (/root/bash-4.3/bash) 435580 execute_command (/root/bash-4.3/bash) 420cd5 reader_loop (/root/bash-4.3/bash) 41ea58 main (/root/bash-4.3/bash) 7ff2294edec5 __libc_start_main (/lib/x86_64-linux-gnu/libc-2.19.so) [… ~47,000 lines truncated …] one stack sample
  • 29. perf report Summary •  Generates a call tree and combines samples: # perf report -n -stdio […] # Overhead Samples Command Shared Object Symbol # ........ ............ ....... ................. ............................. # 20.42% 605 bash [kernel.kallsyms] [k] xen_hypercall_xen_version | --- xen_hypercall_xen_version check_events | |--44.13%-- syscall_trace_enter | tracesys | | | |--35.58%-- __GI___libc_fcntl | | | | | |--65.26%-- do_redirection_internal | | | do_redirections | | | execute_builtin_or_function | | | execute_simple_command [… ~13,000 lines truncated …] call tree summary
  • 31. perf report Verbosity •  Despite summarizing, output is still verbose # perf report -n -stdio […] # Overhead Samples Command Shared Object Symbol # ........ ............ ....... ................. ............................. # 20.42% 605 bash [kernel.kallsyms] [k] xen_hypercall_xen_version | --- xen_hypercall_xen_version check_events | |--44.13%-- syscall_trace_enter | tracesys | | | |--35.58%-- __GI___libc_fcntl | | | | | |--65.26%-- do_redirection_internal | | | do_redirections | | | execute_builtin_or_function | | | execute_simple_command [… ~13,000 lines truncated …]
  • 33. … as a Flame Graph
  • 34. Flame Graphs •  Flame Graphs: –  x-axis: alphabetical stack sort, to maximize merging –  y-axis: stack depth –  color: random (default), or a dimension •  Currently made from Perl + SVG + JavaScript –  Multiple d3 versions are being developed •  Easy to get working –  http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html –  Above commands are Linux; see URL for other OSes git clone --depth 1 https://github.com/brendangregg/FlameGraph cd FlameGraph perf record -F 99 -a –g -- sleep 30 perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > perf.svg
  • 35. Linux perf_events Workflow perf stat perf record perf report perf script count events capture stacks text UI dump profile stackcollapse-perf.pl flamegraph.pl perf.data   flame graph visualization perf list list events Typical Workflow
  • 36. Flame Graph Interpretation a() b() h() c() d() e() f() g() i()
  • 37. Flame Graph Interpretation (1/3) Top edge shows who is running on-CPU, and how much (width) a() b() h() c() d() e() f() g() i()
  • 38. Flame Graph Interpretation (2/3) Top-down shows ancestry e.g., from g(): h() d() e() i() a() b() c() f() g()
  • 39. Flame Graph Interpretation (3/3) a() b() h() c() d() e() f() g() i() Widths are proportional to presence in samples e.g., comparing b() to h() (incl. children)
  • 40. Flame Graph Colors •  Randomized by default •  Can be used as a dimension. e.g.: –  Mixed-mode flame graphs –  Differential flame graphs –  Search
  • 41. Mixed-Mode Flame Graphs •  Hues: –  green == Java –  red == system –  yellow == C++ •  Intensity randomized to differentiate frames –  Or hashed based on function name Java JVM Kernel Mixed-Mode
  • 42. Differential Flame Graphs •  Hues: –  red == more samples –  blue == less samples •  Intensity shows the degree of difference •  Used for comparing two profiles •  Also used for showing other metrics: e.g., CPI Differential more less
  • 43. Flame Graph Search •  Color: magenta to show matched frames search button
  • 44. Flame Charts •  Flame charts: x-axis is time •  Flame graphs: x-axis is population (maximize merging) •  Final note: these are useful, but are not flame graphs
  • 46. System Profiling Java on x86 •  For example, using Linux perf •  The stacks are 1 or 2 levels deep, and have junk values # perf record –F 99 –a –g – sleep 30 # perf script […] java 4579 cpu-clock: ffffffff8172adff tracesys ([kernel.kallsyms]) 7f4183bad7ce pthread_cond_timedwait@@GLIBC_2… java 4579 cpu-clock: 7f417908c10b [unknown] (/tmp/perf-4458.map) java 4579 cpu-clock: 7f4179101c97 [unknown] (/tmp/perf-4458.map) java 4579 cpu-clock: 7f41792fc65f [unknown] (/tmp/perf-4458.map) a2d53351ff7da603 [unknown] ([unknown]) java 4579 cpu-clock: 7f4179349aec [unknown] (/tmp/perf-4458.map) java 4579 cpu-clock: 7f4179101d0f [unknown] (/tmp/perf-4458.map) […]
  • 47. … as a Flame Graph Broken Java stacks (missing frame pointer)
  • 48. Why Stacks are Broken •  On x86 (x86_64), hotspot uses the frame pointer register (RBP) as general purpose •  This "compiler optimization" breaks (simple) stack walking •  Once upon a time, x86 had fewer registers, and this made much more sense •  gcc provides -fno-omit-frame-pointer to avoid doing this, but the JVM had no such option…
  • 49. Fixing Stack Walking Possibilities: A.  Fix frame pointer-based stack walking (the default) –  Pros: simple, supported by many tools –  Cons: might cost a little extra CPU B.  Use a custom walker (likely needing kernel support) –  Pros: full stack walking (incl. inlining) & arguments –  Cons: custom kernel code, can cost more CPU when in use C.  Try libunwind and DWARF –  Even feasible with JIT? Our current preference is (A)
  • 50. Hacking OpenJDK (1/2) •  As a proof of concept, I hacked hotspot to support an x86_64 frame pointer --- openjdk8clean/hotspot/src/cpu/x86/vm/x86_64.ad 2014-03-04 … +++ openjdk8/hotspot/src/cpu/x86/vm/x86_64.ad 2014-11-08 … @@ -166,10 +166,9 @@ // 3) reg_class stack_slots( /* one chunk of stack-based "registers" */ ) // -// Class for all pointer registers (including RSP) +// Class for all pointer registers (including RSP, excluding RBP) reg_class any_reg(RAX, RAX_H, RDX, RDX_H, - RBP, RBP_H, RDI, RDI_H, RSI, RSI_H, RCX, RCX_H, [...] Remove RBP from register pools
  • 51. Hacking OpenJDK (2/2) •  We used this patched version successfully for some limited (and urgent) performance analysis --- openjdk8clean/hotspot/src/cpu/x86/vm/macroAssembler_x86.cpp 2014-03-04… +++ openjdk8/hotspot/src/cpu/x86/vm/macroAssembler_x86.cpp 2014-11-07 … @@ -5236,6 +5236,7 @@ // We always push rbp, so that on return to interpreter rbp, will be // restored correctly and we can correct the stack. push(rbp); + mov(rbp, rsp); // Remove word for ebp framesize -= wordSize; --- openjdk8clean/hotspot/src/cpu/x86/vm/c1_MacroAssembler_x86.cpp … +++ openjdk8/hotspot/src/cpu/x86/vm/c1_MacroAssembler_x86.cpp … [...] Fix x86_64 function prologues
  • 52. -XX:+PreserveFramePointer •  We shared our patch publicly –  See "A hotspot patch for stack profiling (frame pointer)" on the hotspot complier dev mailing list –  It became JDK-8068945 for JDK 9 and JDK-8072465 for JDK 8, and the -XX:+PreserveFramePointer option •  Zoltán Majó (Oracle) took this on, rewrote it, and it is now: –  In JDK 9 –  In JDK 8 update 60 build 19 –  Thanks to Zoltán, Oracle, and the other hotspot engineers for helping get this done! •  It might cost 0 – 3% CPU, depending on workload
  • 53. Broken Java Stacks (before) •  Check with "perf script" to see stack samples •  These are 1 or 2 levels deep (junk values) # perf script […] java 4579 cpu-clock: ffffffff8172adff tracesys ([kernel.kallsyms]) 7f4183bad7ce pthread_cond_timedwait@@GLIBC_2… java 4579 cpu-clock: 7f417908c10b [unknown] (/tmp/perf-4458.map) java 4579 cpu-clock: 7f4179101c97 [unknown] (/tmp/perf-4458.map) java 4579 cpu-clock: 7f41792fc65f [unknown] (/tmp/perf-4458.map) a2d53351ff7da603 [unknown] ([unknown]) java 4579 cpu-clock: 7f4179349aec [unknown] (/tmp/perf-4458.map) java 4579 cpu-clock: 7f4179101d0f [unknown] (/tmp/perf-4458.map) java 4579 cpu-clock: 7f417908c194 [unknown] (/tmp/perf-4458.map) […]
  • 54. Fixed Java Stacks •  With -XX: +PreserveFramePointer stacks are full, and go all the way to start_thread() •  This is what the CPUs are really running: inlined frames are not present # perf script […] java 8131 cpu-clock: 7fff76f2dce1 [unknown] ([vdso]) 7fd3173f7a93 os::javaTimeMillis() (/usr/lib/jvm… 7fd301861e46 [unknown] (/tmp/perf-8131.map) 7fd30184def8 [unknown] (/tmp/perf-8131.map) 7fd30174f544 [unknown] (/tmp/perf-8131.map) 7fd30175d3a8 [unknown] (/tmp/perf-8131.map) 7fd30166d51c [unknown] (/tmp/perf-8131.map) 7fd301750f34 [unknown] (/tmp/perf-8131.map) 7fd3016c2280 [unknown] (/tmp/perf-8131.map) 7fd301b02ec0 [unknown] (/tmp/perf-8131.map) 7fd3016f9888 [unknown] (/tmp/perf-8131.map) 7fd3016ece04 [unknown] (/tmp/perf-8131.map) 7fd30177783c [unknown] (/tmp/perf-8131.map) 7fd301600aa8 [unknown] (/tmp/perf-8131.map) 7fd301a4484c [unknown] (/tmp/perf-8131.map) 7fd3010072e0 [unknown] (/tmp/perf-8131.map) 7fd301007325 [unknown] (/tmp/perf-8131.map) 7fd301007325 [unknown] (/tmp/perf-8131.map) 7fd3010004e7 [unknown] (/tmp/perf-8131.map) 7fd3171df76a JavaCalls::call_helper(JavaValue*,… 7fd3171dce44 JavaCalls::call_virtual(JavaValue*… 7fd3171dd43a JavaCalls::call_virtual(JavaValue*… 7fd31721b6ce thread_entry(JavaThread*, Thread*)… 7fd3175389e0 JavaThread::thread_main_inner() (/… 7fd317538cb2 JavaThread::run() (/usr/lib/jvm/nf… 7fd3173f6f52 java_start(Thread*) (/usr/lib/jvm/… 7fd317a7e182 start_thread (/lib/x86_64-linux-gn…
  • 55. Fixed Stacks Flame Graph Java stacks (but no symbols)
  • 56. Stacks & Inlining •  Frames may be missing (inlined) •  Disabling inlining: –  -XX:-Inline –  Many more Java frames –  Can be 80% slower! •  May not be necessary –  Inlined flame graphs often make enough sense –  Or tune -XX:MaxInlineSize and -XX:InlineSmallCode a little to reveal more frames •  Can even improve performance! •  perf-map-agent (next) has experimental un-inline support No inlining
  • 58. Missing Symbols 12.06% 62 sed sed [.] re_search_internal | --- re_search_internal | |--96.78%-- re_search_stub | rpl_re_search | match_regex | do_subst | execute_program | process_files | main | __libc_start_main 71.79% 334 sed sed [.] 0x000000000001afc1 | |--11.65%-- 0x40a447 | 0x40659a | 0x408dd8 | 0x408ed1 | 0x402689 | 0x7fa1cd08aec5 broken not broken •  Missing symbols may show up as hex; e.g., Linux perf:
  • 59. Fixing Symbols •  For JIT'd code, Linux perf already looks for an externally provided symbol file: /tmp/perf-PID.map, and warns if it doesn't exist •  This file can be created by a Java agent # perf script Failed to open /tmp/perf-8131.map, continuing without symbols […] java 8131 cpu-clock: 7fff76f2dce1 [unknown] ([vdso]) 7fd3173f7a93 os::javaTimeMillis() (/usr/lib/jvm… 7fd301861e46 [unknown] (/tmp/perf-8131.map) […]
  • 60. Java Symbols for perf •  perf-map-agent –  https://github.com/jrudolph/perf-map-agent –  Agent attaches and writes the /tmp file on demand (previous versions attached on Java start, wrote continually) –  Thanks Johannes Rudolph! •  Use of a /tmp symbol file –  Pros: simple, can be low overhead (snapshot on demand) –  Cons: stale symbols •  Using a symbol logger with perf instead –  Patch by Stephane Eranian currently being discussed on lkml; see "perf: add support for profiling jitted code"
  • 61. Java Mixed-Mode Flame Graph Stacks & Symbols Java JVM Kernel GC
  • 64. Instructions 1.  Check Java version 2.  Install perf-map-agent 3.  Set -XX:+PreserveFramePointer 4.  Profile Java 5.  Dump symbols 6.  Generate Mixed-Mode Flame Graph Note these are unsupported: use at your own risk Reference: http://techblog.netflix.com/2015/07/java-in-flames.html
  • 65. 1. Check Java Version •  Need JDK8u60 or better –  for -XX:+PreserveFramePointer •  Upgrade Java if necessary $ java -version java version "1.8.0_60" Java(TM) SE Runtime Environment (build 1.8.0_60-b27) Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode)
  • 66. 2. Install perf-map-agent •  Check https://github.com/jrudolph/perf-map-agent for the latest instructions. e.g.: $ sudo bash # apt-get install -y cmake # export JAVA_HOME=/usr/lib/jvm/java-8-oracle # cd /usr/lib/jvm # git clone --depth=1 https://github.com/jrudolph/perf-map-agent # cd perf-map-agent # cmake . # make
  • 67. 3. Set -XX:+PreserveFramePointer •  Needs to be set on Java startup •  Check it is enabled (on Linux): $ ps wwp `pgrep –n java` | grep PreserveFramePointer
  • 68. 4. Profile Java •  Using Linux perf_events to profile all processes, at 99 Hertz, for 30 seconds (as root): •  Just profile one PID (broken on some older kernels): •  These create a perf.data file # perf record -F 99 -a -g -- sleep 30 # perf record -F 99 -p PID -g -- sleep 30
  • 69. 5. Dump Symbols •  See perf-map-agent docs for updated usage •  e.g., as the same user as java: •  perf-map-agent contains helper scripts. I wrote my own: –  https://github.com/brendangregg/Misc/blob/master/java/jmaps •  Dump symbols quickly after perf record to minimize stale symbols. How I do it: $ cd /usr/lib/jvm/perf-map-agent/out $ java -cp attach-main.jar:$JAVA_HOME/lib/tools.jar net.virtualvoid.perf.AttachOnce PID # perf record -F 99 -a -g -- sleep 30; jmaps
  • 70. 6. Generate a Mixed-Mode Flame Graph •  Using my FlameGraph software: –  perf script reads perf.data with /tmp/*.map –  out.stacks01 is an intermediate file; can be handy to keep •  Finally open flame01.svg in a browser •  Check for newer flame graph implementations (e.g., d3) # perf script > out.stacks01 # git clone --depth=1 https://github.com/brendangregg/FlameGraph # cat out.stacks01 | ./FlameGraph/stackcollapse-perf.pl | ./FlameGraph/flamegraph.pl --color=java --hash > flame01.svg
  • 73. Netflix Vector Near real-time, per-second metrics Flame Graphs Select Metrics Select Instance
  • 74. Netflix Vector •  Open source, on-demand, instance analysis tool –  https://github.com/netflix/vector •  Shows various real-time metrics •  Flame graph support currently in development –  Automating previous steps –  Using it internally already –  Also developing a new d3 front end
  • 77. Linux perf_events Coverage … all possible with Java stacks
  • 78. Advanced Flame Graphs •  Examples: –  Page faults –  Context switches –  Disk I/O requests –  TCP events –  CPU cache misses –  CPI •  Any event issued in synchronous Java context
  • 79. Synchronous Java Context •  Java thread still on-CPU, and event is directly triggered •  Examples: –  Disk I/O requests issued directly by Java à yes •  direct reads, sync writes, page faults –  Disk I/O completion interrupts à no* –  Disk I/O requests triggered async, e.g., readahead à no* * can be made yes by tracing and associating context
  • 80. Page Faults •  Show what triggered main memory (resident) to grow: •  "fault" as (physical) main memory is allocated on- demand, when a virtual page is first populated •  Low overhead tool to solve some types of memory leak # perf record -e page-faults -p PID -g -- sleep 120 RES column in top(1) grows because
  • 81. Page Fault Flame Graph GC Java code epoll
  • 82. Context Switches •  Show why Java blocked and stopped running on-CPU: •  Identifies locks, I/O, sleeps –  If code path shouldn't block and looks random, it's an involuntary context switch. I could filter these, but you should have solved them beforehand (CPU load). •  e.g., was used to understand framework differences: # perf record -e context-switches -p PID -g -- sleep 5 vs rxNetty Tomcat
  • 83. Context Switch Flame Graph (1/2) rxNetty epoll futex
  • 84. Context Switch Flame Graph (2/2) Tomcat sys_poll futex
  • 85. Disk I/O Requests •  Shows who issued disk I/O (sync reads & writes): •  e.g.: page faults in GC? This JVM has swapped out!: # perf record -e block:block_rq_insert -a -g -- sleep 60 GC
  • 86. TCP Events •  TCP transmit, using dynamic tracing: •  Note: can be high overhead for high packet rates –  For the current perf trace, dump, post-process cycle •  Can also trace TCP connect & accept (lower overhead) •  TCP receive is async –  Could trace via socket read # perf probe tcp_sendmsg # perf record -e probe:tcp_sendmsg -a -g -- sleep 1; jmaps # perf script -f comm,pid,tid,cpu,time,event,ip,sym,dso,trace > out.stacks # perf probe --del tcp_sendmsg
  • 87. TCP Send Flame Graph kernel Java JVM Only one code-path taken in this example ab (client process)
  • 88. CPU Cache Misses •  In this example, sampling via Last Level Cache loads: •  -c is the count (samples once per count) •  Use other CPU counters to sample hits, misses, stalls # perf record -e LLC-loads -c 10000 -a -g -- sleep 5; jmaps # perf script -f comm,pid,tid,cpu,time,event,ip,sym,dso > out.stacks
  • 89. One Last Example •  Back to a mixed-mode CPU flame graph •  What else can we show with color?
  • 90. CPI Flame Graph •  Cycles Per Instruction! –  red == instruction heavy –  blue == cycle heavy (likely mem stall cycles) zoomed:
  • 91. Links & References •  Flame Graphs –  http://www.brendangregg.com/flamegraphs.html –  http://techblog.netflix.com/2015/07/java-in-flames.html –  http://techblog.netflix.com/2014/11/nodejs-in-flames.html –  http://www.brendangregg.com/blog/2014-11-09/differential-flame-graphs.html •  Linux perf_events –  https://perf.wiki.kernel.org/index.php/Main_Page –  http://www.brendangregg.com/perf.html –  http://www.brendangregg.com/blog/2015-02-27/linux-profiling-at-netflix.html •  Netflix Vector –  https://github.com/netflix/vector –  http://techblog.netflix.com/2015/04/introducing-vector-netflixs-on-host.html •  JDK tickets –  JDK8: https://bugs.openjdk.java.net/browse/JDK-8072465 –  JDK9: https://bugs.openjdk.java.net/browse/JDK-8068945 •  hprof: http://www.brendangregg.com/blog/2014-06-09/java-cpu-sampling-using-hprof.html
  • 92. Thanks •  Questions? •  http://techblog.netflix.com •  http://slideshare.net/brendangregg •  http://www.brendangregg.com •  bgregg@netflix.com •  @brendangregg Oct  2015