Performance Analysis:
new tools and concepts
from the cloud
Brendan Gregg
Lead Performance Engineer, Joyent
brendan.gregg@joyent.com

SCaLE10x
Jan, 2012
whoami
• I do performance analysis
• I also write performance tools out of necessity
• Was Brendan @ Sun Microsystems, Oracle,
now Joyent
Joyent
• Cloud computing provider
• Cloud computing software
• SmartOS
• host OS, and guest via OS virtualization
• Linux, Windows
• guest via KVM
Agenda
• Data
• Example problems & solutions
• How cloud environments complicate performance
• Theory
• Performance analysis
• Summarize new tools & concepts
• This talk uses SmartOS and DTrace to illustrate
concepts that are applicable to most OSes.
Data
• Example problems:
• CPU
• Memory
• Disk
• Network
• Some have neat solutions, some messy, some none
• This is real world
• Some I’ve covered before, some I haven’t
CPU
CPU utilization: problem
• Would like to identify:
• single or multiple CPUs at 100% utilization
• average, minimum and maximum CPU utilization
• CPU utilization balance (tight or loose distribution)
• time-based characteristics
changing/bursting? burst interval, burst length

• For small to large environments
• entire datacenters or clouds
CPU utilization
• mpstat(1) has the data. 1 second, 1 server (16 CPUs):
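• The underlying command is simply mpstat with a one-second interval (a
  sketch; the header below is from SmartOS mpstat(1), data rows omitted):

$ mpstat 1
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
[... one row per CPU, each second ...]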
CPU utilization
• Scaling to 60 seconds, 1 server:
CPU utilization
• Scaling to entire datacenter, 60 secs, 5312 CPUs:
CPU utilization
• Line graphs can solve some problems:
• x-axis: time, 60 seconds
• y-axis: utilization
CPU utilization
• ... but don’t scale well to individual devices
• 5312 CPUs, each as a line:
CPU utilization
• Pretty, but scale limited as well:
CPU utilization
• Utilization as a heat map:
• x-axis: time, y-axis: utilization
• z-axis (color): number of CPUs
CPU utilization
• Available in Cloud Analytics (Joyent)
• Clicking highlights and shows details; eg, hostname:
CPU utilization
• Utilization heat map also suitable and used for:
• disks
• network interfaces
• Utilization as a metric can be a bit misleading
• really a percent busy over a time interval
• devices may accept more work at 100% busy
• may not directly relate to performance impact
CPU utilization: summary
• Data readily available
• Using a new visualization
CPU usage
• Given a CPU is hot, what is it doing?
• Beyond just vmstat’s usr/sys ratio
• Profiling (sampling at an interval) the program
counter or stack back trace

• user-land stack for %usr
• kernel stack for %sys
• Many tools can do this to some degree
• Developer Studios/DTrace/oprofile/...
CPU usage: profiling
• Frequency count on-CPU user-land stack traces:
# dtrace -x ustackframes=100 -n 'profile-997 /execname == "mysqld"/ {
    @[ustack()] = count(); } tick-60s { exit(0); }'
dtrace: description 'profile-997 ' matched 2 probes
CPU     ID                    FUNCTION:NAME
  1  75195                        :tick-60s
[...] (over 500,000 lines truncated)

              libc.so.1`__priocntlset+0xa
              libc.so.1`getparam+0x83
              libc.so.1`pthread_getschedparam+0x3c
              libc.so.1`pthread_setschedprio+0x1f
              mysqld`_Z16dispatch_command19enum_server_commandP3THDPcj+0x9ab
              mysqld`_Z10do_commandP3THD+0x198
              mysqld`handle_one_connection+0x1a6
              libc.so.1`_thrp_setup+0x8d
              libc.so.1`_lwp_start
             4884

              mysqld`_Z13add_to_statusP17system_status_varS0_+0x47
              mysqld`_Z22calc_sum_of_all_statusP17system_status_var+0x67
              mysqld`_Z16dispatch_command19enum_server_commandP3THDPcj+0x1222
              mysqld`_Z10do_commandP3THD+0x198
              mysqld`handle_one_connection+0x1a6
              libc.so.1`_thrp_setup+0x8d
              libc.so.1`_lwp_start
             5530
CPU usage: profiling data
CPU usage: visualization
• Visualized as a “Flame Graph”:
CPU usage: Flame Graphs
• Just some Perl that turns DTrace output into an interactive SVG:
mouse-over elements for details

• It’s on github
• http://github.com/brendangregg/FlameGraph

• Works on kernel stacks, and both user+kernel
• Shouldn’t be hard to have it process oprofile, etc.
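• Typical pipeline (a sketch using the scripts from that repository):

# dtrace -x ustackframes=100 -n 'profile-997 /execname == "mysqld"/ {
    @[ustack()] = count(); } tick-60s { exit(0); }' -o out.stacks
# ./stackcollapse.pl out.stacks > out.folded
# ./flamegraph.pl out.folded > out.svg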
CPU usage: on the Cloud
• Flame Graphs were born out of necessity on
Cloud environments:

• Perf issues need quick resolution
(you just got hackernews’d)

• Everyone is running different versions of everything
(don’t assume you’ve seen the last of old CPU-hot
code-path issues that have been fixed)
CPU usage: summary
• Data can be available
• For cloud computing: easy for operators to fetch on
OS virtualized environments; otherwise agent driven,
and possibly other difficulties (access to CPU
instrumentation counter-based interrupts)

• Using a new visualization
CPU latency
• CPU dispatcher queue latency
• thread is ready-to-run, and waiting its turn
• Observable in coarse ways:
• vmstat’s r
• high load averages
• Less coarse, with microstate accounting
• prstat -mL’s LAT
• How much is it affecting application performance?
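• For example, per-thread microstates (illustrative header only; LAT is
  the percent of time spent waiting on a CPU dispatcher queue):

$ prstat -mLc 1
   PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID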
CPU latency: zonedispqlat.d
• Using DTrace to trace kernel scheduler events:
# ./zonedispqlat.d
Tracing...
Note: outliers (> 1 secs) may be artifacts due to the use of scalar globals
(sorry).

CPU disp queue latency by zone (ns):
  dbprod-045
           value  ------------- Distribution ------------- count
             512 |                                         0
            1024 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@             10210
            2048 |@@@@@@@@@@                               3829
            4096 |@                                        514
            8192 |                                         94
           16384 |                                         0
           32768 |                                         0
           65536 |                                         0
          131072 |                                         0
          262144 |                                         0
          524288 |                                         0
         1048576 |                                         1
         2097152 |                                         0
         4194304 |                                         0
         8388608 |                                         1
        16777216 |                                         0
[...]
CPU latency: zonedispqlat.d
• CPU dispatcher queue latency by zonename
(zonedispqlat.d), work in progress:

#!/usr/sbin/dtrace -s

#pragma D option quiet

dtrace:::BEGIN
{
	printf("Tracing...\n");
	printf("Note: outliers (> 1 secs) may be artifacts due to the ");
	printf("use of scalar globals (sorry).\n\n");
}

sched:::enqueue
{
	/* scalar global (I don't think this can be thread local) */
	start[args[0]->pr_lwpid, args[1]->pr_pid] = timestamp;
}

sched:::dequeue
/this->start = start[args[0]->pr_lwpid, args[1]->pr_pid]/
{
	this->time = timestamp - this->start;
	/* workaround since zonename isn't a member of args[1]... */
	this->zone = ((proc_t *)args[1]->pr_addr)->p_zone->zone_name;
	@[stringof(this->zone)] = quantize(this->time);
	start[args[0]->pr_lwpid, args[1]->pr_pid] = 0;
}

tick-1sec
{
	printf("CPU disp queue latency by zone (ns):\n");
	printa(@);
	trunc(@);
}

• Save timestamp on enqueue; calculate delta on dequeue
CPU latency: zonedispqlat.d
• Instead of zonename, this could be process name, ...
• Tracing scheduler enqueue/dequeue events and
saving timestamps costs CPU overhead

• they are frequent
• I’d prefer to only trace dequeue, and reuse the
existing microstate accounting timestamps

• but one problem is a clash between unscaled and
scaled timestamps
CPU latency: on the Cloud
• With virtualization, you can have:

high CPU latency with idle CPUs,
due to an instance consuming its quota

• OS virtualization
• not visible in vmstat r
• is visible as part of prstat -mL’s LAT
• more kstats recently added to SmartOS
including nsec_waitrq (total run queue wait by zone)

• Hardware virtualization
• vmstat st (stolen)
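• A quick way to find the new kstat mentioned above (a sketch; kstat
  module and instance names may vary by SmartOS version):

# kstat -p | grep nsec_waitrq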
CPU latency: caps
• CPU cap latency from the host (zonecapslat.d):
#!/usr/sbin/dtrace -s

#pragma D option quiet

sched:::cpucaps-sleep
{
	start[args[0]->pr_lwpid, args[1]->pr_pid] = timestamp;
}

sched:::cpucaps-wakeup
/this->start = start[args[0]->pr_lwpid, args[1]->pr_pid]/
{
	this->time = timestamp - this->start;
	/* workaround since zonename isn't a member of args[1]... */
	this->zone = ((proc_t *)args[1]->pr_addr)->p_zone->zone_name;
	@[stringof(this->zone)] = quantize(this->time);
	start[args[0]->pr_lwpid, args[1]->pr_pid] = 0;
}

tick-1sec
{
	printf("CPU caps latency by zone (ns):\n");
	printa(@);
	trunc(@);
}
CPU latency: summary
• Partial data available
• New tools/metrics created
• although current DTrace solutions have overhead;
we should be able to improve that

• although, new kstats may be sufficient
Memory
Memory: problem
• Riak database has endless memory growth.
• expected 9GB, after two days:
$ prstat -c 1
Please wait...
   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
 21722 103        43G   40G cpu0    59    0  72:23:41 2.6% beam.smp/594
 15770 root     7760K  540K sleep   57    0  23:28:57 0.9% zoneadmd/5
    95 root        0K    0K sleep   99  -20   7:37:47 0.2% zpool-zones/166
 12827 root      128M   73M sleep  100    -   0:49:36 0.1% node/5
 10319 bgregg     10M 6788K sleep   59    0   0:00:00 0.0% sshd/1
 10402 root       22M  288K sleep   59    0   0:18:45 0.0% dtrace/1
[...]

• Eventually hits paging and terrible performance
• needing a restart
• Is this a memory leak? Or application growth?
Memory: scope
• Identify the subsystem and team responsible
Subsystem       Team

Application     Voxer
Riak            Basho
Erlang          Ericsson
SmartOS         Joyent
Memory: heap profiling
• What is in the heap?
$ pmap 14719
14719:  beam.smp
0000000000400000      2168K r-x--  /opt/riak/erts-5.8.5/bin/beam.smp
000000000062D000       328K rw---  /opt/riak/erts-5.8.5/bin/beam.smp
000000000067F000   4193540K rw---  /opt/riak/erts-5.8.5/bin/beam.smp
00000001005C0000   4194296K rw---    [ anon ]
00000002005BE000   4192016K rw---    [ anon ]
0000000300382000   4193664K rw---    [ anon ]
00000004002E2000   4191172K rw---    [ anon ]
00000004FFFD3000   4194040K rw---    [ anon ]
00000005FFF91000   4194028K rw---    [ anon ]
00000006FFF4C000   4188812K rw---    [ anon ]
00000007FF9EF000    588224K rw---    [ heap ]
[...]

• ... and why does it keep growing?
• Would like to answer these in production
• Without restarting apps. Experimentation (backend=mmap,
other allocators) wasn’t working.
Memory: heap profiling
• libumem was used for multi-threaded performance
• libumem == user-land slab allocator
• detailed observability can be enabled, allowing heap
profiling and leak detection

• While designed with speed and production use in
mind, it still comes with some cost (time and space),
and isn’t on by default.

• UMEM_DEBUG=audit
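• One illustrative way to enable it (a sketch, not necessarily the exact
  procedure used here), then inspect with mdb(1)'s umem dcmds:

$ LD_PRELOAD=libumem.so UMEM_DEBUG=audit UMEM_LOGGING=transaction riak start
$ mdb -p `pgrep beam.smp`
> ::findleaks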
Memory: heap profiling
• libumem provides some default observability
• Eg, slabs:
> ::umem_malloc_info
            CACHE  BUFSZ MAXMAL BUFMALLC AVG_MAL  MALLOCED  OVERHEAD   %OVER
0000000000707028      8      0        0       0         0         0    0.0%
000000000070b028     16      8     8730       8     69836   1054998 1510.6%
000000000070c028     32     16     8772      16    140352   1130491  805.4%
000000000070f028     48     32  1148038      25  29127788 156179051  536.1%
0000000000710028     64     48   344138      40  13765658  58417287  424.3%
0000000000711028     80     64       36      62      2226      4806  215.9%
0000000000714028     96     80     8934      79    705348   1168558  165.6%
0000000000715028    112     96  1347040      87 117120208 190389780  162.5%
0000000000718028    128    112   253107     111  28011923  42279506  150.9%
000000000071a028    160    144    40529     118   4788681   6466801  135.0%
000000000071b028    192    176      140     155     21712     25818  118.9%
000000000071e028    224    208       43     188      8101      6497   80.1%
000000000071f028    256    240      133     229     30447     26211   86.0%
0000000000720028    320    304       56     276     15455     12276   79.4%
0000000000723028    384    368       35     335     11726      7220   61.5%
[...]
Memory: heap profiling
• ... and heap (captured @14GB RSS):
> ::vmem
ADDR             NAME                      INUSE       TOTAL  SUCCEED  FAIL
fffffd7ffebed4a0 sbrk_top             9090404352 14240165888  4298117 84403
fffffd7ffebee0a8   sbrk_heap          9090404352  9090404352  4298117     0
fffffd7ffebeecb0     vmem_internal     664616960   664616960    79621     0
fffffd7ffebef8b8       vmem_seg        651993088   651993088    79589     0
fffffd7ffebf04c0       vmem_hash        12583424    12587008       27     0
fffffd7ffebf10c8       vmem_vmem           46200       55344       15     0
00000000006e7000     umem_internal     352862464   352866304    88746     0
00000000006e8000       umem_cache         113696      180224       44     0
00000000006e9000       umem_hash        13091328    13099008       86     0
00000000006ea000     umem_log                  0           0        0     0
00000000006eb000     umem_firewall_va          0           0        0     0
00000000006ec000     umem_firewall             0           0        0     0
00000000006ed000     umem_oversize    5218777974  5520789504  3822051     0
00000000006f0000     umem_memalign             0           0        0     0
0000000000706000     umem_default     2552131584  2552131584   307699     0

• The heap is 9 GB (as expected), but sbrk_top total is
14 GB (equal to RSS). And growing.

• Are there Gbyte-sized malloc()/free()s?
Memory: malloc() profiling
# dtrace -n 'pid$target::malloc:entry { @ = quantize(arg0); }' -p 17472
dtrace: description 'pid$target::malloc:entry ' matched 3 probes
^C
           value  ------------- Distribution ------------- count
               2 |                                         0
               4 |                                         3
               8 |@                                        5927
              16 |@@@@                                     41818
              32 |@@@@@@@@@                                81991
              64 |@@@@@@@@@@@@@@@@@@                       169888
             128 |@@@@@@@                                  69891
             256 |                                         2257
             512 |                                         406
            1024 |                                         893
            2048 |                                         146
            4096 |                                         1467
            8192 |                                         755
           16384 |                                         950
           32768 |                                         83
           65536 |                                         31
          131072 |                                         11
          262144 |                                         15
          524288 |                                         0
         1048576 |                                         1
         2097152 |                                         0

• This tool (one-liner) profiles malloc() request sizes
• No huge malloc()s, but RSS continues to climb.
Memory: heap growth
• Tracing why the heap grows via brk():
# dtrace -n 'syscall::brk:entry /execname == "beam.smp"/ { ustack(); }'
dtrace: description 'syscall::brk:entry ' matched 1 probe
CPU     ID                    FUNCTION:NAME
 10     18                        brk:entry
libc.so.1`_brk_unlocked+0xa
libumem.so.1`vmem_sbrk_alloc+0x84
libumem.so.1`vmem_xalloc+0x669
libumem.so.1`vmem_alloc+0x14f
libumem.so.1`vmem_xalloc+0x669
libumem.so.1`vmem_alloc+0x14f
libumem.so.1`umem_alloc+0x72
libumem.so.1`malloc+0x59
libstdc++.so.6.0.14`_Znwm+0x20
libstdc++.so.6.0.14`_Znam+0x9
eleveldb.so`_ZN7leveldb9ReadBlockEPNS_16RandomAccessFileERKNS_11Rea...
eleveldb.so`_ZN7leveldb5Table11BlockReaderEPvRKNS_11ReadOptionsERKN...
eleveldb.so`_ZN7leveldb12_GLOBAL__N_116TwoLevelIterator13InitDataBl...
eleveldb.so`_ZN7leveldb12_GLOBAL__N_116TwoLevelIterator4SeekERKNS_5...
eleveldb.so`_ZN7leveldb12_GLOBAL__N_116TwoLevelIterator4SeekERKNS_5...
eleveldb.so`_ZN7leveldb12_GLOBAL__N_115MergingIterator4SeekERKNS_5S...
eleveldb.so`_ZN7leveldb12_GLOBAL__N_16DBIter4SeekERKNS_5SliceE+0xcc
eleveldb.so`eleveldb_get+0xd3
beam.smp`process_main+0x6939
beam.smp`sched_thread_func+0x1cf
beam.smp`thr_wrapper+0xbe

• This shows the user-land stack trace for every heap growth
Memory: heap growth
• More DTrace showed the size of the malloc()s
causing the brk()s:

# dtrace -x dynvarsize=4m -n '
pid$target::malloc:entry { self->size = arg0; }
syscall::brk:entry /self->size/ { printf("%d bytes", self->size); }
pid$target::malloc:return { self->size = 0; }' -p 17472
dtrace: description 'pid$target::malloc:entry ' matched 7 probes
CPU     ID                    FUNCTION:NAME
  0     44                        brk:entry 8343520 bytes
  0     44                        brk:entry 8343520 bytes
[...]

• These 8 Mbyte malloc()s grew the heap
• Even though the heap has Gbytes not in use
• This is starting to look like an OS issue
Memory: allocator internals
• More tools were created:
• Show memory entropy (+ malloc - free)
along with heap growth, over time

• Show codepath taken for allocations
compare successful with unsuccessful (heap growth)

• Show allocator internals: sizes, options, flags
• And run in the production environment
• Briefly; tracing frequent allocs does cost overhead
• Casting light into what was a black box
Memory: allocator internals
  4  <- vmem_xalloc                       0
  4  -> _sbrk_grow_aligned                4096
  4  <- _sbrk_grow_aligned                17155911680
  4  -> vmem_xalloc                       7356400
  4   | vmem_xalloc:entry                 umem_oversize
  4  -> vmem_alloc                        7356416
  4  -> vmem_xalloc                       7356416
  4   | vmem_xalloc:entry                 sbrk_heap
  4  -> vmem_sbrk_alloc                   7356416
  4  -> vmem_alloc                        7356416
  4  -> vmem_xalloc                       7356416
  4   | vmem_xalloc:entry                 sbrk_top
  4  -> vmem_reap                         16777216
  4  <- vmem_reap                         3178535181209758
  4   | vmem_xalloc:return  vmem_xalloc() == NULL, vm: sbrk_top,
      size: 7356416, align: 4096, phase: 0, nocross: 0, min: 0, max: 0,
      vmflag: 1
              libumem.so.1`vmem_xalloc+0x80f
              libumem.so.1`vmem_sbrk_alloc+0x33
              libumem.so.1`vmem_xalloc+0x669
              libumem.so.1`vmem_alloc+0x14f
              libumem.so.1`vmem_xalloc+0x669
              libumem.so.1`vmem_alloc+0x14f
              libumem.so.1`umem_alloc+0x72
              libumem.so.1`malloc+0x59
              libstdc++.so.6.0.3`_Znwm+0x2b
              libstdc++.so.6.0.3`_ZNSs4_Rep9_S_createEmmRKSaIcE+0x7e
Memory: solution
• These new tools and metrics pointed to the
allocation algorithm “instant fit”

• Someone had suggested this earlier; the tools provided solid evidence that this
really was the case here

• A new version of libumem was built to force use of
VM_BESTFIT

• and added by Robert Mustacchi as a tunable:
UMEM_OPTIONS=allocator=best

• Customer restarted Riak with new libumem version
• Problem solved
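• For reference, such a tunable is applied via the environment before
  startup (a sketch; the actual restart procedure will vary):

$ LD_PRELOAD=libumem.so UMEM_OPTIONS=allocator=best riak start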
Memory: on the Cloud
• With OS virtualization, you can have:
Paging without scanning

• paging == swapping blocks with physical storage
• swapping == swapping entire threads between main
memory and physical storage

• Resource control paging is unrelated to the page
scanner, so, no vmstat scan rate (sr) despite
anonymous paging

• More new tools: DTrace sysinfo:::anonpgin by
process name, zonename
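• For example (a sketch using DTrace built-ins):

# dtrace -n 'sysinfo:::anonpgin { @[execname, zonename] = count(); }'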
Memory: summary
• Superficial data available, detailed info not
• not by default
• Many new tools were created
• not easy, but made possible with DTrace
Disk
Disk: problem
• Application performance issues
• Disks look busy (iostat)
• Blame the disks?
$ iostat -xnz 1
[...]
                    extended device statistics
    r/s    w/s    kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
  124.0  334.9 15677.2 40484.9  0.0  1.0    0.0    2.2   1  69 c0t1d0
                    extended device statistics
    r/s    w/s    kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
  114.0  407.1 14595.9 49409.1  0.0  0.8    0.0    1.5   1  56 c0t1d0
                    extended device statistics
    r/s    w/s    kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
   85.0  438.0 10814.8 53242.1  0.0  0.8    0.0    1.6   1  57 c0t1d0
• Many graphical tools are built upon iostat
Disk: on the Cloud
• Tenants can’t see each other
• Maybe a neighbor is doing a backup?
• Maybe a neighbor is running a benchmark?
• Can’t see their processes (top/prstat)
• Blame what you can’t see
Disk: VFS
• Applications usually talk to a file system
• and are hurt by file system latency
• Disk I/O can be:
• unrelated to the application: asynchronous tasks
• inflated from what the application requested
• deflated


• blind to issues caused higher up the kernel stack
Disk: issues with iostat(1)
• Unrelated:
• other applications / tenants
• file system prefetch
• file system dirty data flushing
• Inflated:
• rounded up to the next file system record size
• extra metadata for on-disk format
• read-modify-write of RAID5
Disk: issues with iostat(1)
• Deflated:
• read caching
• write buffering
• Blind:
• lock contention in the file system
• CPU usage by the file system
• file system software bugs
• file system queue latency
Disk: issues with iostat(1)
• blind (continued):
• disk cache flush latency (if your file system does it)
• file system I/O throttling latency
• I/O throttling is a new ZFS feature for cloud
environments

• adds artificial latency to file system I/O to throttle it
• added by Bill Pijewski and Jerry Jelinek of Joyent
Disk: file system latency
• Using DTrace to summarize ZFS read latency:
$ dtrace -n 'fbt::zfs_read:entry { self->start = timestamp; }
    fbt::zfs_read:return /self->start/ {
    @["ns"] = quantize(timestamp - self->start); self->start = 0; }'
dtrace: description 'fbt::zfs_read:entry ' matched 2 probes
^C
  ns
           value  ------------- Distribution ------------- count
             512 |                                         0
            1024 |@                                        6
            2048 |@@                                       18
            4096 |@@@@@@@                                  79
            8192 |@@@@@@@@@@@@@@@@@                        191   <- cache reads
           16384 |@@@@@@@@@@                               112
           32768 |@                                        14
           65536 |                                         1
          131072 |                                         1
          262144 |                                         0
          524288 |                                         0
         1048576 |                                         0
         2097152 |                                         0
         4194304 |@@@                                      31    <- disk reads
         8388608 |@                                        9
        16777216 |                                         0
Disk: file system latency
• Tracing zfs events using zfsslower.d:
# ./zfsslower.d 10
TIME                  PROCESS  D  KB  ms FILE
2011 May 17 01:23:12  mysqld   R  16  19 /z01/opt/mysql5-64/data/xxxxx/xxxxx.ibd
2011 May 17 01:23:13  mysqld   W  16  10 /z01/var/mysql/xxxxx/xxxxx.ibd
2011 May 17 01:23:33  mysqld   W  16  11 /z01/var/mysql/xxxxx/xxxxx.ibd
2011 May 17 01:23:33  mysqld   W  16  10 /z01/var/mysql/xxxxx/xxxxx.ibd
2011 May 17 01:23:51  httpd    R  56  14 /z01/home/xxxxx/xxxxx/xxxxx/xxxxx/xxxxx
^C

• Argument is the minimum latency in milliseconds
Disk: file system latency
• Can trace this from other locations too:
• VFS layer: filter on desired file system types
• syscall layer: filter on file descriptors for file systems
• application layer: trace file I/O calls
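• For example, a syscall-layer sketch, using the fds[] array to match
  the file system type:

# dtrace -n '
    syscall::read:entry /fds[arg0].fi_fs == "zfs"/ { self->ts = timestamp; }
    syscall::read:return /self->ts/ {
        @["zfs read (ns)"] = quantize(timestamp - self->ts); self->ts = 0; }'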
Disk: file system latency
• And using SystemTap:
# ./vfsrlat.stp
Tracing... Hit Ctrl-C to end
^C
[..]
ext4 (ns):
   value |-------------------------------------------------- count
     256 |                                                       0
     512 |                                                       0
    1024 |                                                      16
    2048 |                                                      17
    4096 |                                                       4
    8192 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 16321
   16384 |                                                      50
   32768 |                                                       1
   65536 |                                                      13
  131072 |                                                       0
  262144 |                                                       0

• Traces vfs.read to vfs.read.return, and gets the FS
type via: $file->f_path->dentry->d_inode->i_sb->s_type->name

• Warning: this script has crashed ubuntu/CentOS; I’m told RHEL is better
Disk: file system visualizations
• File system latency as a heat map (Cloud Analytics):

• This screenshot shows severe outliers
Disk: file system visualizations
• Sometimes the heat map is very surprising:

• This screenshot is from the Oracle ZFS Storage Appliance
Disk: summary
• Misleading data available
• New tools/metrics created
• Latency visualizations
Network
Network: problem
• TCP SYNs queue in-kernel until they are accept()ed
• The queue length is the TCP listen backlog
• may be set in listen()
• and limited by a system tunable (usually 128)
• on SmartOS: tcp_conn_req_max_q

• What if the queue remains full
• eg, application is overwhelmed with other work,
• or CPU starved
• ... and another SYN arrives?
Network: TCP listen drops
• Packet is dropped by the kernel
• fortunately a counter is bumped:
$ netstat -s | grep Drop
        tcpTimRetransDrop   =     56    tcpTimKeepalive     =   2582
        tcpTimKeepaliveProbe=   1594    tcpTimKeepaliveDrop =     41
        tcpListenDrop       =3089298    tcpListenDropQ0     =      0
        tcpHalfOpenDrop     =      0    tcpOutSackRetrans   =1400832
        icmpOutDrops        =      0    icmpOutErrors       =      0
        sctpTimRetrans      =      0    sctpTimRetransDrop  =      0
        sctpTimHearBeatProbe=      0    sctpTimHearBeatDrop =      0
        sctpListenDrop      =      0    sctpInClosed        =      0

• Remote host waits, and then retransmits
• TCP retransmit interval; usually 1 or 3 seconds
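• To check the backlog limit in play (a sketch; 128 is the usual
  default mentioned earlier):

# ndd -get /dev/tcp tcp_conn_req_max_q
128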
Network: predicting drops
• How do we know if we are close to dropping?
• An early warning
Network: tcpconnreqmaxq.d
• DTrace script traces drop events, if they occur:
# ./tcpconnreqmaxq.d
Tracing... Hit Ctrl-C to end.
2012 Jan 19 01:37:52 tcp_input_listener:tcpListenDrop   cpid:11504
2012 Jan 19 01:37:52 tcp_input_listener:tcpListenDrop   cpid:11504
2012 Jan 19 01:37:52 tcp_input_listener:tcpListenDrop   cpid:11504
2012 Jan 19 01:37:52 tcp_input_listener:tcpListenDrop   cpid:11504
2012 Jan 19 01:37:52 tcp_input_listener:tcpListenDrop   cpid:11504
2012 Jan 19 01:37:52 tcp_input_listener:tcpListenDrop   cpid:11504
2012 Jan 19 01:37:52 tcp_input_listener:tcpListenDrop   cpid:11504
2012 Jan 19 01:37:52 tcp_input_listener:tcpListenDrop   cpid:11504
2012 Jan 19 01:37:52 tcp_input_listener:tcpListenDrop   cpid:11504
2012 Jan 19 01:37:52 tcp_input_listener:tcpListenDrop   cpid:11504
2012 Jan 19 01:37:52 tcp_input_listener:tcpListenDrop   cpid:11504
2012 Jan 19 01:37:52 tcp_input_listener:tcpListenDrop   cpid:11504
2012 Jan 19 01:37:52 tcp_input_listener:tcpListenDrop   cpid:11504
2012 Jan 19 01:37:52 tcp_input_listener:tcpListenDrop   cpid:11504
[...]

• ... and when Ctrl-C is hit:
Network: tcpconnreqmaxq.d
tcp_conn_req_cnt_q distributions:

  cpid:3063                                          max_q:8
           value  ------------- Distribution ------------- count
              -1 |                                         0
               0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1
               1 |                                         0

  cpid:11504                                         max_q:128
           value  ------------- Distribution ------------- count
              -1 |                                         0
               0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@     7279
               1 |@@                                      405
               2 |@                                       255
               4 |@                                       138
               8 |                                        81
              16 |                                        83
              32 |                                        62
              64 |                                        67
             128 |                                        34
             256 |                                        0

tcpListenDrops:
  cpid:11504                                         max_q:128     34

• value is the length of the queue, measured on each SYN event;
  max_q is the tcp_conn_req_max_q value in use
Network: tcplistendrop.d
• More details can be fetched as needed:
# ./tcplistendrop.d
TIME                  SRC-IP             PORT      DST-IP           PORT
2012 Jan 19 01:22:49  10.17.210.103      25691 ->  192.192.240.212  80
2012 Jan 19 01:22:49  10.17.210.108      18423 ->  192.192.240.212  80
2012 Jan 19 01:22:49  10.17.210.116      38883 ->  192.192.240.212  80
2012 Jan 19 01:22:49  10.17.210.117      10739 ->  192.192.240.212  80
2012 Jan 19 01:22:49  10.17.210.112      27988 ->  192.192.240.212  80
2012 Jan 19 01:22:49  10.17.210.106      28824 ->  192.192.240.212  80
2012 Jan 19 01:22:49  10.12.143.16       65070 ->  192.192.240.212  80
2012 Jan 19 01:22:49  10.17.210.100      56392 ->  192.192.240.212  80
2012 Jan 19 01:22:49  10.17.210.99       24628 ->  192.192.240.212  80
2012 Jan 19 01:22:49  10.17.210.98       11686 ->  192.192.240.212  80
2012 Jan 19 01:22:49  10.17.210.101      34629 ->  192.192.240.212  80
[...]

• Just tracing the drop code-path
• Don’t need to pay the overhead of sniffing all packets
Network: DTrace code
• Key code from tcplistendrop.d:
fbt::tcp_input_listener:entry  { self->mp = args[1]; }
fbt::tcp_input_listener:return { self->mp = 0; }

mib:::tcpListenDrop
/self->mp/
{
	this->iph = (ipha_t *)self->mp->b_rptr;
	this->tcph = (tcph_t *)(self->mp->b_rptr + 20);
	printf("%-20Y %-18s %-5d -> %-18s %-5d\n", walltimestamp,
	    inet_ntoa(&this->iph->ipha_src),
	    ntohs(*(uint16_t *)this->tcph->th_lport),
	    inet_ntoa(&this->iph->ipha_dst),
	    ntohs(*(uint16_t *)this->tcph->th_fport));
}

• This uses the unstable interface fbt provider
• a stable tcp provider now exists, which is better for
more common tasks - like connections by IP
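• For example, counting established inbound connections by remote
  address (a sketch using the stable tcp provider):

# dtrace -n 'tcp:::accept-established { @[args[3]->tcps_raddr] = count(); }'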
Network: summary
• For TCP, while many counters are available, they are
system wide integers

• Custom tools can show more details
• addresses and ports
• kernel state
• needs kernel access and dynamic tracing
Data Recap
• Problem types
• CPU utilization    scaleability
• CPU usage          scaleability
• CPU latency        observability
• Memory             observability
• Disk               observability
• Network            observability
Data Recap
• Problem types, solution types
• CPU utilization    scaleability      visualizations
• CPU usage          scaleability      visualizations
• CPU latency        observability     metrics
• Memory             observability     metrics
• Disk               observability     metrics
• Network            observability     metrics
Theory
Performance Analysis
• Goals
• Capacity planning
• Issue analysis
Performance Issues
• Strategy
• Step 1: is there a problem?
• Step 2: which subsystem/team is responsible?
• Difficult to get past these steps without reliable
metrics
Problem Space
• Myths
• Vendors provide good metrics with good coverage
• The problem is to line-graph them
• Realities
• Metrics can be wrong, incomplete and misleading,
requiring time and expertise to interpret

• Line graphs can hide issues
Problem Space
• Cloud computing confuses matters further:
• hiding metrics from neighbors
• throttling performance due to invisible neighbors
Example Problems
• Included:
• Understanding utilization across 5,312 CPUs
• Using disk I/O metrics to explain application
performance

• A lack of metrics for memory growth, packet drops, ...
Example Solutions: tools
• Device utilization heat maps for CPUs
• Flame graphs for CPU profiling
• CPU dispatcher queue latency by zone
• CPU caps latency by zone
• malloc() size profiling
• Heap growth stack backtraces
• File system latency distributions
• File system latency tracing
• TCP accept queue length distribution
• TCP listen drop tracing with details
Key Concepts
• Visualizations
• heat maps for device utilization and latency
• flame graphs
• Custom metrics often necessary
• Latency-based for issue analysis
• If coding isn’t practical/timely, use dynamic tracing
• Cloud Computing
• Provide observability (often to show what the problem isn’t)
• Develop new metrics for resource control effects
DTrace
• Many problems were only solved thanks to DTrace
• In the SmartOS cloud environment:
• The compute node (global zone) can DTrace
everything (except for KVM guests, for which it has a
limited view: resource I/O + some MMU events, so far)

• SmartMachines (zones) have the DTrace syscall,
profile (their user-land only), pid and USDT providers

• Joyent Cloud Analytics uses DTrace from the global
zone to give extended details to customers
Performance
• The more you know, the more you don’t
• Hopefully I’ve turned some unknown-unknowns
into known-unknowns
Thank you
• Resources:
• http://dtrace.org/blogs/brendan
• More CPU utilization visualizations:

http://dtrace.org/blogs/brendan/2011/12/18/visualizing-device-utilization/

• Flame Graphs: http://dtrace.org/blogs/brendan/2011/12/16/flame-graphs/
and http://github.com/brendangregg/FlameGraph

• More iostat(1) & file system latency discussion:

http://dtrace.org/blogs/brendan/tag/filesystem-2/

• Cloud Analytics:
• OSCON slides: http://dtrace.org/blogs/dap/files/2011/07/ca-oscon-data.pdf
• Joyent: http://joyent.com

• brendan@joyent.com

More Related Content

What's hot

ACM Applicative System Methodology 2016
ACM Applicative System Methodology 2016ACM Applicative System Methodology 2016
ACM Applicative System Methodology 2016
Brendan Gregg
 
EuroBSDcon 2017 System Performance Analysis Methodologies
EuroBSDcon 2017 System Performance Analysis MethodologiesEuroBSDcon 2017 System Performance Analysis Methodologies
EuroBSDcon 2017 System Performance Analysis Methodologies
Brendan Gregg
 

What's hot (20)

Lisa12 methodologies
Lisa12 methodologiesLisa12 methodologies
Lisa12 methodologies
 
USENIX ATC 2017: Visualizing Performance with Flame Graphs
USENIX ATC 2017: Visualizing Performance with Flame GraphsUSENIX ATC 2017: Visualizing Performance with Flame Graphs
USENIX ATC 2017: Visualizing Performance with Flame Graphs
 
JavaOne 2015 Java Mixed-Mode Flame Graphs
JavaOne 2015 Java Mixed-Mode Flame GraphsJavaOne 2015 Java Mixed-Mode Flame Graphs
JavaOne 2015 Java Mixed-Mode Flame Graphs
 
ACM Applicative System Methodology 2016
ACM Applicative System Methodology 2016ACM Applicative System Methodology 2016
ACM Applicative System Methodology 2016
 
From DTrace to Linux
From DTrace to LinuxFrom DTrace to Linux
From DTrace to Linux
 
How Netflix Tunes EC2 Instances for Performance
How Netflix Tunes EC2 Instances for PerformanceHow Netflix Tunes EC2 Instances for Performance
How Netflix Tunes EC2 Instances for Performance
 
SREcon 2016 Performance Checklists for SREs
SREcon 2016 Performance Checklists for SREsSREcon 2016 Performance Checklists for SREs
SREcon 2016 Performance Checklists for SREs
 
Java Performance Analysis on Linux with Flame Graphs
Java Performance Analysis on Linux with Flame GraphsJava Performance Analysis on Linux with Flame Graphs
Java Performance Analysis on Linux with Flame Graphs
 
FreeBSD 2014 Flame Graphs
FreeBSD 2014 Flame GraphsFreeBSD 2014 Flame Graphs
FreeBSD 2014 Flame Graphs
 
Velocity 2015 linux perf tools
Velocity 2015 linux perf toolsVelocity 2015 linux perf tools
Velocity 2015 linux perf tools
 
Systems Performance: Enterprise and the Cloud
Systems Performance: Enterprise and the CloudSystems Performance: Enterprise and the Cloud
Systems Performance: Enterprise and the Cloud
 
Netflix SRE perf meetup_slides
Netflix SRE perf meetup_slidesNetflix SRE perf meetup_slides
Netflix SRE perf meetup_slides
 
Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013
Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013
Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013
 
YOW2021 Computing Performance
YOW2021 Computing PerformanceYOW2021 Computing Performance
YOW2021 Computing Performance
 
DTrace Topics: Introduction
DTrace Topics: IntroductionDTrace Topics: Introduction
DTrace Topics: Introduction
 
Linux Performance Analysis: New Tools and Old Secrets
Linux Performance Analysis: New Tools and Old SecretsLinux Performance Analysis: New Tools and Old Secrets
Linux Performance Analysis: New Tools and Old Secrets
 
MeetBSD2014 Performance Analysis
MeetBSD2014 Performance AnalysisMeetBSD2014 Performance Analysis
MeetBSD2014 Performance Analysis
 
EuroBSDcon 2017 System Performance Analysis Methodologies
EuroBSDcon 2017 System Performance Analysis MethodologiesEuroBSDcon 2017 System Performance Analysis Methodologies
EuroBSDcon 2017 System Performance Analysis Methodologies
 
Container Performance Analysis
Container Performance AnalysisContainer Performance Analysis
Container Performance Analysis
 
Linux Performance 2018 (PerconaLive keynote)
Linux Performance 2018 (PerconaLive keynote)Linux Performance 2018 (PerconaLive keynote)
Linux Performance 2018 (PerconaLive keynote)
 

Viewers also liked

Viewers also liked (20)

Performance Analysis: The USE Method
Performance Analysis: The USE MethodPerformance Analysis: The USE Method
Performance Analysis: The USE Method
 
Linux Performance Analysis and Tools
Linux Performance Analysis and ToolsLinux Performance Analysis and Tools
Linux Performance Analysis and Tools
 
Blazing Performance with Flame Graphs
Blazing Performance with Flame GraphsBlazing Performance with Flame Graphs
Blazing Performance with Flame Graphs
 
Real-time in the real world: DIRT in production
Real-time in the real world: DIRT in productionReal-time in the real world: DIRT in production
Real-time in the real world: DIRT in production
 
HTTP Application Performance Analysis
HTTP Application Performance AnalysisHTTP Application Performance Analysis
HTTP Application Performance Analysis
 
Performance analysis 2013
Performance analysis 2013Performance analysis 2013
Performance analysis 2013
 
Web performance optimization (WPO)
Web performance optimization (WPO)Web performance optimization (WPO)
Web performance optimization (WPO)
 
Approaches to Software Testing
Approaches to Software TestingApproaches to Software Testing
Approaches to Software Testing
 
DTraceCloud2012
DTraceCloud2012DTraceCloud2012
DTraceCloud2012
 
The New Systems Performance
The New Systems PerformanceThe New Systems Performance
The New Systems Performance
 
Linux Performance Tools 2014
Linux Performance Tools 2014Linux Performance Tools 2014
Linux Performance Tools 2014
 
Measuring the Performance of Single Page Applications
Measuring the Performance of Single Page ApplicationsMeasuring the Performance of Single Page Applications
Measuring the Performance of Single Page Applications
 
Application Performance Management - Solving the Performance Puzzle
Application Performance Management - Solving the Performance PuzzleApplication Performance Management - Solving the Performance Puzzle
Application Performance Management - Solving the Performance Puzzle
 
An Introduction to Software Performance Engineering
An Introduction to Software Performance EngineeringAn Introduction to Software Performance Engineering
An Introduction to Software Performance Engineering
 
Application Performance Monitoring (APM)
Application Performance Monitoring (APM)Application Performance Monitoring (APM)
Application Performance Monitoring (APM)
 
Effective Use Of Performance Analysis
Effective Use Of Performance AnalysisEffective Use Of Performance Analysis
Effective Use Of Performance Analysis
 
A Modern Approach to Performance Monitoring
A Modern Approach to Performance MonitoringA Modern Approach to Performance Monitoring
A Modern Approach to Performance Monitoring
 
Using dynaTrace to optimise application performance
Using dynaTrace to optimise application performanceUsing dynaTrace to optimise application performance
Using dynaTrace to optimise application performance
 
Dynatrace
DynatraceDynatrace
Dynatrace
 
Designing Tracing Tools
Designing Tracing ToolsDesigning Tracing Tools
Designing Tracing Tools
 

Similar to Performance Analysis: new tools and concepts from the cloud

Summit demystifying systemd1
Summit demystifying systemd1Summit demystifying systemd1
Summit demystifying systemd1
Susant Sahani
 
Tuning parallelcodeonsolaris005
Tuning parallelcodeonsolaris005Tuning parallelcodeonsolaris005
Tuning parallelcodeonsolaris005
dflexer
 
002 - Introduction to CUDA Programming_1.ppt
002 - Introduction to CUDA Programming_1.ppt002 - Introduction to CUDA Programming_1.ppt
002 - Introduction to CUDA Programming_1.ppt
ceyifo9332
 
Oracle GoldenGate Architecture Performance
Oracle GoldenGate Architecture PerformanceOracle GoldenGate Architecture Performance
Oracle GoldenGate Architecture Performance
Enkitec
 

Similar to Performance Analysis: new tools and concepts from the cloud (20)

iland Internet Solutions: Leveraging Cassandra for real-time multi-datacenter...
iland Internet Solutions: Leveraging Cassandra for real-time multi-datacenter...iland Internet Solutions: Leveraging Cassandra for real-time multi-datacenter...
iland Internet Solutions: Leveraging Cassandra for real-time multi-datacenter...
 
Leveraging Cassandra for real-time multi-datacenter public cloud analytics
Leveraging Cassandra for real-time multi-datacenter public cloud analyticsLeveraging Cassandra for real-time multi-datacenter public cloud analytics
Leveraging Cassandra for real-time multi-datacenter public cloud analytics
 
Linux Systems Performance 2016
Linux Systems Performance 2016Linux Systems Performance 2016
Linux Systems Performance 2016
 
Container Performance Analysis Brendan Gregg, Netflix
Container Performance Analysis Brendan Gregg, NetflixContainer Performance Analysis Brendan Gregg, Netflix
Container Performance Analysis Brendan Gregg, Netflix
 
From swarm to swam-mode in the CERN container service
From swarm to swam-mode in the CERN container serviceFrom swarm to swam-mode in the CERN container service
From swarm to swam-mode in the CERN container service
 
Instrumenting the real-time web: Node.js in production
Instrumenting the real-time web: Node.js in productionInstrumenting the real-time web: Node.js in production
Instrumenting the real-time web: Node.js in production
 
Benchmarking Solr Performance at Scale
Benchmarking Solr Performance at ScaleBenchmarking Solr Performance at Scale
Benchmarking Solr Performance at Scale
 
Summit demystifying systemd1
Summit demystifying systemd1Summit demystifying systemd1
Summit demystifying systemd1
 
Tuning parallelcodeonsolaris005
Tuning parallelcodeonsolaris005Tuning parallelcodeonsolaris005
Tuning parallelcodeonsolaris005
 
System Device Tree and Lopper: Concrete Examples - ELC NA 2022
System Device Tree and Lopper: Concrete Examples - ELC NA 2022System Device Tree and Lopper: Concrete Examples - ELC NA 2022
System Device Tree and Lopper: Concrete Examples - ELC NA 2022
 
SRV402 Deep Dive on Amazon EC2 Instances, Featuring Performance Optimization ...
SRV402 Deep Dive on Amazon EC2 Instances, Featuring Performance Optimization ...SRV402 Deep Dive on Amazon EC2 Instances, Featuring Performance Optimization ...
SRV402 Deep Dive on Amazon EC2 Instances, Featuring Performance Optimization ...
 
PostgreSQL High Availability in a Containerized World
PostgreSQL High Availability in a Containerized WorldPostgreSQL High Availability in a Containerized World
PostgreSQL High Availability in a Containerized World
 
Tối ưu hiệu năng đáp ứng các yêu cầu của hệ thống 4G core
Tối ưu hiệu năng đáp ứng các yêu cầu của hệ thống 4G coreTối ưu hiệu năng đáp ứng các yêu cầu của hệ thống 4G core
Tối ưu hiệu năng đáp ứng các yêu cầu của hệ thống 4G core
 
Speed up R with parallel programming in the Cloud
Speed up R with parallel programming in the CloudSpeed up R with parallel programming in the Cloud
Speed up R with parallel programming in the Cloud
 
Vinetalk: The missing piece for cluster managers to enable accelerator sharing
Vinetalk: The missing piece for cluster managers to enable accelerator sharingVinetalk: The missing piece for cluster managers to enable accelerator sharing
Vinetalk: The missing piece for cluster managers to enable accelerator sharing
 
002 - Introduction to CUDA Programming_1.ppt
002 - Introduction to CUDA Programming_1.ppt002 - Introduction to CUDA Programming_1.ppt
002 - Introduction to CUDA Programming_1.ppt
 
Performance Monitoring: Understanding Your Scylla Cluster
Performance Monitoring: Understanding Your Scylla ClusterPerformance Monitoring: Understanding Your Scylla Cluster
Performance Monitoring: Understanding Your Scylla Cluster
 
Spark and Deep Learning frameworks with distributed workloads
Spark and Deep Learning frameworks with distributed workloadsSpark and Deep Learning frameworks with distributed workloads
Spark and Deep Learning frameworks with distributed workloads
 
Oracle Performance Tuning Fundamentals
Oracle Performance Tuning FundamentalsOracle Performance Tuning Fundamentals
Oracle Performance Tuning Fundamentals
 
Oracle GoldenGate Architecture Performance
Oracle GoldenGate Architecture PerformanceOracle GoldenGate Architecture Performance
Oracle GoldenGate Architecture Performance
 

More from Brendan Gregg

More from Brendan Gregg (20)

IntelON 2021 Processor Benchmarking
IntelON 2021 Processor BenchmarkingIntelON 2021 Processor Benchmarking
IntelON 2021 Processor Benchmarking
 
Performance Wins with eBPF: Getting Started (2021)
Performance Wins with eBPF: Getting Started (2021)Performance Wins with eBPF: Getting Started (2021)
Performance Wins with eBPF: Getting Started (2021)
 
Systems@Scale 2021 BPF Performance Getting Started
Systems@Scale 2021 BPF Performance Getting StartedSystems@Scale 2021 BPF Performance Getting Started
Systems@Scale 2021 BPF Performance Getting Started
 
Computing Performance: On the Horizon (2021)
Computing Performance: On the Horizon (2021)Computing Performance: On the Horizon (2021)
Computing Performance: On the Horizon (2021)
 
BPF Internals (eBPF)
BPF Internals (eBPF)BPF Internals (eBPF)
BPF Internals (eBPF)
 
Performance Wins with BPF: Getting Started
Performance Wins with BPF: Getting StartedPerformance Wins with BPF: Getting Started
Performance Wins with BPF: Getting Started
 
YOW2020 Linux Systems Performance
YOW2020 Linux Systems PerformanceYOW2020 Linux Systems Performance
YOW2020 Linux Systems Performance
 
re:Invent 2019 BPF Performance Analysis at Netflix
re:Invent 2019 BPF Performance Analysis at Netflixre:Invent 2019 BPF Performance Analysis at Netflix
re:Invent 2019 BPF Performance Analysis at Netflix
 
UM2019 Extended BPF: A New Type of Software
UM2019 Extended BPF: A New Type of SoftwareUM2019 Extended BPF: A New Type of Software
UM2019 Extended BPF: A New Type of Software
 
LISA2019 Linux Systems Performance
LISA2019 Linux Systems PerformanceLISA2019 Linux Systems Performance
LISA2019 Linux Systems Performance
 
LPC2019 BPF Tracing Tools
LPC2019 BPF Tracing ToolsLPC2019 BPF Tracing Tools
LPC2019 BPF Tracing Tools
 
LSFMM 2019 BPF Observability
LSFMM 2019 BPF ObservabilityLSFMM 2019 BPF Observability
LSFMM 2019 BPF Observability
 
YOW2018 CTO Summit: Working at netflix
YOW2018 CTO Summit: Working at netflixYOW2018 CTO Summit: Working at netflix
YOW2018 CTO Summit: Working at netflix
 
eBPF Perf Tools 2019
eBPF Perf Tools 2019eBPF Perf Tools 2019
eBPF Perf Tools 2019
 
YOW2018 Cloud Performance Root Cause Analysis at Netflix
YOW2018 Cloud Performance Root Cause Analysis at NetflixYOW2018 Cloud Performance Root Cause Analysis at Netflix
YOW2018 Cloud Performance Root Cause Analysis at Netflix
 
BPF Tools 2017
BPF Tools 2017BPF Tools 2017
BPF Tools 2017
 
NetConf 2018 BPF Observability
NetConf 2018 BPF ObservabilityNetConf 2018 BPF Observability
NetConf 2018 BPF Observability
 
FlameScope 2018
FlameScope 2018FlameScope 2018
FlameScope 2018
 
ATO Linux Performance 2018
ATO Linux Performance 2018ATO Linux Performance 2018
ATO Linux Performance 2018
 
LISA17 Container Performance Analysis
LISA17 Container Performance AnalysisLISA17 Container Performance Analysis
LISA17 Container Performance Analysis
 

Recently uploaded

Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Victor Rentea
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 

Recently uploaded (20)

Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontology
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering Developers
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 

Performance Analysis: new tools and concepts from the cloud

  • 1. Performance Analysis: new tools and concepts from the cloud Brendan Gregg Lead Performance Engineer, Joyent brendan.gregg@joyent.com SCaLE10x Jan, 2012
  • 2. whoami • I do performance analysis • I also write performance tools out of necessity • Was Brendan @ Sun Microsystems, Oracle, now Joyent
  • 3. Joyent • Cloud computing provider • Cloud computing software • SmartOS • host OS, and guest via OS virtualization • Linux, Windows • guest via KVM
  • 4. Agenda • Data • Example problems & solutions • How cloud environments complicate performance • Theory • Performance analysis • Summarize new tools & concepts • This talk uses SmartOS and DTrace to illustrate concepts that are applicable to most OSes.
  • 5. Data • Example problems: • CPU • Memory • Disk • Network • Some have neat solutions, some messy, some none • This is real world • Some I’ve covered before, some I haven’t
  • 6. CPU
  • 7. CPU utilization: problem • Would like to identify: • single or multiple CPUs at 100% utilization • average, minimum and maximum CPU utilization • CPU utilization balance (tight or loose distribution) • time-based characteristics changing/bursting? burst interval, burst length • For small to large environments • entire datacenters or clouds
  • 8. CPU utilization • mpstat(1) has the data. 1 second, 1 server (16 CPUs):
  • 9. CPU utilization • Scaling to 60 seconds, 1 server:
  • 10. CPU utilization • Scaling to entire datacenter, 60 secs, 5312 CPUs:
  • 11. CPU utilization • Line graphs can solve some problems: • x-axis: time, 60 seconds • y-axis: utilization
  • 12. CPU utilization • ... but don’t scale well to individual devices • 5312 CPUs, each as a line:
  • 13. CPU utilization • Pretty, but scale limited as well:
  • 14. CPU utilization • Utilization as a heat map: • x-axis: time, y-axis: utilization • z-axis (color): number of CPUs
  • 15. CPU utilization • Available in Cloud Analytics (Joyent) • Clicking highlights and shows details; eg, hostname:
  • 16. CPU utilization • Utilization heat map also suitable and used for: • disks • network interfaces • Utilization as a metric can be a bit misleading • really a percent busy over a time interval • devices may accept more work at 100% busy • may not directly relate to performance impact
  • 17. CPU utilization: summary • Data readily available • Using a new visualization
  • 18. CPU usage • Given a CPU is hot, what is it doing? • Beyond just vmstat’s usr/sys ratio • Profiling (sampling at an interval) the program counter or stack back trace • user-land stack for %usr • kernel stack for %sys • Many tools can do this to some degree • Developer Studios/DTrace/oprofile/...
  • 19. CPU usage: profiling • Frequency count on-CPU user-land stack traces: # dtrace -x ustackframes=100 -n 'profile-997 /execname == "mysqld"/ { @[ustack()] = count(); } tick-60s { exit(0); }' dtrace: description 'profile-997 ' matched 2 probes CPU ID FUNCTION:NAME 1 75195 :tick-60s [...] libc.so.1`__priocntlset+0xa libc.so.1`getparam+0x83 libc.so.1`pthread_getschedparam+0x3c libc.so.1`pthread_setschedprio+0x1f mysqld`_Z16dispatch_command19enum_server_commandP3THDPcj+0x9ab mysqld`_Z10do_commandP3THD+0x198 mysqld`handle_one_connection+0x1a6 libc.so.1`_thrp_setup+0x8d libc.so.1`_lwp_start 4884 mysqld`_Z13add_to_statusP17system_status_varS0_+0x47 mysqld`_Z22calc_sum_of_all_statusP17system_status_var+0x67 mysqld`_Z16dispatch_command19enum_server_commandP3THDPcj+0x1222 mysqld`_Z10do_commandP3THD+0x198 mysqld`handle_one_connection+0x1a6 libc.so.1`_thrp_setup+0x8d libc.so.1`_lwp_start 5530
  • 20. CPU usage: profiling • Frequency count on-CPU user-land stack traces: # dtrace -x ustackframes=100 -n 'profile-997 /execname == "mysqld"/ { @[ustack()] = count(); } tick-60s { exit(0); }' dtrace: description 'profile-997 ' matched 2 probes CPU ID FUNCTION:NAME 1 75195 :tick-60s [...] libc.so.1`__priocntlset+0xa libc.so.1`getparam+0x83 libc.so.1`pthread_getschedparam+0x3c libc.so.1`pthread_setschedprio+0x1f mysqld`_Z16dispatch_command19enum_server_commandP3THDPcj+0x9ab mysqld`_Z10do_commandP3THD+0x198 mysqld`handle_one_connection+0x1a6 libc.so.1`_thrp_setup+0x8d libc.so.1`_lwp_start 4884 Over 500,000 lines truncated mysqld`_Z13add_to_statusP17system_status_varS0_+0x47 mysqld`_Z22calc_sum_of_all_statusP17system_status_var+0x67 mysqld`_Z16dispatch_command19enum_server_commandP3THDPcj+0x1222 mysqld`_Z10do_commandP3THD+0x198 mysqld`handle_one_connection+0x1a6 libc.so.1`_thrp_setup+0x8d libc.so.1`_lwp_start 5530
  • 22. CPU usage: visualization • Visualized as a “Flame Graph”:
  • 23. CPU usage: Flame Graphs • Just some Perl that turns DTrace output into an interactive SVG: mouse-over elements for details • It’s on github • http://github.com/brendangregg/FlameGraph • Works on kernel stacks, and both user+kernel • Shouldn’t be hard to have it process oprofile, etc.
  • 24. CPU usage: on the Cloud • Flame Graphs were born out of necessity on Cloud environments: • Perf issues need quick resolution (you just got hackernews’d) • Everyone is running different versions of everything (don’t assume you’ve seen the last of old CPU-hot code-path issues that have been fixed)
  • 25. CPU usage: summary • Data can be available • For cloud computing: easy for operators to fetch on OS virtualized environments; otherwise agent driven, and possibly other difficulties (access to CPU instrumentation counter-based interrupts) • Using a new visualization
  • 26. CPU latency • CPU dispatcher queue latency • thread is ready-to-run, and waiting its turn • Observable in coarse ways: • vmstat’s r • high load averages • Less course, with microstate accounting • prstat -mL’s LAT • How much is it affecting application performance?
  • 27. CPU latency: zonedispqlat.d • Using DTrace to trace kernel scheduler events: #./zonedisplat.d Tracing... Note: outliers (> 1 secs) may be artifacts due to the use of scalar globals (sorry). CPU disp queue latency by zone (ns): dbprod-045 value 512 1024 2048 4096 8192 16384 32768 65536 131072 262144 524288 1048576 2097152 4194304 8388608 16777216 [...] ------------- Distribution ------------- count | 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 10210 |@@@@@@@@@@ 3829 |@ 514 | 94 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0
CPU latency: zonedispqlat.d
• CPU dispatcher queue latency by zonename (zonedispqlat.d), work in progress:
#!/usr/sbin/dtrace -s

#pragma D option quiet

dtrace:::BEGIN
{
	printf("Tracing...\n");
	printf("Note: outliers (> 1 secs) may be artifacts due to the ");
	printf("use of scalar globals (sorry).\n\n");
}

sched:::enqueue
{
	/* scalar global (I don't think this can be thread local) */
	start[args[0]->pr_lwpid, args[1]->pr_pid] = timestamp;
}

sched:::dequeue
/this->start = start[args[0]->pr_lwpid, args[1]->pr_pid]/
{
	this->time = timestamp - this->start;
	/* workaround since zonename isn't a member of args[1]... */
	this->zone = ((proc_t *)args[1]->pr_addr)->p_zone->zone_name;
	@[stringof(this->zone)] = quantize(this->time);
	start[args[0]->pr_lwpid, args[1]->pr_pid] = 0;
}

tick-1sec
{
	printf("CPU disp queue latency by zone (ns):\n");
	printa(@);
	trunc(@);
}
• Save timestamp on enqueue; calculate delta on dequeue
CPU latency: zonedispqlat.d
• Instead of zonename, this could be process name, ... (see the sketch below)
• Tracing scheduler enqueue/dequeue events and saving timestamps costs
  CPU overhead
  • they are frequent
• I’d prefer to only trace dequeue, and reuse the existing microstate
  accounting timestamps
  • but one problem is a clash between unscaled and scaled timestamps
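• As a sketch of that variant: the same enqueue/dequeue pairing,
  aggregated by process name rather than zone (untested; the overhead
  caveats above still apply):

#!/usr/sbin/dtrace -s

sched:::enqueue
{
	/* save the enqueue time, keyed by thread */
	start[args[0]->pr_lwpid, args[1]->pr_pid] = timestamp;
}

sched:::dequeue
/this->start = start[args[0]->pr_lwpid, args[1]->pr_pid]/
{
	/* aggregate dispatcher queue latency by process name */
	@[args[1]->pr_fname] = quantize(timestamp - this->start);
	start[args[0]->pr_lwpid, args[1]->pr_pid] = 0;
}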
CPU latency: on the Cloud
• With virtualization, you can have high CPU latency with idle CPUs,
  due to an instance consuming its quota
• OS virtualization
  • not visible in vmstat r
  • is visible as part of prstat -mL’s LAT
  • more kstats recently added to SmartOS, including nsec_waitrq
    (total run queue wait by zone)
• Hardware virtualization
  • vmstat st (stolen)
CPU latency: caps
• CPU cap latency from the host (zonecapslat.d):
#!/usr/sbin/dtrace -s

#pragma D option quiet

sched:::cpucaps-sleep
{
	start[args[0]->pr_lwpid, args[1]->pr_pid] = timestamp;
}

sched:::cpucaps-wakeup
/this->start = start[args[0]->pr_lwpid, args[1]->pr_pid]/
{
	this->time = timestamp - this->start;
	/* workaround since zonename isn't a member of args[1]... */
	this->zone = ((proc_t *)args[1]->pr_addr)->p_zone->zone_name;
	@[stringof(this->zone)] = quantize(this->time);
	start[args[0]->pr_lwpid, args[1]->pr_pid] = 0;
}

tick-1sec
{
	printf("CPU caps latency by zone (ns):\n");
	printa(@);
	trunc(@);
}
CPU latency: summary
• Partial data available
• New tools/metrics created
  • although current DTrace solutions have overhead; we should be able
    to improve that
  • although, the new kstats may be sufficient
Memory: problem
• Riak database has endless memory growth
  • expected 9GB; after two days:
$ prstat -c 1
Please wait...
   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
 21722 103        43G   40G cpu0    59    0  72:23:41 2.6% beam.smp/594
 15770 root     7760K  540K sleep   57    0  23:28:57 0.9% zoneadmd/5
[...]
• Eventually hits paging and terrible performance
  • needing a restart
• Is this a memory leak? Or application growth?
Memory: scope
• Identify the subsystem and team responsible:

      Subsystem     Team
      Application   Voxer
      Riak          Basho
      Erlang        Ericsson
      SmartOS       Joyent
Memory: heap profiling
• What is in the heap?
$ pmap 14719
14719:  beam.smp
 0000000000400000       2168K r-x--  /opt/riak/erts-5.8.5/bin/beam.smp
 000000000062D000        328K rw---  /opt/riak/erts-5.8.5/bin/beam.smp
 000000000067F000    4193540K rw---  /opt/riak/erts-5.8.5/bin/beam.smp
 00000001005C0000    4194296K rw---    [ anon ]
 00000002005BE000    4192016K rw---    [ anon ]
 0000000300382000    4193664K rw---    [ anon ]
 00000004002E2000    4191172K rw---    [ anon ]
 00000004FFFD3000    4194040K rw---    [ anon ]
 00000005FFF91000    4194028K rw---    [ anon ]
 00000006FFF4C000    4188812K rw---    [ anon ]
 00000007FF9EF000     588224K rw---    [ heap ]
[...]
• ... and why does it keep growing?
• Would like to answer these in production
• Without restarting apps. Experimentation (backend=mmap, other
  allocators) wasn’t working.
Memory: heap profiling
• libumem was used for multi-threaded performance
  • libumem == user-land slab allocator
• Detailed observability can be enabled, allowing heap profiling and
  leak detection
• While libumem was designed with speed and production use in mind,
  these features come with some cost (time and space), and aren’t on
  by default:
  UMEM_DEBUG=audit
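• A sketch of enabling auditing for a target application (the start
  command and library path are illustrative; UMEM_DEBUG, ::findleaks
  and ::umem_malloc_info are documented libumem/mdb features):

# run the target under libumem with auditing enabled:
export LD_PRELOAD=libumem.so.1
export UMEM_DEBUG=audit
riak start                      # illustrative application restart

# then inspect the live process (or a core) with mdb:
# mdb -p PID
#   > ::findleaks              # leak detection (requires audit)
#   > ::umem_malloc_info       # per-cache malloc summary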
Memory: heap profiling
• libumem provides some default observability
• Eg, slabs:
> ::umem_malloc_info
            CACHE  BUFSZ MAXMAL BUFMALLC  AVG_MAL   MALLOCED   OVERHEAD   %OVER
 0000000000707028      8      0        0        0          0          0    0.0%
 000000000070b028     16      8     8730        8      69836    1054998 1510.6%
 000000000070c028     32     16     8772       16     140352    1130491  805.4%
 000000000070f028     48     32  1148038       25   29127788  156179051  536.1%
 0000000000710028     64     48   344138       40   13765658   58417287  424.3%
 0000000000711028     80     64       36       62       2226       4806  215.9%
 0000000000714028     96     80     8934       79     705348    1168558  165.6%
 0000000000715028    112     96  1347040       87  117120208  190389780  162.5%
 0000000000718028    128    112   253107      111   28011923   42279506  150.9%
 000000000071a028    160    144    40529      118    4788681    6466801  135.0%
 000000000071b028    192    176      140      155      21712      25818  118.9%
 000000000071e028    224    208       43      188       8101       6497   80.1%
 000000000071f028    256    240      133      229      30447      26211   86.0%
 0000000000720028    320    304       56      276      15455      12276   79.4%
 0000000000723028    384    368       35      335      11726       7220   61.5%
[...]
Memory: heap profiling
• ... and heap (captured @14GB RSS):
> ::vmem
            ADDR             NAME       INUSE        TOTAL  SUCCEED  FAIL
fffffd7ffebed4a0         sbrk_top  9090404352  14240165888  4298117 84403
fffffd7ffebee0a8        sbrk_heap  9090404352   9090404352  4298117     0
fffffd7ffebeecb0    vmem_internal   664616960    664616960    79621     0
fffffd7ffebef8b8         vmem_seg   651993088    651993088    79589     0
fffffd7ffebf04c0        vmem_hash    12583424     12587008       27     0
fffffd7ffebf10c8        vmem_vmem       46200        55344       15     0
00000000006e7000    umem_internal   352862464    352866304    88746     0
00000000006e8000       umem_cache      113696       180224       44     0
00000000006e9000        umem_hash    13091328     13099008       86     0
00000000006ea000         umem_log           0            0        0     0
00000000006eb000 umem_firewall_va           0            0        0     0
00000000006ec000    umem_firewall           0            0        0     0
00000000006ed000    umem_oversize  5218777974   5520789504  3822051     0
00000000006f0000    umem_memalign           0            0        0     0
0000000000706000     umem_default  2552131584   2552131584   307699     0
• The heap is 9 GB (as expected), but sbrk_top total is 14 GB (equal
  to RSS). And growing.
• Are there Gbyte-sized malloc()/free()s?
Memory: malloc() profiling
• This tool (one-liner) profiles malloc() request sizes:
# dtrace -n 'pid$target::malloc:entry { @ = quantize(arg0); }' -p 17472
dtrace: description 'pid$target::malloc:entry ' matched 3 probes
^C
           value  ------------- Distribution ------------- count
               2 |                                         0
               4 |                                         3
               8 |@                                        5927
              16 |@@@@                                     41818
              32 |@@@@@@@@@                                81991
              64 |@@@@@@@@@@@@@@@@@@                       169888
             128 |@@@@@@@                                  69891
             256 |                                         2257
             512 |                                         406
            1024 |                                         893
            2048 |                                         146
            4096 |                                         1467
            8192 |                                         755
           16384 |                                         950
           32768 |                                         83
           65536 |                                         31
          131072 |                                         11
          262144 |                                         15
          524288 |                                         0
         1048576 |                                         1
         2097152 |                                         0
• No huge malloc()s, but RSS continues to climb.
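• A variation on the same one-liner: keying the size distribution by
  user stack shows who is requesting each size (a sketch; the same
  overhead caveats apply):

# malloc() request sizes, by calling stack (5 frames):
dtrace -n 'pid$target::malloc:entry { @[ustack(5)] = quantize(arg0); }' -p 17472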
Memory: heap growth
• Tracing why the heap grows via brk():
# dtrace -n 'syscall::brk:entry /execname == "beam.smp"/ { ustack(); }'
dtrace: description 'syscall::brk:entry ' matched 1 probe
CPU     ID                    FUNCTION:NAME
 10     18                        brk:entry
              libc.so.1`_brk_unlocked+0xa
              libumem.so.1`vmem_sbrk_alloc+0x84
              libumem.so.1`vmem_xalloc+0x669
              libumem.so.1`vmem_alloc+0x14f
              libumem.so.1`vmem_xalloc+0x669
              libumem.so.1`vmem_alloc+0x14f
              libumem.so.1`umem_alloc+0x72
              libumem.so.1`malloc+0x59
              libstdc++.so.6.0.14`_Znwm+0x20
              libstdc++.so.6.0.14`_Znam+0x9
              eleveldb.so`_ZN7leveldb9ReadBlockEPNS_16RandomAccessFileERKNS_11Rea...
              eleveldb.so`_ZN7leveldb5Table11BlockReaderEPvRKNS_11ReadOptionsERKN...
              eleveldb.so`_ZN7leveldb12_GLOBAL__N_116TwoLevelIterator13InitDataBl...
              eleveldb.so`_ZN7leveldb12_GLOBAL__N_116TwoLevelIterator4SeekERKNS_5...
              eleveldb.so`_ZN7leveldb12_GLOBAL__N_116TwoLevelIterator4SeekERKNS_5...
              eleveldb.so`_ZN7leveldb12_GLOBAL__N_115MergingIterator4SeekERKNS_5S...
              eleveldb.so`_ZN7leveldb12_GLOBAL__N_16DBIter4SeekERKNS_5SliceE+0xcc
              eleveldb.so`eleveldb_get+0xd3
              beam.smp`process_main+0x6939
              beam.smp`sched_thread_func+0x1cf
              beam.smp`thr_wrapper+0xbe
• This shows the user-land stack trace for every heap growth
Memory: heap growth
• More DTrace showed the size of the malloc()s causing the brk()s:
# dtrace -x dynvarsize=4m -n '
    pid$target::malloc:entry { self->size = arg0; }
    syscall::brk:entry /self->size/ { printf("%d bytes", self->size); }
    pid$target::malloc:return { self->size = 0; }' -p 17472
dtrace: description 'pid$target::malloc:entry ' matched 7 probes
CPU     ID                    FUNCTION:NAME
  0     44                        brk:entry 8343520 bytes
  0     44                        brk:entry 8343520 bytes
[...]
• These 8 Mbyte malloc()s grew the heap
• Even though the heap has Gbytes not in use
• This is starting to look like an OS issue
Memory: allocator internals
• More tools were created:
  • Show memory entropy (+ malloc - free) along with heap growth, over time
  • Show codepath taken for allocations, comparing successful with
    unsuccessful (heap growth)
  • Show allocator internals: sizes, options, flags
• And run in the production environment
  • Briefly; tracing frequent allocs does cost overhead
• Casting light into what was a black box
Memory: allocator internals
  4    <- vmem_xalloc                          0
  4    -> _sbrk_grow_aligned                   4096
  4    <- _sbrk_grow_aligned                   17155911680
  4    -> vmem_xalloc                          7356400
  4      | vmem_xalloc:entry                   umem_oversize
  4      -> vmem_alloc                         7356416
  4        -> vmem_xalloc                      7356416
  4          | vmem_xalloc:entry               sbrk_heap
  4          -> vmem_sbrk_alloc                7356416
  4            -> vmem_alloc                   7356416
  4              -> vmem_xalloc                7356416
  4                | vmem_xalloc:entry         sbrk_top
  4                -> vmem_reap                16777216
  4                <- vmem_reap                3178535181209758
  4                | vmem_xalloc:return        vmem_xalloc() == NULL,
      vm: sbrk_top, size: 7356416, align: 4096, phase: 0, nocross: 0,
      min: 0, max: 0, vmflag: 1
              libumem.so.1`vmem_xalloc+0x80f
              libumem.so.1`vmem_sbrk_alloc+0x33
              libumem.so.1`vmem_xalloc+0x669
              libumem.so.1`vmem_alloc+0x14f
              libumem.so.1`vmem_xalloc+0x669
              libumem.so.1`vmem_alloc+0x14f
              libumem.so.1`umem_alloc+0x72
              libumem.so.1`malloc+0x59
              libstdc++.so.6.0.3`_Znwm+0x2b
              libstdc++.so.6.0.3`_ZNSs4_Rep9_S_createEmmRKSaIcE+0x7e
Memory: solution
• These new tools and metrics pointed to the allocation algorithm,
  “instant fit”
  • Someone had suggested this earlier; the tools provided solid
    evidence that this really was the case here
• A new version of libumem was built to force use of VM_BESTFIT
  • and added by Robert Mustacchi as a tunable: UMEM_OPTIONS=allocator=best
• Customer restarted Riak with the new libumem version
• Problem solved
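• A sketch of what the restart looks like with the tunable set (the
  library path and start command are illustrative):

# run the app with the rebuilt libumem, forcing best fit:
export LD_PRELOAD=/opt/local/lib/libumem.so.1   # the rebuilt library
export UMEM_OPTIONS=allocator=best
riak start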
Memory: on the Cloud
• With OS virtualization, you can have paging without scanning
  • paging == swapping blocks with physical storage
  • swapping == swapping entire threads between main memory and
    physical storage
• Resource control paging is unrelated to the page scanner, so there is
  no vmstat scan rate (sr) despite anonymous paging
• More new tools: DTrace vminfo:::anonpgin by process name, zonename
  (see the sketch below)
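• A sketch of such a tool as a one-liner (anonpgin is a vminfo provider
  probe; zonename and execname are DTrace built-ins):

# count anonymous page-ins by zone and process:
dtrace -n 'vminfo:::anonpgin { @[zonename, execname] = count(); }'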
Memory: summary
• Superficial data available, detailed info not
  • not by default
• Many new tools were created
  • not easy, but made possible with DTrace
Disk
Disk: problem
• Application performance issues
• Disks look busy (iostat)
• Blame the disks?
$ iostat -xnz 1
[...]
                    extended device statistics
    r/s    w/s    kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
  124.0  334.9 15677.2 40484.9  0.0  1.0    0.0    2.2   1  69 c0t1d0
                    extended device statistics
    r/s    w/s    kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
  114.0  407.1 14595.9 49409.1  0.0  0.8    0.0    1.5   1  56 c0t1d0
                    extended device statistics
    r/s    w/s    kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
   85.0  438.0 10814.8 53242.1  0.0  0.8    0.0    1.6   1  57 c0t1d0
• Many graphical tools are built upon iostat
Disk: on the Cloud
• Tenants can’t see each other
  • Maybe a neighbor is doing a backup?
  • Maybe a neighbor is running a benchmark?
  • Can’t see their processes (top/prstat)
• Blame what you can’t see
Disk: VFS
• Applications usually talk to a file system
  • and are hurt by file system latency
• Disk I/O can be:
  • unrelated to the application: asynchronous tasks
  • inflated from what the application requested
  • deflated from what the application requested
  • blind to issues caused higher up the kernel stack
Disk: issues with iostat(1)
• Unrelated:
  • other applications / tenants
  • file system prefetch
  • file system dirty data flushing
• Inflated:
  • rounded up to the next file system record size
  • extra metadata for on-disk format
  • read-modify-write of RAID5
Disk: issues with iostat(1)
• Deflated:
  • read caching
  • write buffering
• Blind:
  • lock contention in the file system
  • CPU usage by the file system
  • file system software bugs
  • file system queue latency
Disk: issues with iostat(1)
• Blind (continued):
  • disk cache flush latency (if your file system does it)
  • file system I/O throttling latency
• I/O throttling is a new ZFS feature for cloud environments
  • adds artificial latency to file system I/O to throttle it
  • added by Bill Pijewski and Jerry Jelinek of Joyent
Disk: file system latency
• Using DTrace to summarize ZFS read latency:
$ dtrace -n 'fbt::zfs_read:entry { self->start = timestamp; }
    fbt::zfs_read:return /self->start/ { @["ns"] = quantize(timestamp -
    self->start); self->start = 0; }'
dtrace: description 'fbt::zfs_read:entry ' matched 2 probes
^C
  ns
           value  ------------- Distribution ------------- count
             512 |                                         0
            1024 |@                                        6
            2048 |@@                                       18
            4096 |@@@@@@@                                  79      <- cache reads
            8192 |@@@@@@@@@@@@@@@@@                        191
           16384 |@@@@@@@@@@                               112
           32768 |@                                        14
           65536 |                                         1
          131072 |                                         1
          262144 |                                         0
          524288 |                                         0
         1048576 |                                         0
         2097152 |                                         0
         4194304 |@@@                                      31      <- disk reads
         8388608 |@                                        9
        16777216 |                                         0
Disk: file system latency
• Tracing zfs events using zfsslower.d:
# ./zfsslower.d 10
TIME                  PROCESS  D    KB   ms  FILE
2011 May 17 01:23:12  mysqld   R    16   19  /z01/opt/mysql5-64/data/xxxxx/xxxxx.ibd
2011 May 17 01:23:13  mysqld   W    16   10  /z01/var/mysql/xxxxx/xxxxx.ibd
2011 May 17 01:23:33  mysqld   W    16   11  /z01/var/mysql/xxxxx/xxxxx.ibd
2011 May 17 01:23:33  mysqld   W    16   10  /z01/var/mysql/xxxxx/xxxxx.ibd
2011 May 17 01:23:51  httpd    R    56   14  /z01/home/xxxxx/xxxxx/xxxxx/xxxxx/xxxxx
^C
• The argument is the minimum latency in milliseconds
Disk: file system latency
• Can trace this from other locations too:
  • VFS layer: filter on desired file system types
  • syscall layer: filter on file descriptors for file systems (see the
    sketch below)
  • application layer: trace file I/O calls
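• For example, a syscall-layer sketch, filtering reads to ZFS file
  descriptors via DTrace’s fds[] array; unlike the earlier fbt
  one-liner, this measures latency closer to what the application
  experiences:

# read() latency for ZFS file descriptors, as a distribution:
dtrace -n '
syscall::read:entry /fds[arg0].fi_fs == "zfs"/ { self->ts = timestamp; }
syscall::read:return /self->ts/ {
	@["ns"] = quantize(timestamp - self->ts);
	self->ts = 0;
}'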
Disk: file system latency
• And using SystemTap:
# ./vfsrlat.stp
Tracing... Hit Ctrl-C to end
^C
[...]
ext4 (ns):
  value |-------------------------------------------------- count
    256 |                                                       0
    512 |                                                       0
   1024 |                                                      16
   2048 |                                                      17
   4096 |                                                       4
   8192 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 16321
  16384 |                                                      50
  32768 |                                                       1
  65536 |                                                      13
 131072 |                                                       0
 262144 |                                                       0
• Traces vfs.read to vfs.read.return, and gets the FS type via:
  $file->f_path->dentry->d_inode->i_sb->s_type->name
• Warning: this script has crashed ubuntu/CentOS; I’m told RHEL is better
Disk: file system visualizations
• File system latency as a heat map (Cloud Analytics):
• This screenshot shows severe outliers
Disk: file system visualizations
• Sometimes the heat map is very surprising:
• This screenshot is from the Oracle ZFS Storage Appliance
Disk: summary
• Misleading data available
• New tools/metrics created
• Latency visualizations
Network: problem
• TCP SYNs queue in-kernel until they are accept()ed
• The queue length is the TCP listen backlog
  • may be set in listen()
  • and limited by a system tunable (usually 128)
    • on SmartOS: tcp_conn_req_max_q (see the sketch below)
• What if the queue remains full
  • eg, application is overwhelmed with other work, or CPU starved
• ... and another SYN arrives?
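• A sketch of checking and raising the system-wide cap with ndd (the
  value 1024 is illustrative; applications must still pass a large
  enough backlog to listen()):

# view, then raise, the maximum listen backlog:
ndd -get /dev/tcp tcp_conn_req_max_q
ndd -set /dev/tcp tcp_conn_req_max_q 1024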
Network: TCP listen drops
• The packet is dropped by the kernel
  • fortunately a counter is bumped:
$ netstat -s | grep Drop
tcpTimRetransDrop   =     56    tcpTimKeepalive     =   2582
tcpTimKeepaliveProbe=   1594    tcpTimKeepaliveDrop =     41
tcpListenDrop       =3089298    tcpListenDropQ0     =      0
tcpHalfOpenDrop     =      0    tcpOutSackRetrans   =1400832
icmpOutDrops        =      0    icmpOutErrors       =      0
sctpTimRetrans      =      0    sctpTimRetransDrop  =      0
sctpTimHearBeatProbe=      0    sctpTimHearBeatDrop =      0
sctpListenDrop      =      0    sctpInClosed        =      0
• The remote host waits, and then retransmits
  • TCP retransmit interval; usually 1 or 3 seconds
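• A sketch of watching that counter over time, assuming the MIB
  statistic is exported through the tcp kstat module (statistic name
  assumed; interval of 1 second):

# print the listen drop counter once per second:
kstat -p tcp:0:tcp:listenDrop 1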
Network: predicting drops
• How do we know if we are close to dropping?
• An early warning
Network: tcpconnreqmaxq.d
• DTrace script traces drop events, if they occur:
# ./tcpconnreqmaxq.d
Tracing... Hit Ctrl-C to end.
2012 Jan 19 01:37:52 tcp_input_listener:tcpListenDrop cpid:11504
2012 Jan 19 01:37:52 tcp_input_listener:tcpListenDrop cpid:11504
2012 Jan 19 01:37:52 tcp_input_listener:tcpListenDrop cpid:11504
2012 Jan 19 01:37:52 tcp_input_listener:tcpListenDrop cpid:11504
2012 Jan 19 01:37:52 tcp_input_listener:tcpListenDrop cpid:11504
2012 Jan 19 01:37:52 tcp_input_listener:tcpListenDrop cpid:11504
2012 Jan 19 01:37:52 tcp_input_listener:tcpListenDrop cpid:11504
2012 Jan 19 01:37:52 tcp_input_listener:tcpListenDrop cpid:11504
2012 Jan 19 01:37:52 tcp_input_listener:tcpListenDrop cpid:11504
2012 Jan 19 01:37:52 tcp_input_listener:tcpListenDrop cpid:11504
2012 Jan 19 01:37:52 tcp_input_listener:tcpListenDrop cpid:11504
2012 Jan 19 01:37:52 tcp_input_listener:tcpListenDrop cpid:11504
2012 Jan 19 01:37:52 tcp_input_listener:tcpListenDrop cpid:11504
2012 Jan 19 01:37:52 tcp_input_listener:tcpListenDrop cpid:11504
[...]
• ... and when Ctrl-C is hit:
Network: tcpconnreqmaxq.d
tcp_conn_req_cnt_q distributions:

  cpid:3063 max_q:8
           value  ------------- Distribution ------------- count
              -1 |                                         0
               0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1
               1 |                                         0

  cpid:11504 max_q:128                       <- max_q: value in use
           value  ------------- Distribution ------------- count
              -1 |                                         0
               0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@     7279    <- length of queue,
               1 |@@                                      405        measured on SYN event
               2 |@                                       255
               4 |@                                       138
               8 |                                        81
              16 |                                        83
              32 |                                        62
              64 |                                        67
             128 |                                        34
             256 |                                        0

tcpListenDrops:
  cpid:11504 max_q:128                       34
Network: tcplistendrop.d
• More details can be fetched as needed:
# ./tcplistendrop.d
TIME                  SRC-IP          PORT     DST-IP           PORT
2012 Jan 19 01:22:49  10.17.210.103  25691 ->  192.192.240.212  80
2012 Jan 19 01:22:49  10.17.210.108  18423 ->  192.192.240.212  80
2012 Jan 19 01:22:49  10.17.210.116  38883 ->  192.192.240.212  80
2012 Jan 19 01:22:49  10.17.210.117  10739 ->  192.192.240.212  80
2012 Jan 19 01:22:49  10.17.210.112  27988 ->  192.192.240.212  80
2012 Jan 19 01:22:49  10.17.210.106  28824 ->  192.192.240.212  80
2012 Jan 19 01:22:49  10.12.143.16   65070 ->  192.192.240.212  80
2012 Jan 19 01:22:49  10.17.210.100  56392 ->  192.192.240.212  80
2012 Jan 19 01:22:49  10.17.210.99   24628 ->  192.192.240.212  80
2012 Jan 19 01:22:49  10.17.210.98   11686 ->  192.192.240.212  80
2012 Jan 19 01:22:49  10.17.210.101  34629 ->  192.192.240.212  80
[...]
• Just tracing the drop code-path
• Don’t need to pay the overhead of sniffing all packets
Network: DTrace code
• Key code from tcplistendrop.d:
fbt::tcp_input_listener:entry { self->mp = args[1]; }
fbt::tcp_input_listener:return { self->mp = 0; }

mib:::tcpListenDrop
/self->mp/
{
	this->iph = (ipha_t *)self->mp->b_rptr;
	this->tcph = (tcph_t *)(self->mp->b_rptr + 20);
	printf("%-20Y %-18s %-5d -> %-18s %-5d\n", walltimestamp,
	    inet_ntoa(&this->iph->ipha_src),
	    ntohs(*(uint16_t *)this->tcph->th_lport),
	    inet_ntoa(&this->iph->ipha_dst),
	    ntohs(*(uint16_t *)this->tcph->th_fport));
}
• This uses the unstable fbt provider
  • a stable tcp provider now exists, which is better for more common
    tasks - like counting connections by IP (see the sketch below)
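• For instance, a sketch using the stable provider (probe and argument
  names are from the tcp provider documentation): counting accepted
  inbound connections by remote address:

# accepted TCP connections by remote IP:
dtrace -n 'tcp:::accept-established { @[args[3]->tcps_raddr] = count(); }'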
Network: summary
• For TCP, while many counters are available, they are system-wide integers
• Custom tools can show more details
  • addresses and ports
  • kernel state
  • needs kernel access and dynamic tracing
Data Recap
• Problem types
  • CPU utilization      scalability
  • CPU usage            scalability
  • CPU latency          observability
  • Memory               observability
  • Disk                 observability
  • Network              observability
Data Recap
• Problem types, solution types
  • CPU utilization      scalability
  • CPU usage            scalability
  • CPU latency          observability
  • Memory               observability
  • Disk                 observability
  • Network              observability
• Solution types: visualizations, metrics
Performance Analysis
• Goals
  • Capacity planning
  • Issue analysis
Performance Issues
• Strategy
  • Step 1: is there a problem?
  • Step 2: which subsystem/team is responsible?
• It is difficult to get past these steps without reliable metrics
Problem Space
• Myths
  • Vendors provide good metrics with good coverage
  • The problem is to line-graph them
• Realities
  • Metrics can be wrong, incomplete, and misleading, requiring time
    and expertise to interpret
  • Line graphs can hide issues
Problem Space
• Cloud computing confuses matters further:
  • hiding metrics from neighbors
  • throttling performance due to invisible neighbors
Example Problems
• Included:
  • Understanding utilization across 5,312 CPUs
  • Using disk I/O metrics to explain application performance
  • A lack of metrics for memory growth, packet drops, ...
Example Solutions: tools
• Device utilization heat maps for CPUs
• Flame graphs for CPU profiling
• CPU dispatcher queue latency by zone
• CPU caps latency by zone
• malloc() size profiling
• Heap growth stack backtraces
• File system latency distributions
• File system latency tracing
• TCP accept queue length distribution
• TCP listen drop tracing with details
Key Concepts
• Visualizations
  • heat maps for device utilization and latency
  • flame graphs
• Custom metrics are often necessary
  • Latency-based, for issue analysis
  • If coding isn’t practical/timely, use dynamic tracing
• Cloud Computing
  • Provide observability (often to show what the problem isn’t)
  • Develop new metrics for resource control effects
DTrace
• Many problems were only solved thanks to DTrace
• In the SmartOS cloud environment:
  • The compute node (global zone) can DTrace everything (except for
    KVM guests, for which it has a limited view: resource I/O + some
    MMU events, so far)
  • SmartMachines (zones) have the DTrace syscall, profile (their
    user-land only), pid and USDT providers
  • Joyent Cloud Analytics uses DTrace from the global zone to give
    extended details to customers
Performance
• The more you know, the more you don’t
• Hopefully I’ve turned some unknown-unknowns into known-unknowns
Thank you
• Resources:
  • http://dtrace.org/blogs/brendan
  • More CPU utilization visualizations:
    http://dtrace.org/blogs/brendan/2011/12/18/visualizing-device-utilization/
  • Flame Graphs: http://dtrace.org/blogs/brendan/2011/12/16/flame-graphs/
    and http://github.com/brendangregg/FlameGraph
  • More iostat(1) & file system latency discussion:
    http://dtrace.org/blogs/brendan/tag/filesystem-2/
• Cloud Analytics:
  • OSCON slides: http://dtrace.org/blogs/dap/files/2011/07/ca-oscon-data.pdf
• Joyent: http://joyent.com
• brendan@joyent.com