Daniel Wittenberg's presentation on Scaling Nagios Core 4.
The presentation was given during the Nagios World Conference North America, held Sept 30 - Oct 2, 2013 in Saint Paul, MN. For more information on the conference (including photos and videos), visit: http://go.nagios.com/nwcna
2. About Me
● Unix/Linux admin since the mid-90s
● Nagios/NetSaint user since the early 2000s
● Owned/operated a consulting business for almost 10 years that provided distributed monitoring using Nagios
● Previously employed by a Fortune 50 insurance company
● Currently Monitoring Platform Manager at IPsoft Inc.
3. About IPsoft
● Provider of Remote Infrastructure Management and automation
services
● ITIL and Six Sigma compliant management framework
● Automation that resolves 56% of all incidents and 90% of L1 incidents
● Monitoring, Automation, Event Correlation, Management....
● Offices around the world in ten countries
● http://www.ipsoft.com
5. My Configuration
● ~700 Nagios Servers
● ~130,000 Monitored Devices
● ~3,000,000 Service Checks
● Mix of customized Nagios 3.2.3 and 4.0.0
● Scientific Linux 6.2/6.4
● Managed by Puppet 3.x
● 2/3 on VMware ESX; the rest are bare metal
● Adding new Nagios servers almost daily
6. What's different with Nagios 4
SPEED!
● Current testing shows it is, on average, 500% faster than 3.2.3
7. What's different with Nagios 4
Some things that would impact performance/stability
http://nagios.sourceforge.net/docs/nagioscore/4/en/whatsnew.html
● Embedded Perl – Gone
● external_command_buffer_slots - Gone
● The -x option (skip circular-path verification) is no longer needed in RC scripts
● Configuration verification algorithm changes: massive startup speed increase
● Event queue algorithm changes help with CPU utilization (see Andreas' 2012 presentation)
● Disk I/O reduced to virtually zero
● NEW query handler interface: better communication with the core
● NEW core workers: reduce I/O, memory, and CPU usage
● Completely rewritten spec file for better installs and debug modes
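The query handler can be poked directly from the shell. A minimal sketch, assuming the default socket path (check the qh_socket setting in your nagios.cfg) and the "@core loadctl" query; the NUL-terminated request format and the "@" vs "#" prefix convention are from the Nagios 4 QH docs and should be verified against your build:

```shell
#!/bin/sh
# Sketch: send one request to the Nagios 4 query handler socket.
# Assumptions: socket location and that "@core loadctl" is a valid query.
QH_SOCK=${QH_SOCK:-/usr/local/nagios/var/rw/nagios.qh}

# Requests are NUL-terminated; "@" reportedly keeps the connection open,
# while "#" is one-shot (connection closed after the response).
req() { printf '@core loadctl\0'; }

if [ -S "$QH_SOCK" ]; then
    req | socat -t 2 - "UNIX-CONNECT:$QH_SOCK"   # needs socat installed
else
    req | od -An -c                              # no socket: just show the raw request
fi
```

The nagios-qh.rb script mentioned later in this deck wraps the same socket protocol in Ruby.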
8. Perf Testing Lab Setup
● Servers are all ESX 5-based VMs on the same cluster
● Variable CPU cores, 4 GB memory
● Metrics used to consider a test failure:
● CPU Block Queue > 3
● CPU I/O Wait > 3
● CPU Idle < 10%
● Service Check Latency > 1s
● Host Check Latency > 1s
● 30-minute run time; a failure rate > 3% failed the test
● Fully automated increasing workload for consistent results
● Add 1 host + 1 service check at a time, trying to get “best case” numbers without check latency
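The pass/fail criteria above are simple threshold comparisons. A sketch of one sample evaluation (the function name is mine, and integer-only comparisons are a simplification; real latencies are fractional seconds):

```shell
#!/bin/sh
# check_sample: return 0 (pass) or 1 (fail) for one metrics sample.
# Args: block_queue iowait idle_pct svc_latency_s host_latency_s
check_sample() {
    fail=0
    [ "$1" -gt 3 ]  && fail=1   # CPU block queue > 3
    [ "$2" -gt 3 ]  && fail=1   # CPU I/O wait > 3
    [ "$3" -lt 10 ] && fail=1   # CPU idle < 10%
    [ "$4" -gt 1 ]  && fail=1   # service check latency > 1s
    [ "$5" -gt 1 ]  && fail=1   # host check latency > 1s
    return $fail
}

check_sample 1 2 50 0 0 && echo "PASS" || echo "FAIL"   # healthy sample
```

A harness would sample these every few seconds over the 30-minute run and fail the test once more than 3% of samples fail.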
10. Test Results

CPU Cores | Service Checks (3.2.3) | Service Checks (4.0.0rc1) | Difference
    1     |         1700           |          10500            |    617%
    2     |         3300           |          20800            |    630%
    4     |         6500           |          35300            |    543%
    8     |        11700           |          45100            |    385%
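The "Difference" column is simply the 4.0.0rc1 throughput expressed as a percentage of the 3.2.3 throughput; integer shell arithmetic reproduces the slide's figures (the helper name is mine):

```shell
#!/bin/sh
# pct OLD NEW -> NEW as an integer percentage of OLD
pct() { echo $(( $2 * 100 / $1 )); }

pct 1700  10500   # 1 core  -> 617
pct 3300  20800   # 2 cores -> 630
pct 6500  35300   # 4 cores -> 543
pct 11700 45100   # 8 cores -> 385
```

Note the gain shrinks as cores are added: the single-threaded parts of the core become the bottleneck sooner on bigger boxes.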
11. Other software used
● Customized livestatus based on Andreas updates for Nagios 4
● https://github.com/ageric/livestatus
● Developing custom “single pane” interface to replace CGI/Check_mk Multisite
● Developing full REST API to talk to QH, livestatus and config files
● nagios-qh.rb Query Handler interface to gather loadctl metrics
● https://www.dropbox.com/s/h6zn0ecycqb1xrc/nagios-qh.rb
● Custom load control daemon that talks to QH
● Custom Event Broker to send perf data directly to ActiveMQ for post-processing
● Custom agent: like NRPE on steroids, without limitations such as buffer size
12. Other performance tweaks
● Sysctl Changes
● net.ipv4.tcp_fin_timeout
● net.ipv4.tcp_keepalive_probes
● net.ipv4.tcp_tw_recycle
● net.ipv4.tcp_tw_reuse
● RAMDISK is no longer needed, but it's still in the default sysconfig/RC script for now
● Keep logging levels as low as possible
● Disable CGIs whenever possible
● Disable environment macros
● Don't use resource macros when you don't need to; they are not cached
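The sysctl keys above would typically go in a drop-in file. A sketch of such a file — the values are illustrative assumptions, not from the slides, and should be tuned per workload (note that tcp_tw_recycle is known to break connections from clients behind NAT and was removed from later kernels):

```ini
# /etc/sysctl.d/90-nagios.conf (example values -- tune per workload)
net.ipv4.tcp_fin_timeout = 15        ; reclaim FIN_WAIT sockets faster
net.ipv4.tcp_keepalive_probes = 5    ; fewer probes before dropping dead peers
net.ipv4.tcp_tw_recycle = 1          ; aggressive TIME_WAIT recycling (risky with NAT)
net.ipv4.tcp_tw_reuse = 1            ; reuse TIME_WAIT sockets for outbound connects
```

Apply with `sysctl -p /etc/sysctl.d/90-nagios.conf`, or reboot.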
13. Other performance tweaks
● /etc/security/limits.d/nagios.conf
● ipmon soft nofile 131072
● ipmon hard nofile 131072
● ipmon soft nproc 131072
● ipmon hard nproc 131072
● Nearly disable the OOM killer for the nagios process, so it's saved until last
● echo '-16' > /proc/<nagios pid>/oom_adj
● Re-nice puppet to run at nice level 10 so it has less impact
● /etc/sysconfig/puppet – NICELEVEL=10
● This should apply to any other running services that might take resources
14. Common Perf Tools
● vmstat / top – cpu/memory
● iostat / iotop – disk usage
● iptraf - network
● sar – cpu/memory/disk
● strace – immediate debugging, also useful for debugging the QH
● esxtop – VM stats
● tuned – can dynamically tune system
● perf record -p <pid> / perf list / perf top -u nagios
15. How to keep it running well
● Monitor everything...you can never have too much info!
● CPU load and CPU stats (idle/wait/user/system)
● Disk space, inodes free
● All application/system logs (apache, syslog, nagios.log, etc.)
● Hardware status
● Swap / Physical Memory Usage
● Puppet state (state.yaml)
● Apache stats (if you have a GUI/API)
● Network performance and stats (errors, throughput, etc.)
● NTP time and drift (more important on VMs)
17. Known Issues (and complaints)
● The default number of workers easily overloads smaller (1-2 core) systems
● No remote workers (yet)
● Still have to restart to add new hosts/services
● No REST API natively
● Livestatus (or similar) not native
18. Questions?
● Daniel.Wittenberg@ipsoft.com
● dwittenberg2008@gmail.com
● @dwittenberg2008
● www.linkedin.com/in/dwittenberg
● nagios and nagios-devel IRC
● Nagios Users and Devel mailing lists
● Always looking to hire new people so contact me!