Scaling Nagios 4
Daniel Wittenberg
daniel.wittenberg@ipsoft.com
About Me
● Unix/Linux admin since the mid-'90s
● Nagios/NetSaint user since the early 2000s
● Owned and operated a consulting business for almost 10 years that provided distributed monitoring using Nagios
● Previously employed by a Fortune 50 insurance company
● Currently Monitoring Platform Manager at IPsoft Inc.
About IPsoft
● Provider of Remote Infrastructure Management and automation services
● ITIL and Six Sigma compliance management framework
● Automation that resolves 56% of all incidents, and 90% of L1 incidents
● Monitoring, Automation, Event Correlation, Management...
● Offices in ten countries around the world
● http://www.ipsoft.com
Last year...
My Configuration
● ~700 Nagios Servers
● ~130,000 Monitored Devices
● ~3,000,000 Service Checks
● Mix of customized Nagios 3.2.3 and 4.0.0
● Scientific Linux 6.2/6.4
● Managed by Puppet 3.x
● 2/3 on VMware ESX; the rest are bare metal
● Adding new Nagios servers almost daily
What's different with Nagios 4
SPEED!
● Current testing shows 4.0.0 handling roughly five times the service checks of 3.2.3 on average
What's different with Nagios 4
Some things that would impact performance/stability
http://nagios.sourceforge.net/docs/nagioscore/4/en/whatsnew.html
● Embedded Perl – Gone
● external_command_buffer_slots - Gone
● The -x option (skip circular path verification) is no longer needed in rc scripts
● Configuration verification algorithm changes: massive startup speed increase
● Event queue algorithm changes help with CPU utilization (see Andreas' 2012 presentation)
● Disk I/O reduced to virtually 0
● NEW query handler interface, better communication with core
● NEW core workers – reduces I/O, memory, CPU
● Completely rewritten spec file for better installs and debug modes
Perf Testing Lab Setup
● Servers are all ESX 5 based VMs on the same cluster
● Variable CPU cores, 4 GB memory
● Metrics used to consider a test failure:
  ● CPU Block Queue > 3
  ● CPU I/O Wait > 3
  ● CPU Idle < 10%
  ● Service Check Latency > 1s
  ● Host Check Latency > 1s
● 30-minute run time; a failure rate above 3% failed the test
● Fully automated, increasing workload; consistent results
● Each step adds 1 host + 1 service check, aiming for “best case” numbers without check latency
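The failure criteria above can be written down as a small check. This is a minimal sketch, not the harness used for these tests: the `Sample` fields, the function names, and the reading of "> 3% failure rate" as the share of samples that breach any threshold are all assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Sample:
    """One measurement taken during a test run (field names are hypothetical)."""
    block_queue: float        # CPU block queue
    io_wait: float            # CPU I/O wait
    cpu_idle_pct: float       # CPU idle, in percent
    service_latency_s: float  # service check latency, in seconds
    host_latency_s: float     # host check latency, in seconds


def sample_fails(s: Sample) -> bool:
    """True if the sample breaches any of the thresholds listed on the slide."""
    return (
        s.block_queue > 3
        or s.io_wait > 3
        or s.cpu_idle_pct < 10
        or s.service_latency_s > 1
        or s.host_latency_s > 1
    )


def run_fails(samples: list[Sample], max_failure_rate: float = 0.03) -> bool:
    """True if more than 3% of the samples failed (assumed interpretation)."""
    if not samples:
        raise ValueError("no samples collected")
    failures = sum(sample_fails(s) for s in samples)
    return failures / len(samples) > max_failure_rate
```

A load generator would collect one `Sample` per polling interval over the 30-minute run and stop increasing the workload once `run_fails` returns true.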
Test Lab Architecture
Test Results
CPU Cores   Service Checks     Service Checks        Difference
            (Version 3.2.3)    (Version 4.0.0rc1)
1           1700               10500                 617%
2           3300               20800                 630%
4           6500               35300                 543%
8           11700              45100                 385%
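The "Difference" column appears to be the 4.0.0rc1 figure expressed as a percentage of the 3.2.3 figure, truncated to a whole number. The short sketch below reproduces it from the table values; the variable and function names are illustrative only.

```python
# Service checks sustained per version, keyed by CPU core count (from the table).
results = {
    1: (1700, 10500),
    2: (3300, 20800),
    4: (6500, 35300),
    8: (11700, 45100),
}


def ratio_percent(v3: int, v4: int) -> int:
    """Nagios 4 throughput as a percentage of Nagios 3 throughput, truncated."""
    return v4 * 100 // v3


for cores, (v3, v4) in results.items():
    print(f"{cores} cores: {ratio_percent(v3, v4)}%")
# prints 617%, 630%, 543%, 385% for 1, 2, 4, 8 cores
```

The ratio shrinks at 8 cores, so the gain from adding cores is not linear in these results.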
Other software used
● Customized livestatus based on Andreas' updates for Nagios 4
  ● https://github.com/ageric/livestatus
● Developing custom “single pane” interface to replace CGI/Check_mk Multisite
● Developing full REST API to talk to QH, livestatus and config files
● nagios-qh.rb Query Handler interface to gather loadctl metrics
  ● https://www.dropbox.com/s/h6zn0ecycqb1xrc/nagios-qh.rb
● Custom load control daemon that talks to QH
● Custom Event Broker to send perf data directly to ActiveMQ for post-processing
● Custom agent, like NRPE on steroids without limitations like buffer size
Other performance tweaks
● Sysctl changes
  ● net.ipv4.tcp_fin_timeout
  ● net.ipv4.tcp_keepalive_probes
  ● net.ipv4.tcp_tw_recycle
  ● net.ipv4.tcp_tw_reuse
● RAMDISK is no longer needed, but is still in the default sysconfig/RC script for now
● Keep logging levels as low as possible
● Disable CGIs whenever possible
● Disable environment macros
● Don't use resource macros when you don't need to; they are not cached
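Before changing any of the sysctl keys above, it helps to record what a server currently uses. The sketch below reads values through /proc/sys, where the kernel exposes each dotted key as a file path. The helper names are hypothetical, the slide gives no target values so none are set here, and a key that the running kernel does not expose simply reads as `None`.

```python
from pathlib import Path
from typing import Optional

PROC_SYS = Path("/proc/sys")


def sysctl_path(name: str, root: Path = PROC_SYS) -> Path:
    """Map a dotted sysctl name to its file under /proc/sys."""
    return root.joinpath(*name.split("."))


def read_sysctl(name: str, root: Path = PROC_SYS) -> Optional[str]:
    """Return the current value, or None if this kernel does not expose the key."""
    try:
        return sysctl_path(name, root).read_text().strip()
    except OSError:
        return None


if __name__ == "__main__":
    for key in ("net.ipv4.tcp_fin_timeout", "net.ipv4.tcp_tw_reuse"):
        print(key, "=", read_sysctl(key))
```

Running this from a configuration management fact or a Nagios check makes drift between servers visible.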
Other performance tweaks
● /etc/security/limits.d/nagios.conf
  ● ipmon soft nofile 131072
  ● ipmon hard nofile 131072
  ● ipmon soft nproc 131072
  ● ipmon hard nproc 131072
● Nearly disable the OOM killer for the nagios process, so it is killed last
  ● echo '-16' > /proc/<nagios pid>/oom_adj
● Re-nice puppet to run at 10 so it has less impact
  ● /etc/sysconfig/puppet – NICELEVEL=10
  ● This should apply to any other running services that might take resources
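Limits set in limits.d only take effect for new sessions, so it is worth confirming that the running process actually received them. The following is a minimal sketch using Python's standard `resource` module; the 131072 target comes from the entries above, while the function names and report layout are assumptions for illustration.

```python
import resource

TARGET = 131072  # value from the limits.d entries above


def limit_ok(value: int, target: int = TARGET) -> bool:
    """True if a limit is unlimited or at least the target."""
    return value == resource.RLIM_INFINITY or value >= target


def check_limits() -> dict:
    """Report soft/hard nofile and nproc limits for the current process."""
    report = {}
    for label, res in (("nofile", resource.RLIMIT_NOFILE),
                       ("nproc", resource.RLIMIT_NPROC)):
        soft, hard = resource.getrlimit(res)
        report[label] = {
            "soft": soft,
            "hard": hard,
            "ok": limit_ok(soft) and limit_ok(hard),
        }
    return report
```

The result only describes the process that runs the check, so it should be started the same way and as the same user as the Nagios daemon.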
Common Perf Tools
● vmstat / top – cpu/memory
● iostat / iotop – disk usage
● iptraf - network
● sar – cpu/memory/disk
● strace – immediate debugging, also debugging QA
● esxtop – VM stats
● tuned – can dynamically tune system
● perf record -p <pid> / perf list / perf top -u nagios
How to keep it running well
● Monitor everything...you can never have too much info!
● CPU load and CPU stats (idle/wait/user/system)
● Disk space, inodes free
● All application/system logs (apache, syslog, nagios.log, etc.)
● Hardware status
● Swap / Physical Memory Usage
● Puppet state (state.yaml)
● Apache stats (if you have a GUI/API)
● Network performance and stats (errors, throughput, etc.)
● NTP time and drift (more important on VMs)
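As one example of the items above, disk space and free inodes can be covered by a single small plugin. This is a sketch only: the 20%/10% thresholds and the output wording are hypothetical, and the exit codes follow the usual Nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL).

```python
from __future__ import annotations

import os

OK, WARNING, CRITICAL = 0, 1, 2  # standard Nagios plugin exit codes


def free_percentages(path: str) -> tuple[float, float]:
    """Return (free space %, free inodes %) for the filesystem holding path."""
    st = os.statvfs(path)
    space = 100.0 * st.f_bavail / st.f_blocks if st.f_blocks else 100.0
    # Some filesystems report zero total inodes; treat that as "no inode limit".
    inodes = 100.0 * st.f_favail / st.f_files if st.f_files else 100.0
    return space, inodes


def status(free_pct: float, warn: float = 20.0, crit: float = 10.0) -> int:
    """Map a free percentage to a plugin exit code (thresholds are examples)."""
    if free_pct < crit:
        return CRITICAL
    if free_pct < warn:
        return WARNING
    return OK


def main(path: str = "/") -> int:
    """Print one status line with perf data and return the worse of the two states."""
    space, inodes = free_percentages(path)
    code = max(status(space), status(inodes))
    label = ("OK", "WARNING", "CRITICAL")[code]
    print(f"DISK {label} - {path}: {space:.1f}% space free, {inodes:.1f}% inodes free"
          f" | space_free_pct={space:.1f} inodes_free_pct={inodes:.1f}")
    return code
```

A script entry point would pass the return value of `main(path)` to `sys.exit` so Nagios sees the state as the exit code.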
Our Platform Architecture (simplified)
Known Issues (and complaints)
● Smaller (1-2 core) systems are easily overloaded by the number of workers
● No remote workers (yet)
● Still have to restart to add new hosts/services
● No REST API natively
● Livestatus (or similar) not native
Questions?
● Daniel.Wittenberg@ipsoft.com
● dwittenberg2008@gmail.com
● @dwittenberg2008
● www.linkedin.com/in/dwittenberg
● nagios and nagios-devel IRC
● Nagios Users and Devel mailing lists
● Always looking to hire new people so contact me!

Nagios Conference 2013 - Daniel Wittenberg - Scaling Nagios Core 4
