
Nagios, the monitoring tool that cried wolf

rene — Wed, 11/04/2009 - 09:49

I'm finding the Nagios check_load check slightly annoying and noisy during short bursts of high load on servers, such as at 4am when the /etc/cron.daily jobs usually run. There are also other checks I run where I don't care if they go into a WARNING state. Sending too many alerts can be detrimental to a server monitoring system, as the poor person who gets the notifications will eventually decide that Nagios is crying wolf.

I need to know when servers hit a high load, though I only want to be notified if it's a sustained period of high load. I also need to know when something is CRITICAL, though I don't really care if it's in a WARNING state. If I did care about a WARNING state, I would probably configure the check's thresholds so that the condition was CRITICAL instead.

Anyhow, my Nagios configs are below.

For a simple load check I set max_check_attempts, which determines how many times Nagios will check the service once it starts returning an error, to a higher than usual value. While those rechecks are in progress the problem is in a SOFT state and no notification is sent. If the number of check attempts reaches max_check_attempts and the service is still in error, Nagios changes the service to a HARD state, typically logging a CRITICAL alert. The default value of max_check_attempts is 5. It's also important to note that the time between the rechecks is determined by retry_check_interval, which defaults to 1 (minute).

In this example, when the load average reaches 8, Nagios rechecks the service once a minute until it has made 20 attempts. If the load is still at 8 or higher on the last attempt, a CRITICAL alert is sent.

define service {
        use                             generic-service
        hostgroup_name                  webservers
        service_description             check load
        check_command                   check_nrpe!check_load!5,5,5 8,8,8
        max_check_attempts              20      ; attempts before the state goes HARD
        notification_interval           0       ; notify once, do not re-notify
}
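
To make the timing concrete, here is a rough Python sketch of the retry logic described above. It is a simplified model I put together for illustration, not how Nagios itself is implemented, and the function names are made up. It assumes the first failed check counts as attempt 1 and each recheck happens one retry interval later.

```python
# Simplified model of Nagios SOFT/HARD states for a single service.
# Assumption: illustrative only, these helpers are not part of Nagios.

def attempts_until_hard(results, max_check_attempts):
    """Return the 1-based index of the check that turns the problem HARD.

    `results` is a sequence of booleans, True meaning the check was OK.
    Returns None if the service recovers (or results run out) first.
    """
    failed_in_a_row = 0
    for index, ok in enumerate(results, start=1):
        if ok:
            failed_in_a_row = 0  # a recovery resets the attempt counter
            continue
        failed_in_a_row += 1
        if failed_in_a_row >= max_check_attempts:
            return index
    return None


def minutes_before_notification(max_check_attempts, retry_interval_minutes=1):
    """Minutes between the first failed check and the HARD state."""
    # The first failure is attempt 1, so only the rechecks add waiting time.
    return (max_check_attempts - 1) * retry_interval_minutes


# A 10 minute burst of high load (one check per minute), then back to normal.
cron_burst = [False] * 10 + [True] * 10

print(attempts_until_hard(cron_burst, 5))    # goes HARD during the burst
print(attempts_until_hard(cron_burst, 20))   # recovers before going HARD
print(minutes_before_notification(20))       # sustained load needed, in minutes
```

Under these assumptions, a max_check_attempts of 20 means the load has to stay high for roughly 19 minutes after the first failed check before anyone is notified, which is long enough to ride out a typical cron.daily run.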

To have only CRITICAL alerts sent to a contact, I use the following contact configuration. The only real deviation from a standard configuration is service_notification_options, which determines which levels (warning, critical, unknown and recovery) generate notifications. I've removed w (warning) and u (unknown) as I only care about c (critical) and r (recovery) alerts.

define contact {
        contact_name                     rene
        alias                            Rene Cunningham
        service_notification_period      24x7
        host_notification_period         24x7
        service_notification_options     c,r
        host_notification_options        d,r
        service_notification_commands    notify-service-by-email
        host_notification_commands       notify-host-by-email
        email                            rene@rene.bz
}
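
The effect of service_notification_options can be pictured as a simple filter. The Python below is only a sketch of that idea, assuming the four flags mentioned above; should_notify is a hypothetical helper, not something Nagios provides.

```python
# Sketch of how a contact's service_notification_options act as a filter.
# Assumption: only the w, u, c and r flags discussed in this post are modelled.

EVENT_FLAGS = {
    "WARNING": "w",
    "UNKNOWN": "u",
    "CRITICAL": "c",
    "RECOVERY": "r",
}


def should_notify(event, options):
    """Return True if a contact with these options is notified of `event`."""
    enabled = {flag.strip() for flag in options.split(",")}
    return EVENT_FLAGS[event] in enabled


for event in ("WARNING", "UNKNOWN", "CRITICAL", "RECOVERY"):
    print(event, should_notify(event, "c,r"))
```

With c,r only the CRITICAL and RECOVERY events get through, so a service sitting in WARNING never reaches my inbox.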
