Nagios, the monitoring tool that cried wolf
rene — Wed, 11/04/2009 - 09:49
I'm finding the Nagios check_load check slightly annoying and noisy during short bursts of high load on servers, such as at 4am when the /etc/cron.daily jobs usually run. There are also other checks where I don't care if they go into a WARNING state. Sending out too many alerts can be detrimental to a server monitoring system, as the poor person who gets the notifications will eventually consider Nagios to be crying wolf.
I need to know when servers hit a high load, though I only want to be notified if it's a sustained period of high load. I also need to know when something is CRITICAL, though I don't really care if it's in a WARNING state. If I did care about a WARNING state, I would probably reconfigure the check so that the threshold raises CRITICAL instead.
Anyhow, my Nagios configs are below.
For a simple load check I set max_check_attempts to a higher than usual value. This setting determines how many times Nagios will check the service once it returns an error. While those rechecks are in progress the service is in a SOFT state and no notification is sent. If the number of attempts reaches max_check_attempts and the service is still in error, Nagios changes the service to a HARD state, typically logging a CRITICAL alert. The default value of max_check_attempts is 5. It's also important to note that the interval between the rechecks is determined by retry_check_interval, which defaults to 1 (minute).
In this example, when the load average reaches 8 and stays there, Nagios will check the service up to 20 times, one minute apart. If the load is still at 8 or higher on the 20th check, a CRITICAL alert is sent. Setting notification_interval to 0 means the alert is sent once rather than repeated.
define service {
    use                    generic-service
    hostgroup_name         webservers
    service_description    check load
    check_command          check_nrpe!check_load!5,5,5 8,8,8
    max_check_attempts     20
    notification_interval  0
}
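To make the timing explicit, here is a variant of the same definition (a replacement, not an extra service) with the retry interval spelled out. It is only a sketch: retry_check_interval 1 is the value I'm assuming is otherwise inherited from generic-service, and the arithmetic in the comments assumes the stock interval_length of 60 seconds.

define service {
    use                    generic-service
    hostgroup_name         webservers
    service_description    check load
    check_command          check_nrpe!check_load!5,5,5 8,8,8
    # the first failed check counts as attempt 1 (SOFT); up to 19 rechecks follow
    max_check_attempts     20
    # minutes between rechecks while the service is in a SOFT state
    retry_check_interval   1
    # 19 rechecks x 1 minute = roughly 19 minutes of sustained load before the HARD state
    notification_interval  0    ; notify once, do not repeat
}

Halving max_check_attempts would roughly halve the wait, so the value is really a choice of how long a load spike is allowed to last before anyone is woken up.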
To have only CRITICAL alerts sent to a contact I use the following contact configuration. The only real deviation from a standard configuration is service_notification_options, which determines which service states (warning, unknown, critical and recovery) the contact is notified about. I've removed w (warning) and u (unknown) as I only care about c (critical) and r (recovery) alerts.
define contact{
    contact_name                   rene
    alias                          Rene Cunningham
    service_notification_period    24x7
    host_notification_period       24x7
    service_notification_options   c,r
    host_notification_options      d,r
    service_notification_commands  notify-service-by-email
    host_notification_commands     notify-host-by-email
    email                          rene@rene.bz
}
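The same kind of filter also exists per service. As a sketch, and not something taken from my running config: if only the load check should stay quiet about warnings while other services still send them, notification_options can be set on the service instead. A notification has to pass both the service's notification_options and the contact's service_notification_options, so either side can drop WARNING.

define service {
    use                    generic-service
    hostgroup_name         webservers
    service_description    check load
    check_command          check_nrpe!check_load!5,5,5 8,8,8
    max_check_attempts     20
    notification_interval  0
    # only notify on CRITICAL and recovery for this one service
    notification_options   c,r
}

Filtering on the contact suits a person who never wants warnings. Filtering on the service suits a check that is noisy for everyone.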