Monitor it with nagios, fix it with cfengine.

We started doing this toward the end of last year. We wanted nagios and cfengine to cooperate. We didn’t want cfengine to monitor whether processes were running, because that’s nagios’ job, and we didn’t want nagios to fix problems as they occur, because that’s cfengine’s job. So we found out that cfrun could help us out here and give us a simple integration. Here’s how we did it.

In this particular scenario, we had segfaults in our apache logs caused by some PHP errors we couldn’t fix. Apache ended up spawning a lot of child processes and leaving lots of defunct ones, so we had to restart apache every now and then.

So first we configured a check in our services config file in nagios. Something like this:

services.cfg :

define service{
    use                     generic-service
    host_name               webserver.somedomain.com
    service_description     CHECK_LOG_SEGFAULT
    is_volatile             0
    max_check_attempts      1
    normal_check_interval   5
    retry_check_interval    1
    contact_groups          admins
    notification_period     24x7
    notification_options    c,w,r
    process_perf_data       1
    check_command           check_log_segfault
    event_handler           restart-apache
}

Now configure the commands. One command for the event handler; name it “restart-apache”, which is what the “event_handler” option in the example above refers to. And one command for the log check, “check_log_segfault”, which is what the “check_command” option refers to.

commands.cfg :

define command{
    command_name    restart-apache
    command_line    /usr/bin/sudo /usr/sbin/cfrun $HOSTNAME$ -T -- -q -D restart_apache2_now
}
define command{
    command_name    check_log_segfault
    command_line    $USER1$/check_by_ssh -l root -t 30 -H $HOSTADDRESS$ -C "/usr/lib/nagios/plugins/check_log -F /var/log/apache2/error.log -O /var/log/apache2/check_log_oldlog -q Segmentation"
}

(The check_log plugin runs on each monitored host that needs it, here over ssh, but you could for instance call it via net-snmp’s EXEC function if you don’t want to use ssh. NRPE is probably also an alternative.)
Be sure to enable event handlers in nagios.cfg for this to work.

nagios.cfg :

enable_event_handlers=1
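
One thing to watch: nagios runs an event handler on every state change of the service, including the recovery back to OK, so the restart-apache command as written can also trigger a restart when the check recovers. A minimal sketch of a wrapper script that only calls cfrun on a confirmed CRITICAL state could look like this. The script name and the should_restart helper are made up for illustration; $SERVICESTATE$, $SERVICESTATETYPE$ and $HOSTNAME$ are standard nagios macros.

```shell
#!/bin/sh
# Hypothetical wrapper script, e.g. /usr/local/bin/restart-apache-handler
# (name and path are assumptions, not part of the original setup).
# Usage: restart-apache-handler <SERVICESTATE> <SERVICESTATETYPE> <HOSTNAME>

# Return 0 (true) only for a confirmed (HARD) CRITICAL state.
# OK, WARNING, UNKNOWN and SOFT states are ignored.
should_restart() {
    state="$1"
    state_type="$2"
    [ "$state" = "CRITICAL" ] && [ "$state_type" = "HARD" ]
}

# Only act when nagios supplied all three arguments.
if [ "$#" -eq 3 ]; then
    if should_restart "$1" "$2"; then
        # Same cfrun invocation as in commands.cfg above.
        /usr/bin/sudo /usr/sbin/cfrun "$3" -T -- -q -D restart_apache2_now
    fi
fi
```

To use it, the command_line of restart-apache would become something like /usr/local/bin/restart-apache-handler $SERVICESTATE$ $SERVICESTATETYPE$ $HOSTNAME$. With max_check_attempts set to 1 as in the service above, the first CRITICAL result already counts as a HARD state.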

That’s what’s needed for nagios. Let’s conf some cfengine.

In the nagios config you can see that cfrun defines the cfengine class “restart_apache2_now” on the host, so let’s create a cfengine config that acts on a class with that name.

cf.apache2 :

###############################################################
control:
    actionsequence = ( packages shellcommands )
    AddInstallable = ( has_apache2 )
    IfElapsed = ( 0 )
###############################################################
classes:
###############################################################
packages:
    debian::
        apache2
            pkgmgr=dpkg
            define=has_apache2
###############################################################
shellcommands:
    # apache2 initscript
    # Usage: /etc/init.d/apache2 {start|stop|restart|reload|force-reload}
    debian.has_apache2.restart_apache2_now::
        "/etc/init.d/apache2 restart"

Be sure to include this file (cf.apache2) in your main cfengine config so that cfengine knows about it.

So now nagios monitors the logfile, checks for segfault messages and tells cfengine to restart apache if a segfault is found. (The nagios plugin check_log takes care of comparing new and old segfault messages, so that’s nothing to worry about.) Everyone is happy and we (the sysadmins) don’t have to do shit. Just the way we want it.
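
For the curious, the old/new comparison boils down to keeping a copy of the log from the previous run and only counting matches in lines that were added since. The function below is a simplified sketch of that idea, not the plugin’s actual code; the name count_new_matches is made up for illustration.

```shell
#!/bin/sh
# Simplified sketch of the "compare against the old log" idea.
# count_new_matches is a hypothetical helper, not part of check_log.
# Usage: count_new_matches LOGFILE OLDLOG PATTERN
count_new_matches() {
    logfile="$1"
    oldlog="$2"
    pattern="$3"

    # First run: no saved copy yet, so just initialize it and report nothing.
    if [ ! -e "$oldlog" ]; then
        cp "$logfile" "$oldlog"
        echo 0
        return 0
    fi

    # Lines starting with ">" in the diff exist only in the current log,
    # i.e. they were written since the previous run.
    count=$(diff "$oldlog" "$logfile" | grep '^>' | grep -c -- "$pattern" || true)

    # Save the current log as the baseline for the next run.
    cp "$logfile" "$oldlog"
    echo "$count"
}
```

Calling count_new_matches /var/log/apache2/error.log /var/log/apache2/check_log_oldlog Segmentation twice in a row without new log entries prints 0 the second time, which is why an old segfault does not keep the service CRITICAL forever.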
