Ideas of CMDB, cfengine and nagios integration

For a while we’ve been discussing how we can become as lazy as possible when it comes to system administration, and this time we’ve made quite a neat integration between our homemade CMDB, cfengine and nagios.

Here’s the idea:

First of all, nobody likes to manually update a CMDB. Also, it’s never really possible to maintain it by hand in a way that keeps its info from becoming obsolete after some time. This is why we made a script, cfcmdb, that is triggered from cfengine on every host. This script fills the CMDB database with all sorts of info from tools like dmidecode and from standard command-line tools (memory, network cards, OS version, CPU, vendor, etc.). So now our CMDB pretty much keeps itself up to date.

Lately we came up with the idea to fill our CMDB with cfengine classes information. So adding to the cfcmdb script mentioned above :

cfagent --no-splay -p -v | grep Defined

...and a little perl split and join, we now have all the classes in our CMDB bound to host IDs.
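The split step can be tried without perl as well. Here is a minimal shell sketch, assuming the grep above yields a line of the form "Defined Classes = ( a b c )". That format, the sample line and the function name extract_classes are assumptions for illustration, not taken from our cfcmdb script.

```shell
#!/bin/sh
# Hypothetical sample of the line the grep is assumed to return.
sample_line='Defined Classes = ( any debian linux class_apache )'

# Print the class names from the parenthesised list, one per line.
extract_classes() {
    sed -n 's/.*Defined Classes = ( *\(.*\) *)$/\1/p' \
        | tr -s ' ' '\n' \
        | sed '/^$/d'
}

printf '%s\n' "$sample_line" | extract_classes
```

With one class per line, each name can then be inserted into the CMDB next to the host ID.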

Cool. On our nagios-server, we made another script, cmdb2nagios, which takes the parameters “hosts”, “hostgroups” or “services”.

cmdb2nagios hosts : creates the nagios host-config file

cmdb2nagios hostgroups : creates the nagios hostgroup-config file

cmdb2nagios services : creates the servicefile

The services parsing is quite nice now, because we can automatically monitor any service set up with cfengine. Let’s say we have a bunch of hosts installed with cfengine, and cfengine tells them to have apache2 running. That means this will be part of a cfengine class that is available in our CMDB.

Example of cmdb2nagios service parsing :

[snip]

$sql = "select hosts.name from hosts,classes where classes.name = 'class_apache' and hosts.hostid = classes.hostid";
$execute = $connect->query($sql) or die "wtf? it didnt work ...check syntax.";
my @servicehosts;
while (@results = $execute->fetchrow()) {
push(@servicehosts, $results[0]);
}

$hosts = join(",",@servicehosts);
print "define service{\n";
print "\tuse\t\t\tgeneric-service\n";
print "\thost_name\t\t" . $hosts . "\n";
print "\tservice_description\tcfg_CHECK_APACHE\n";
print "\tis_volatile\t\t0\n";
print "\tmax_check_attempts\t1\n";
print "\tnormal_check_interval\t5\n";
print "\tretry_check_interval\t1\n";
print "\tcontact_groups\t\tlinux-admins\n";
print "\tnotification_period\t24x7\n";
print "\tnotification_options\tc,w,r\n";
print "\tprocess_perf_data\t1\n";
print "\tcheck_command\t\tcheck_apache\n";
print "\t}\n\n";
[snip]

As you can see, monitoring apache will be applied to all hosts running apache.
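The same generation step can be sketched in shell, which is handy for trying out the output format without a database. This is only an illustration under assumptions: the function name render_service is made up, the host names are fed in on stdin instead of coming from the CMDB query, and the directives left out are assumed to be inherited from the generic-service template.

```shell
#!/bin/sh
# Hypothetical helper: read host names on stdin (one per line) and
# print a nagios service definition for them.
render_service() {
    description=$1
    check_command=$2
    # Join the host names with commas, as host_name expects.
    hosts=$(paste -sd, -)
    # With no hosts, print nothing instead of a service with an empty host_name.
    [ -n "$hosts" ] || return 0
    printf 'define service{\n'
    printf '\tuse\t\t\tgeneric-service\n'
    printf '\thost_name\t\t%s\n' "$hosts"
    printf '\tservice_description\t%s\n' "$description"
    printf '\tcheck_command\t\t%s\n' "$check_command"
    printf '\t}\n\n'
}

printf 'web1\nweb2\n' | render_service cfg_CHECK_APACHE check_apache
```

The empty-list guard matters when no host carries the class: a service definition without hosts would probably make nagios refuse the generated config.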

This leaves us to really only having to maintain our cfengine configuration, while the CMDB is auto-updated and the nagios-config is auto-parsed.

Also, our event handlers in nagios tell cfengine to do this and that, so now we can sit back, enjoy a coffee and watch the show.

(see previous post about eventhandlers and cfengine : http://www.sladder.org/?p=261)


nagios-plugins and check_http version 1.4.12

Yes, it’s broken! I found out today after moving to a new nagios-server. My old http checks did not work; they only resulted in WARNINGs. It seems to be a redirect problem in check_http, and it is fixed in the latest version.
I did not have time to go through all the plugins this evening (Friday and all), so I simply downloaded the latest tar.gz plugin pack from nagios.org, extracted it, ran “./configure” and then “make all”. After that I just moved check_http from the freshly compiled plugin folder to my nagios plugins folder. Problem solved. 1.4.13 works fine.


A quick note about the Nagios check_procs plugin

I had an issue today where check_procs returned “0 processes with command name ‘some-process-name’ “. When I did a ps -ef | grep some-process-name I could see the process running, so I was like “wtf?!”, but then a colleague told me he had seen this before. It seems check_procs uses something like ps -cA to check for running processes, and with those parameters you only see part of the process name. So while it wouldn’t match “some-process-name”, it would match “some-process-n”. If you wonder why it returns 0 processes, do a ps -cA to find the right process name to match in your config.
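To guess in advance what the truncated name will look like, something like this little sketch can help. It assumes the short command name is cut at 15 characters, which is the usual limit on Linux but may differ on your system, so the ps -cA output stays the final word. The function name short_name is made up for the example.

```shell
#!/bin/sh
# Hypothetical helper: show the name as it would appear if the short
# command name is limited to 15 characters (assumed limit, check with ps -cA).
short_name() {
    printf '%s\n' "$1" | cut -c1-15
}

short_name some-long-process-name
```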


Monitoring web apps using html forms for logging in

Several times I’ve wanted to monitor whether login actually works on web applications that use html forms for user validation. Today I made a simple shell nagios plugin to do that. It uses curl, which supports POST variables, and checks its output.
The plugin also records the unixtime before and after the curl command runs, then uses expr to find the difference and reports it as nagios performance data.

Feel free to copy paste the code and use it for your own purpose.

#!/bin/sh
#Nagios exit codes: OK=0, WARNING=1, CRITICAL=2, UNKNOWN=3
#This is what we get after a successful login
MATCH="bladiblah successful login bladiblah"
#delete the tmp file before writing a new one (-f: no error if it is missing)
rm -f /tmp/login_check
#Grab the unixtime before the command runs
BEFORE=`date +%s`
#Login to the app
/usr/bin/curl -H "host: example.com" -F mail="mail@example.com" -F userPassword="fail" http://111.111.111.111 -o /tmp/login_check > /dev/null 2>&1
#Grab the unixtime after the command has been run
AFTER=`date +%s`
DIFF=`/usr/bin/expr $AFTER - $BEFORE`
#Check if the tmp file contains our successful login text
if grep -q "$MATCH" /tmp/login_check 2>/dev/null; then
#WIN!!
echo "OK. Login successful. | response=$DIFF"
exit 0
else
#FAIL!!
echo "CRITICAL. Login failed."
exit 2
fi

And as always, sorry about the wordpress formatting of the code.
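Since the plugin already measures the response time, one possible extension is to alert on slow logins too. The sketch below is not part of the plugin above: the function name login_status and the thresholds are made up, and it only shows how the elapsed seconds could be turned into a status line with performance data in the usual label=value;warn;crit form.

```shell
#!/bin/sh
# Hypothetical helper: turn elapsed seconds plus warning and critical
# thresholds into a nagios status line and return code.
login_status() {
    elapsed=$1
    warn=$2
    crit=$3
    perf="response=${elapsed}s;${warn};${crit}"
    if [ "$elapsed" -ge "$crit" ]; then
        echo "CRITICAL. Login took ${elapsed}s. | $perf"
        return 2
    elif [ "$elapsed" -ge "$warn" ]; then
        echo "WARNING. Login took ${elapsed}s. | $perf"
        return 1
    fi
    echo "OK. Login successful. | $perf"
    return 0
}

# Illustrative values: 3 seconds elapsed, warn at 5, critical at 10.
login_status 3 5 10
```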


Monitoring Dell hardware in Nagios on the Debian Etch 64bit Platform

Getting Dell’s linux-software to work on other platforms than Redhat can be a bitch, but I was lucky to come across this article. Following those steps I had dellomsa up and running a couple of minutes later.
Then I went to nagiosexchange and downloaded the perl script check_dell_sensors.pl.

In this case I use check_by_ssh to run the check on each host, so here’s the command.cfg setup

define command{
command_name    check_dell_sensors
command_line    $USER1$/check_by_ssh -l someuser -t 30 -H $HOSTADDRESS$ -C "/usr/lib/nagios/plugins/check_dell_sensors.pl"
}

and finally the services.cfg


define service{
use                             generic-service
hostgroup_name                  cfg_CHECK_DELL_SENSORS
service_description             CHECK_DELL_SENSORS
is_volatile                     0
max_check_attempts              3
normal_check_interval           5
retry_check_interval            1
contact_groups                  linux-admins
notification_period             24x7
notification_options            c,w,r
check_command                   check_dell_sensors
}

This will monitor your dell hardware on all hosts in your cfg_CHECK_DELL_SENSORS hostgroup, giving you an output like :
/usr/lib/nagios/plugins/check_dell_sensors.pl
OK -- Hardware Log=Ok; Memory=Ok; Power Supplies=Ok; Processors=Ok; Temperatures=Ok; Voltages=Ok;


Monitor it with nagios, fix it with cfengine.

This is something we started doing towards the end of last year. We wanted nagios and cfengine to cooperate. We didn’t want cfengine to monitor whether processes were running, because that’s nagios’ job, and we didn’t want nagios to fix problems as they occur, because that’s cfengine’s job. So we found out that cfrun could help us make a simple integration. Here’s how we did it.

In this particular scenario, we had some segfaults in our apache logs caused by PHP errors we couldn’t fix. Apache ended up spawning a lot of child processes and leaving lots of defunct ones, so we had to restart apache every now and then.

So first we configured a check in our services config file in nagios. Something like this :

services.cfg

define service{
use generic-service
host_name webserver.somedomain.com
service_description CHECK_LOG_SEGFAULT
is_volatile 0
max_check_attempts 1
normal_check_interval 5
retry_check_interval 1
contact_groups admins
notification_period 24x7
notification_options c,w,r
process_perf_data 1
check_command check_log_segfault
event_handler restart-apache
}

Now configure the commands. One command is for the event handler; name it “restart-apache”, which is what the “event_handler” option in the example above refers to. The other is for the log check, “check_log_segfault”, which is what the “check_command” option refers to.

commands.cfg :

define command{
command_name restart-apache
command_line /usr/bin/sudo /usr/sbin/cfrun $HOSTNAME$ -T -- -q -D restart_apache2_now
}
define command{
command_name check_log_segfault
command_line $USER1$/check_by_ssh -l root -t 30 -H $HOSTADDRESS$ -C "/usr/lib/nagios/plugins/check_log -F /var/log/apache2/error.log -O /var/log/apache2/check_log_oldlog -q Segmentation"
}

(The check_log command is run on every host that needs it over ssh, but you could for instance call it via net-snmp’s EXEC function if you don’t want to use ssh. NRPE is probably also an alternative.)
Be sure to enable eventhandlers in nagios.cfg for this to work.

nagios.cfg :

enable_event_handlers=1

That’s what’s needed for nagios. Let’s conf some cfengine.

In the nagios config you can see we’re running the cfengine class “restart_apache2_now”, so let’s create a cfengine class with the same name.

cf.apache2 :

###############################################################
control:
actionsequence = ( packages shellcommands )
AddInstallable = ( has_apache2 )
IfElapsed = ( 0 )
###############################################################
classes:
###############################################################
packages:
debian::
apache2
pkgmgr=dpkg
define=has_apache2
################################################################
shellcommands:
# apache2 initscript
# Usage: /etc/init.d/apache2 {start|stop|restart|reload|force-reload}
debian.has_apache2.restart_apache2_now::
"/etc/init.d/apache2 restart"

Be sure to include this class in your cfengine config so that cfengine knows about it.

So now nagios monitors the logfile, checks for segfault messages and tells cfengine to restart apache if a segfault is found. (The nagios plugin check_log takes care of comparing new and old segfault messages, so that’s nothing to worry about.) Everyone is happy and we (the sysadmins) don’t have to do shit. Just the way we want it.
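One refinement worth considering: nagios runs the event handler on every state change, including the recovery back to OK, so the command as configured would probably restart apache once more when the check recovers. A small wrapper that looks at the service state avoids that. This is a sketch, not what we run: the function name handle_event, the CFRUN variable and the idea of passing $SERVICESTATE$ and $HOSTNAME$ as arguments are assumptions for the example.

```shell
#!/bin/sh
# Hypothetical wrapper for the restart-apache event handler.
# Assumed command_line: /path/to/wrapper $SERVICESTATE$ $HOSTNAME$
# CFRUN can be overridden (for example with "echo") to dry-run the sketch.
CFRUN=${CFRUN:-"/usr/bin/sudo /usr/sbin/cfrun"}

handle_event() {
    state=$1
    host=$2
    case "$state" in
        CRITICAL)
            # Segfault found in the log: ask cfengine to restart apache.
            $CFRUN "$host" -T -- -q -D restart_apache2_now
            ;;
        *)
            # OK, WARNING and UNKNOWN: nothing to repair.
            :
            ;;
    esac
}

# Dry run in a subshell: prints the cfrun arguments instead of running cfrun.
(CFRUN=echo; handle_event CRITICAL webserver.somedomain.com)
```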


Monitoring videostreams with Nagios

Recently I wanted to monitor a streaming service via Nagios and started thinking about how it could be done on the command line, so it could easily be monitored without doing silly gui stuff. I came to think of mplayer and its ability to play streams, and so I started playing with it for a bit.

Running the command : mplayer -noframedrop -quiet -dumpstream "http://someurl" -dumpfile "some_local_dumpfile" does everything I need. It streams the video and dumps it to a local file. This means you can assume that the stream is working if your dumped file is the right size. To find the size, simply dump the video once, check its size and use that size as a parameter to the nagios plugin.
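Finding and comparing the size can be tried in plain shell first. A minimal sketch, assuming the dump file already exists; the function names file_size and size_matches and the demo file are made up for the example.

```shell
#!/bin/sh
# Hypothetical helpers for the size check.
file_size() {
    # wc -c pads its output on some systems, so strip the spaces.
    wc -c < "$1" | tr -d ' '
}

size_matches() {
    # Succeeds when file $1 is exactly $2 bytes.
    [ "$(file_size "$1")" -eq "$2" ]
}

# Demo with a throwaway 6-byte file standing in for a dumped stream.
demo=$(mktemp)
printf 'abcdef' > "$demo"
file_size "$demo"
rm -f "$demo"
```

Run file_size once on a known good dump and note the number; that is the value to hand to the plugin as the expected size.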

Here’s the plugin. Feel free to copy & paste it and use it for your own monitoring.
You will of course need to have mplayer installed.

#!/usr/bin/perl -w
# This check uses mplayer to dump a videostream, check the size of it
# and determine if the streaming service is working or not.
# ...uh ...yeah :-)
#morten bekkelund 2008

use strict;
use Getopt::Long;
use File::stat;

my %ERRORS=('OK'=>0,'WARNING'=>1,'CRITICAL'=>2,'UNKNOWN'=>3,'DEPENDENT'=>4);
# Option values filled in by GetOptions below
my ($url, $dump, $expected_size);

sub print_usage {
print "Usage: check_xstream -u <url> -d <dumped stream filename> -s <expected size of dumped file> \n";
print "Example: ./check_xstream -u mms://streamserver/stream -d /tmp/dump -s 4533646 \n";
}

sub help {
print_usage();
}

Getopt::Long::Configure ("bundling");
# Each option takes a value in both its short and long form
GetOptions(
'u|url=s' => \$url,
'd|dump=s' => \$dump,
's|size=i' => \$expected_size
);

if(!$url or !$expected_size or !$dump) {
print_usage();
exit $ERRORS{"UNKNOWN"};
}

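unlink($dump);
my $check_stream=`/usr/bin/mplayer -noframedrop -quiet -dumpstream "$url" -dumpfile "$dump" 2>&1`;
# stat() returns undef if mplayer wrote no dump file, so guard before calling ->size
my $st=stat($dump);
my $file_size=$st ? $st->size : 0;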

if(!$file_size) {
print "UNKNOWN: Cannot find dumped stream.\n";
exit $ERRORS{"UNKNOWN"};
}

if($file_size != $expected_size) {
print "CRITICAL: The size of the stream diffs from the expected size. Streaming doesnt appear to work correctly.\n";
exit $ERRORS{"CRITICAL"};
}

if($file_size == $expected_size) {
print "OK: The size of the stream is correct. Streaming appears to work correctly.\n";
exit $ERRORS{"OK"};
}

print "UNKNOWN: Something really fishy is going on here....\n";
exit $ERRORS{"UNKNOWN"};
# end

You can call the script check_stream or whatever you prefer and put it in your nagios plugins directory.

Configure commands.cfg for the new plugin :
define command{
command_name check_stream
command_line $USER1$/check_stream -u "$ARG1$" -s "$ARG2$" -d "$ARG3$"
}

and finally configure your services file to something like this :

define service{
use generic-service
host_name yourhost
service_description CHECK_STREAM
is_volatile 0
max_check_attempts 3
normal_check_interval 5
retry_check_interval 1
contact_groups yourcontacts
notification_period 24x7
notification_options c,w,r
process_perf_data 1
check_command check_stream!"mms://someurl"!expected_dump_size!"/tmp/streamdump"
}
