Running Nagios plugins via SaltStack's peer communication system

So… my previous post was similar to this, but you most likely don’t want to run the salt-master and Nagios on the same server, so I had to find a way to let the Nagios server execute its plugins on hosts via the salt-master. This can be done using the Python client API and SaltStack’s own peer communication system.

First of all, read this : http://docs.saltstack.com/ref/peer.html

Then check out my wrapper here : https://github.com/mortis1337/nagios-plugins/blob/master/check_by_saltpeer.py

Yay! Now you can throw away NRPE forever and stop using ssh-keys for the Nagios user, if you are doing that already.

Nagiosplugins over zmq? I like it :)


Running nagios-plugins via saltstack

I’m so sick of maintaining NRPE config on my servers, and I don’t really want root ssh-keys all over the place. Recently I discovered SaltStack and started to play with it a bit. I came up with the idea of running Nagios (or Icinga) on the same server as my salt-master, so I created a little wrapper that lets me run Nagios checks via SaltStack.

Here’s how it works.

This is my little wrapper-script written in python: https://github.com/mortis1337/nagios-plugins/blob/master/check_by_salt.py

The wrapper takes a hostname, a plugin command line and a timeout value as arguments:

$ python check_by_salt.py -H examplehost -p "/path/to/existing/nagiosplugin arg1 arg2" -t 10

The wrapper imports salt, runs the command on the minion with cmd.run_all and returns the plugin’s output and exit code.
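To make the idea concrete, here is a minimal sketch of what such a wrapper can look like. It is not the actual check_by_salt.py; the function names run_check and main are made up for illustration, and it assumes salt.client.LocalClient is usable by the calling user. The client is passed in as an argument so the logic can be tried out without a running master.

```python
import argparse

UNKNOWN = 3  # Nagios exit code for "could not determine state"


def run_check(client, host, plugin, timeout):
    """Run the plugin command line on one minion; return (exit_code, output).

    `client` is expected to behave like salt.client.LocalClient: its cmd()
    returns a dict keyed by minion id. cmd.run_all gives a dict with
    'retcode', 'stdout' and 'stderr' for each minion that answered.
    """
    replies = client.cmd(host, "cmd.run_all", [plugin], timeout=timeout)
    result = replies.get(host)
    if not isinstance(result, dict):
        # The minion did not answer in time (or returned an error string).
        return UNKNOWN, "UNKNOWN: no reply from %s within %ss" % (host, timeout)
    return result["retcode"], result["stdout"] or result["stderr"]


def main(argv=None):
    """Hypothetical entry point using the same -H/-p/-t flags as above."""
    parser = argparse.ArgumentParser()
    parser.add_argument("-H", dest="host", required=True)
    parser.add_argument("-p", dest="plugin", required=True)
    parser.add_argument("-t", dest="timeout", type=int, default=10)
    args = parser.parse_args(argv)
    import salt.client  # imported late so run_check() works without salt
    code, output = run_check(salt.client.LocalClient(), args.host,
                             args.plugin, args.timeout)
    print(output)
    return code
```

A script would finish with sys.exit(main()) so that Nagios sees the plugin’s own exit code. Returning UNKNOWN when the minion is silent is a design choice of this sketch; you may prefer CRITICAL.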

For this to work as the nagios/icinga user, you will have to configure the client_acl for that user in the salt-master config, so go ahead and edit the master config file (default: /etc/salt/master).

Search for “client_acl” in the file and add this:

client_acl:
  icinga:
    - cmd.*

Yeeaaaap, that’s quite the security risk right there, but read up on how to limit what can be done with the cmd module in salt, and at least it will be safer than using ssh-keys :)

check_by_salt in combination with https://github.com/mortis1337/nagios-plugins/blob/master/check_disk_generic.py will instantly give you monitoring of all your disks with no clientside-configuration.

Use it if you like it and feel free to improve it.

Monitor Dell servers on Debian Squeeze with Nagios

I’m just writing up this post because the dellomsa packages aren’t working with the new Debian Squeeze 6.0.

I had problems with the omreport command not giving me info on e.g. memory/PSU/CPU (omreport chassis info said “No sensors found” etc.).

I spent some hours trying to get it working with a newer dellomsa, but that didn’t work either.
Then I found some official Dell Ubuntu packages, which turned out to work excellently on Debian Squeeze as well:
dpkg -P dellomsa #Make sure dellomsa isnt installed.
echo 'deb http://linux.dell.com/repo/community/deb/latest /' | sudo tee -a /etc/apt/sources.list.d/linux.dell.com.sources.list
apt-get update
apt-get install srvadmin-base smbios-utils

You will also need libsmbios2_2.2.13-0ubuntu4_amd64.deb from Ubuntu Lucid to get the smbios stuff working.
dpkg -i libsmbios2_2.2.13-0ubuntu4_amd64.deb
/etc/init.d/dataeng start #if this starts, omreport works!

Now you have the newer Debian Squeeze Dell stuff working.

We have deployed hardware monitoring of our Dell servers with check_openmanage and Nagios.
Read more about check_openmanage on the check_openmanage site (this is a great plugin btw!).

Resources:
http://folk.uio.no/trondham/software/check_openmanage.html
http://linux.dell.com/repo/community/deb/latest/


Automatically create bugs in Jira with a Nagios eventhandler

The most important part about this is… don’t use it too often, but it CAN make sense on really critical events, like warnings/criticals on partition space. For instance, if your MySQL server is running out of space on /var/lib/mysql and your operations team didn’t see the WARNING/CRITICAL notification from Nagios, it might be a good idea to have the bug created in Jira to make it even more visible.

Here’s how you do it.
First of all, be sure to have eventhandlers enabled in Nagios.

Configure your commands.cfg file to have something similar to this:

define command{
    command_name jira_eventhandler
    command_line $USER1$/jira_eventhandler -a morten -s $SERVICESTATE$ -t $SERVICESTATETYPE$ -A $SERVICEATTEMPT$ -H $HOSTNAME$ -S $SERVICEDESC$
}

Configure your services.cfg to have something similar to this:
define service{
    use generic-service
    host_name myhost
    service_description CHECK_DISK_ROOT
    is_volatile 0
    max_check_attempts 3
    normal_check_interval 10
    retry_check_interval 1
    contact_groups linux-admins
    notification_period 24x7
    notification_options c,w,r
    check_command check_remote_disk_nagios!10%!5%!/
    process_perf_data 1
    event_handler jira_eventhandler
    flap_detection_enabled 0
}

And be sure to have the jira_eventhandler script in place. You can download mine here : jira_eventhandler
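Since Nagios also runs event handlers on soft state changes, the handler has to decide for itself when a bug is actually worth filing. The following is a rough sketch of that decision, not the real jira_eventhandler script: the flags mirror the command_line above, but the meaning of -a (taken here to be the Jira assignee), the function names and the "HARD problem states only" rule are assumptions. The call to Jira itself is left out on purpose.

```python
import argparse


def parse_args(argv):
    """Parse the flags passed by the command definition above."""
    parser = argparse.ArgumentParser(description="Nagios event handler sketch")
    parser.add_argument("-a", dest="assignee")       # assumed: Jira assignee
    parser.add_argument("-s", dest="state")          # $SERVICESTATE$
    parser.add_argument("-t", dest="state_type")     # $SERVICESTATETYPE$
    parser.add_argument("-A", dest="attempt", type=int)  # $SERVICEATTEMPT$
    parser.add_argument("-H", dest="host")           # $HOSTNAME$
    parser.add_argument("-S", dest="service")        # $SERVICEDESC$
    return parser.parse_args(argv)


def should_create_issue(args):
    """Only file a bug once the problem is confirmed (HARD state)."""
    return args.state_type == "HARD" and args.state in ("WARNING", "CRITICAL")


def issue_summary(args):
    """Build a short title for the bug."""
    return "%s: %s on %s" % (args.state, args.service, args.host)
```

With max_check_attempts set to 3 as in the service above, this would stay quiet for the first two failed checks and only act when the third one turns the state HARD.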


Nagios Operations Dashboard

We had the idea of putting up an operations monitoring screen at our workplace so we could spot alerts from Nagios more efficiently.
We found a PHP Nagios dash app which parses status.dat (Nagios Dashboard – PHP).
We modified it to only show criticals/warnings, made an AJAX interface for refreshing/showing the data, and modified the CSS a bit (thanks Jonas!).

Here is the modified version:
dash.tar

To install, just extract this on your Nagios server and edit $file in nagios_get.php to point to your Nagios status.dat file :)
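If you are curious what "parsing status.dat and only showing criticals/warnings" boils down to, here is a small sketch of the idea. It is written in Python rather than PHP and is not taken from the dashboard; the function names and the sample data are made up. It relies on status.dat consisting of "type { key=value ... }" blocks, where servicestatus blocks carry host_name, service_description and current_state (1 is WARNING, 2 is CRITICAL).

```python
def parse_status(text):
    """Split status.dat content into a list of dicts, one per block."""
    blocks = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if current is None and line.endswith("{"):
            # Start of a block, e.g. "servicestatus {"
            current = {"_type": line[:-1].strip()}
        elif current is not None and line == "}":
            blocks.append(current)
            current = None
        elif current is not None and "=" in line:
            key, value = line.split("=", 1)
            current[key] = value
    return blocks


def problems(blocks):
    """Keep only services in WARNING (1) or CRITICAL (2) state."""
    return [b for b in blocks
            if b["_type"] == "servicestatus"
            and b.get("current_state") in ("1", "2")]


# Made-up sample in the status.dat layout
SAMPLE = """
servicestatus {
\thost_name=web1
\tservice_description=CHECK_DISK_ROOT
\tcurrent_state=2
\tplugin_output=DISK CRITICAL - free space: / 1 GB (4%)
\t}

servicestatus {
\thost_name=web2
\tservice_description=PING
\tcurrent_state=0
\t}
"""

for svc in problems(parse_status(SAMPLE)):
    print(svc["host_name"], svc["service_description"], svc["plugin_output"])
```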

Here is a demo of the dash in real surroundings:

operations dashboard


A quick note about the Nagios check_procs plugin

I had an issue today where check_procs returned “0 processes with command name ‘some-process-name’”. When I did a ps -ef | grep some-process-name I could see the process running, so I was like “wtf?!”, but then a colleague told me he had seen this before. It seems check_procs uses something like ps -cA to look for running processes, and with those parameters you only see part of the process name, so while it wouldn’t match “some-process-name” it would match a truncated name like “some-process-n”. So if you wonder why it returns 0 processes, do a ps -cA to find the right process name to match in your config.
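If you want to predict the name to put in your config, a tiny helper like the one below can do it. This is only a sketch under one assumption: on Linux the kernel keeps at most 15 characters of the command name, so the exact cut-off on your system may differ from the example above. The function name short_name is made up.

```python
COMM_MAX = 15  # assumption: Linux keeps at most 15 characters of the name


def short_name(command_line):
    """Return the command name roughly as ps -cA would show it."""
    executable = command_line.split()[0]        # drop the arguments
    base = executable.rsplit("/", 1)[-1]        # drop the path
    return base[:COMM_MAX]


print(short_name("/usr/sbin/some-process-name --daemon"))
```

Checking the result against a real ps -cA is still the safest way to get the name right.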


htop, great alternative to regular top

Recently discovered “htop”, a fancier “top”.

Check it out : apt-get install htop

htop screenie


Monitoring web apps using html forms for logging in

Several times I’ve wanted to monitor whether login actually works on web applications that use HTML forms for user validation. Today I made a simple bash Nagios plugin to do that. It uses curl, which supports POST variables, and checks its output.
The plugin also grabs the unixtime before and after the curl command is run, then does an expr to find the diff and turns it into Nagios performance data.

Feel free to copy paste the code and use it for your own purpose.

#!/bin/sh
#my %ERRORS=('OK'=>0,'WARNING'=>1,'CRITICAL'=>2,'UNKNOWN'=>3,'DEPENDENT'=>4);
#This is what we get after a successful login
MATCH="bladiblah successful login bladiblah"
#delete the tmp file before writing a new one (-f: no error if it is missing)
rm -f /tmp/login_check
#Grab the unixtime before the command runs
BEFORE=`date +%s`
#Login to the app
/usr/bin/curl -H "host: example.com" -F mail="mail@example.com" -F userPassword="fail" http://111.111.111.111 -o /tmp/login_check > /dev/null 2>&1
#Grab the unixtime after the command has been run
AFTER=`date +%s`
DIFF=`/usr/bin/expr $AFTER - $BEFORE`
#Check if the tmp file contains our successful login text
#(grep -q matches anywhere in the page, not only a line equal to $MATCH)
if grep -q "$MATCH" /tmp/login_check; then
#WIN!!
echo "OK. Login successful. | response=$DIFF"
exit 0
else
#FAIL!!
echo "CRITICAL. Login failed."
exit 2
fi

And as always, sorry about the WordPress formatting of the code.
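One limitation of date +%s is that it only gives whole seconds, so a fast login shows up as response=0. If you want finer timing, the same logic can be sketched in Python. This is an illustration only, not a replacement I have run in production: check_login and the fetch callable are made-up names, and fetch stands for whatever function posts the form and returns the page body as a string.

```python
import time


def check_login(fetch, match):
    """Time one login attempt and return (nagios_exit_code, message).

    `fetch` is a hypothetical callable that performs the login request
    and returns the response body as text.
    """
    start = time.monotonic()
    body = fetch()
    elapsed = time.monotonic() - start
    perfdata = "response=%.3fs" % elapsed
    if match in body:
        return 0, "OK. Login successful. | " + perfdata
    return 2, "CRITICAL. Login failed. | " + perfdata
```

Nagios accepts the "s" unit in performance data, so the value can be graphed in seconds with millisecond resolution.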


Monitoring Dell hardware in Nagios on the Debian Etch 64bit Platform

Getting Dell’s Linux software to work on platforms other than Red Hat can be a bitch, but I was lucky to come across this article. Following those steps, I had dellomsa up and running a couple of minutes later.
Then I went to nagiosexchange and downloaded the Perl script check_dell_sensors.pl.

In this case I use check_by_ssh to run the check on each host, so here’s the command.cfg setup

define command{
command_name    check_dell_sensors
command_line    $USER1$/check_by_ssh -l someuser -t 30 -H $HOSTADDRESS$ -C "/usr/lib/nagios/plugins/check_dell_sensors.pl"
}

and finally the services.cfg


define service{
use                             generic-service
hostgroup_name                  cfg_CHECK_DELL_SENSORS
service_description             CHECK_DELL_SENSORS
is_volatile                     0
max_check_attempts              3
normal_check_interval           5
retry_check_interval            1
contact_groups                  linux-admins
notification_period             24x7
notification_options            c,w,r
check_command                   check_dell_sensors
}

This will monitor your dell hardware on all hosts in your cfg_CHECK_DELL_SENSORS hostgroup, giving you an output like :
/usr/lib/nagios/plugins/check_dell_sensors.pl
OK -- Hardware Log=Ok; Memory=Ok; Power Supplies=Ok; Processors=Ok; Temperatures=Ok; Voltages=Ok;


Tools for monitoring I/O

Alright, so you wanna check I/O on your machine, where it all goes and all that kinda shit.

apt-get install iotop sysstat

Then you have two tools for checking your I/O problems:

First one: iotop

This one shows which processes read and write, and how much they read/write.
It is really good if you wonder which procs are making your machine stall.

Second one: iostat -m 1

This one shows you the I/O for devices and partitions. Very nice if you wanna check which disks are being read from and written to in a sw-raid or LVM array.

Please comment if you know other useful tools worth mentioning :)
