Automatic Monitoring with Puppet and Nagios

Part of my current push at work is getting a configuration management system in place and in a usable state. One reason for wanting this is consistency of common configuration among the many systems and virtual machines I manage, especially when several serve essentially the same function (such as webserver VMs). Since I’m a fallible human being, it’s easier for me to get one webserver configured so that the non-content part of the setup always passes our security scans, and then replicate that to all the other webservers. This not only saves me time and makes me more efficient, but also helps when we do system audits, as I can easily prove that all N systems I manage that serve the same role have identical configurations.

Another push I’ve been working on for a few years now is comprehensive monitoring of our systems. When I started at my job, monitoring was done through a home-grown script that ran once an hour. While not particularly bad, it had some major shortcomings. Primarily, the hourly interval meant that if a problem occurred shortly after the script ran, you wouldn’t know about it for nearly an hour. Another problem was that, while the script would make note of system problems, the only notification mechanism was “someone needs to visit the web page it generates”. During a normal workday, that might be OK, but evenings and weekends? So, at least for my benefit, I set up a monitoring system that could not only check systems more frequently, but also cover more systems and send out alerts via email, cell phones, etc.

For reasons I won’t get into here, what I chose to use for configuration management was Puppet, and I’d already chosen Nagios as my monitoring system. So far, my biggest problem with Nagios has been finding the time to add new systems to it, figuring out what services to check, etc. It’s not a particularly difficult thing to do, but in the grand scheme of things, it was just something that always fell by the wayside in the drive to get more systems set up, deal with user problems, and put out the inevitable fires. That is, until recently.

My original Nagios install was done on an aging box that has been serving as a development web server for our group. I did the install shortly after I’d started, so had I been around longer, I might have either waited, put in a request for new hardware, or found a way to get virtualization into our environment sooner. Overall, it worked, but it was in general a pain dealing with the box, a RHEL3 server, since it was so far out of date. Recently, I’ve set up a VM on our XenServer cluster to be our new Nagios box. Since I’ve also been playing with Puppet, I wanted to automate things as much as possible, since every new system created should be managed by Puppet (though reality slightly differs from that).

Fortunately, the groundwork for automating Nagios monitoring with Puppet is already built into Puppet. It took me a little while to wrap my head around the concepts, but the Exported Resource example helped and served as a base.

Now, even though I’m still in the process of setting things up, I’ve gotten to the point where my new Nagios server is already monitoring about 37% more services (422 vs 308 on the old server), even though it currently covers only about half as many hosts (48 vs 99 on the old). All of it is done automatically, and here’s how.

First, set up stored configurations on your puppetmaster. You’ll need to specify a database in which to store your puppet-collected facts and resources. While the default is SQLite, I ran into problems with concurrent access. Since I’m also currently responsible for the handful of MySQL servers we have, I decided to just use one of those. Create a database and user for puppet to use, then tell the puppetmaster about it. Your [puppetmasterd] section of puppet.conf should look something like this when you’re done:

[puppetmasterd]
templatedir = /var/lib/puppet/templates
storeconfigs = true
dbadapter = mysql
dbuser = the_user_you_set_up
dbpassword = the_password_for_dbuser
dbserver = the_database_server
#dbsocket = /var/run/mysqld/mysqld.sock
downcasefacts = true
 

Your paths will likely differ from mine. If your DB server is running on the same host as the puppetmasterd process, dbserver should be “localhost”, and you’d uncomment and adjust the path on the dbsocket line. The downcasefacts line is set to “true” so that I can use the $operatingsystem fact later on without having to change its case.
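Before building the Nagios module on top of this, it can be worth confirming that exporting and collecting actually work. The following is only a minimal sketch, not something from my own manifests: the tag name and the /tmp path are made up for illustration, and it assumes at least one client has completed a puppet run against the storeconfigs-enabled puppetmaster before the collecting node runs.

```puppet
# Smoke test for stored configurations (hypothetical tag and file path).

# Declared on every client, e.g. in a class that all nodes include:
# the @@ prefix exports the resource to the database instead of
# applying it on the client itself.
@@file { "/tmp/storeconfigs-test-${hostname}":
   ensure  => present,
   content => "exported by ${fqdn}\n",
   tag     => 'storeconfigs_test',
}

# Declared on one collecting node only:
# realize every marker file that any client has exported so far.
File <<| tag == 'storeconfigs_test' |>>
```

If the collecting node ends up with one file per client under /tmp, the database settings are working and the same export/collect pattern should carry over to the nagios_* resource types.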

Next, you’ll want to create a nagios module in Puppet. The Exported Resource example linked above served as my template, but I’ve made a few changes to it. My puppet/modules/nagios/manifests/init.pp file currently looks like this:

class nagios {

   package {
      'nagios3':
         ensure  => installed,
         alias   => 'nagios',
         ;
   }

   service {
      'nagios3':
         ensure  => running,
         alias   => 'nagios',
         hasstatus       => true,
         hasrestart      => true,
         require => Package[nagios],
   }

   # collect resources and populate /etc/nagios/nagios_*.cfg
   Nagios_host <<||>>
   Nagios_service <<||>>
   Nagios_hostextinfo <<||>>

   class target {
      @@nagios_host { $fqdn:
         ensure => present,
         alias => $hostname,
         address => $ipaddress,
         use => "generic-host",
      }

      @@nagios_hostextinfo { $fqdn:
         ensure => present,
         icon_image_alt => $operatingsystem,
         icon_image => "base/$operatingsystem.png",
         statusmap_image => "base/$operatingsystem.gd2",
      }

      @@nagios_service { "check_ping_${hostname}":
         use => "check_ping",
         host_name => "$fqdn",
      }

      @@nagios_service { "check_users_${hostname}":
         use => "remote-nrpe-users",
         host_name => "$fqdn",
      }

      @@nagios_service { "check_load_${hostname}":
         use => "remote-nrpe-load",
         host_name => "$fqdn",
      }

      @@nagios_service { "check_zombie_procs_${hostname}":
         use => "remote-nrpe-zombie-procs",
         host_name => "$fqdn",
      }

      @@nagios_service { "check_total_procs_${hostname}":
         use => "remote-nrpe-total-procs",
         host_name => "$fqdn",
      }

      @@nagios_service { "check_swap_${hostname}":
         use => "remote-nrpe-swap",
         host_name => "$fqdn",
      }

      @@nagios_service { "check_all_disks_${hostname}":
         use => "remote-nrpe-all-disks",
         host_name => "$fqdn",
      }
   }
}
 

To use it, I simply do an include nagios in the node definition for my Nagios server in puppet, and in my basenode node definition, I’ve done an include nagios::target. Each of the @@ resources is exported by every puppet-managed machine that inherits from basenode, which records that machine’s details in the stored-configuration database rather than applying anything on the machine itself. The “collect resources and populate /etc/nagios/nagios_*.cfg” portion is the real magic, however: each of those lines causes puppet to collect all the matching exported resources and write them to files in /etc/nagios. The only real caveat, which I also noticed in the example I built upon, is that I’m having trouble convincing puppet to reload nagios when the files are updated. I brute-force solved that with a periodic cronjob that runs nagios’ init script with “reload”.
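One possible alternative to the cronjob is to let the collectors themselves signal the service. What follows is a sketch rather than what I currently run: it assumes the Debian/Ubuntu nagios3 package, and the init script path in the restart parameter is an assumption you should check on your own system. Only the service and collector parts differ from the listing above, and the nested target class is left out for brevity.

```puppet
# Sketch of a variant of the nagios class (target subclass omitted).
class nagios {

   package { 'nagios3':
      ensure => installed,
      alias  => 'nagios',
   }

   service { 'nagios3':
      ensure    => running,
      alias     => 'nagios',
      hasstatus => true,
      # On a refresh, run "reload" instead of a full stop/start.
      # The path is an assumption based on the Debian/Ubuntu package.
      restart   => '/etc/init.d/nagios3 reload',
      require   => Package['nagios3'],
   }

   # Collect the exported resources, and notify the service whenever
   # one of the collected resources changes the generated files.
   Nagios_host <<||>>        { notify => Service['nagios3'] }
   Nagios_service <<||>>     { notify => Service['nagios3'] }
   Nagios_hostextinfo <<||>> { notify => Service['nagios3'] }
}
```

The braces after each collector override an attribute on every collected resource, so the notify applies to all hosts and services at once instead of having to be added to each exported definition.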

I’m also slowly adding nagios entries for each service that puppet manages in some form. Currently, that means things like apache and ssh. For example, in my apache2 module’s init.pp, I have the following in my class:

   @@nagios_service { "check_http_${hostname}":
      use => "check-http",
      host_name => "$fqdn",
   }

   @@nagios_service { "check_http_processes_${hostname}":
      use => "remote-nrpe-httpd-procs",
      host_name => "$fqdn",
   }
 

This monitors both over-the-wire connections to port 80 on the webservers, via check-http, and the number of httpd processes running on each host, via remote-nrpe-httpd-procs.

Similarly, for ssh, I have:

   @@nagios_service { "check_ssh_${hostname}":
      use => "check-ssh",
      host_name => "$fqdn",
   }
 

to monitor whether sshd is accepting connections on my systems.
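One thing the listings above take for granted: every use value (check-ssh, remote-nrpe-load, remote-nrpe-zombie-procs and so on) names a Nagios service template, and those templates have to exist in the Nagios configuration or Nagios will refuse to start. A minimal sketch of managing one such template from puppet could look like the following. The file path, the generic-service parent, and the check_nrpe_1arg and check_zombie_procs command names are all assumptions based on a stock Debian/Ubuntu nagios3 and NRPE install, not something taken from my own setup.

```puppet
# Hypothetical: ship a template file so that the
# "remote-nrpe-zombie-procs" name used by the exported services resolves.
file { '/etc/nagios3/conf.d/service_templates.cfg':
   ensure  => present,
   owner   => 'root',
   group   => 'root',
   mode    => '0644',
   # Plain Nagios object syntax; "register 0" marks it as a template.
   content => "define service {
   name                 remote-nrpe-zombie-procs
   use                  generic-service
   service_description  Zombie processes
   check_command        check_nrpe_1arg!check_zombie_procs
   register             0
}
",
   require => Package['nagios3'],
   notify  => Service['nagios3'],
}
```

Each of the other template names would need a similar definition, with the check_command adjusted to whatever commands your Nagios server and NRPE agents actually define.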

And that, basically, is how I’m automatically monitoring all puppet-managed hosts in my environment. Whenever I set up a new host, I activate puppet on the host to ensure configurations I care about are synced to my master templates, and now as a bonus, puppet automatically tells nagios to start monitoring the services it knows about on the host. By expending a little extra effort once now, I’ve managed to be lazy later on multiple times over, truly something a Systems Administrator should be doing!

27 Comments

  5. Hi,

    You should look at Shinken. It’s an enhanced Nagios reimplementation in Python that gives you a quick and easy distributed, high-availability monitoring environment, and of course it is compatible with Nagios configuration and plugins :)

    It’s available (open source, under the AGPL licence) at www.shinken-monitoring.org, with even a demo virtual machine to test it in 5 minutes :)

    Jean Gabès, Shinken developer

  17. You said:

    “The only real caveat, which I also noticed in the example I built upon, is that I’m having trouble convincing puppet to reload nagios when the files are updated, which I just brute-force solved with a periodic cronjob to run nagios’ init script with “reload”.”

    I’m sure you already know this after two years, but just in case you don’t: add

    notify => Service['nagios3']

    to each of your resource imports. E.g., you would do:

    Nagios_host <<||>> {
    notify => Service['nagios3']
    }

    This way, when one of your files actually is updated, the nagios3 service will be restarted.

    Wouter Verhelst

  19. Hi

    I’m following your instructions but when nagios3 tries to start I get the following error:

    Error: Template ‘nrpe-zombie-procs’ specified in service definition could not be not found (config file ‘/etc/nagios3/conf.d/nagios_host.cfg’, starting on line 169)
    Error processing object config files!

    I think this happens because this service definition is not defined anywhere.

    Did I miss something?

    Thank you for your time

    Regards, Juan Sierra Pons
