INTRODUCTION

The Eye Of Horus is a monitoring and alerting tool for computers. It's mainly useful for monitoring network services (eg, HTTP or SMTP servers) and the internal status of Unix servers (eg, load, disk usage, process counts).

In that respect, it's a lot like [Nagios](http://www.nagios.org/), but in my opinion it's better. It lacks a few features Nagios has, but it has a very simple architecture to which they can easily be added.

It's a flexible thing made from independent modules with well-defined interfaces, making it easy to customise and extend, but out of the box it'll monitor your servers and produce a nice [HTML summary][status-screenshot] of their status (OK, the looks need a bit of work, but that will come soon). It can optionally integrate with the excellent (and I mean excellent) [RRDTool][rrdtool] to store logs of statistics (response times, number of packages with known security holes, etc) and link from the status page to nice graphs of the historical behaviour of these statistics.

Also, it's really easy to add new service checks to it.

HOW IT WORKS

The core of the system is `horus-check.py`, a Python script which reads a configuration file (specified on the command line). The configuration file specifies a list of services - either network services, in which case the host to run the check from and the host to run the check 'at' are specified, or local services, in which case only the host to run the check from need be specified. In either case, if the host to run the check from is not specified, then it defaults to the local host.

Each service's type refers to a definition in a service types file, which is itself named in the configuration file. The service definition gives a shell command to check the service; this command must output the service status in a defined format, as a single-line YAML list. The list must contain, at least, a single-word status (OK, WARNING, FAILURE, or UNKNOWN), then optionally numeric statistics, then optionally a status message. For example:

      [OK]
      [UNKNOWN]
      [OK, { load: 0.5, users: 3 }]
      [WARNING, { load: 3, users: 30 }]
      [FAILURE, { load: 95, users: 300 }]
      [UNKNOWN, { }, Could not find AWK executable]
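
If you write a service checker in Python rather than shell, producing that line might look roughly like the sketch below. This is my own illustration, not code shipped with Horus; the function name `format_status` is hypothetical, and it assumes the message contains no characters that YAML would need quoted.

```python
ALLOWED = ("OK", "WARNING", "FAILURE", "UNKNOWN")

def format_status(status, stats=None, message=None):
    """Return one status line: [STATUS], [STATUS, { stats }] or
    [STATUS, { stats }, message]."""
    if status not in ALLOWED:
        raise ValueError("unknown status: %r" % (status,))
    parts = [status]
    # The statistics mapping doubles as a placeholder when only a
    # message follows, as in the last example above.
    if stats or message is not None:
        pairs = ", ".join("%s: %s" % (k, v) for k, v in (stats or {}).items())
        parts.append("{ %s }" % pairs if pairs else "{ }")
    if message is not None:
        # Assumption: the message needs no YAML quoting.
        parts.append(message)
    return "[%s]" % ", ".join(parts)

print(format_status("OK", {"load": 0.5, "users": 3}))
```

A checker would print exactly one such line on stdout and exit.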

When a check is to be performed from a remote host, Horus opens an ssh connection to that host. It is assumed that the user Horus runs as has an ssh key set up, so that it can ssh to all such hosts without being asked for a password.
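
To make the local/remote distinction concrete, here is a minimal sketch of how a check command could be dispatched. It is an assumption about the general approach, not the actual code in `horus-check.py`; `build_check_argv` and `run_check` are hypothetical names. `BatchMode=yes` is a standard OpenSSH option that makes ssh fail instead of prompting for a password.

```python
import subprocess

def build_check_argv(command, from_host=None):
    """Build the argument list used to run a check command."""
    if from_host is None:
        # No 'from' host given: run the check on the local host.
        return ["sh", "-c", command]
    # Remote check: ssh passes the command to the remote user's shell.
    return ["ssh", "-o", "BatchMode=yes", from_host, command]

def run_check(command, from_host=None, timeout=30):
    """Run the check and return its single line of output."""
    result = subprocess.run(
        build_check_argv(command, from_host),
        capture_output=True, text=True, timeout=timeout,
    )
    return result.stdout.strip()
```

With key-based login working, `run_check("uptime", "db.example.com")` would return whatever the remote command printed.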

Having performed the checks, horus-check.py then:

  1. Reads in the status database named in the configuration file
  2. Updates the status database with the new status of hosts
  3. Computes an overall system status (the worst non-unknown status of any checked service)
  4. Examines the service dependencies, and automatically marks as 'quiet' any service whose state is no worse than might be expected (eg, no worse than the worst state of a service it depends upon)
  5. Computes a list of differences between the old and new status (services added, services removed, services whose status has improved, services whose status has worsened)
  6. If there are any differences, invokes a notification script (named in the configuration file) with them, along with the overall status
  7. Invokes a logging script (named in the configuration file) with the new value of every statistic reported by the service checks; I will soon provide a sample logging script that uses RRDTool to generate nice graphs.
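
Steps 3 and 5 above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the real implementation: statuses are held in a plain dict of service name to status, an all-unknown system is reported as UNKNOWN, and UNKNOWN is ranked between OK and WARNING when deciding whether a service improved or worsened. None of those choices is confirmed by Horus itself.

```python
SEVERITY = {"OK": 0, "WARNING": 1, "FAILURE": 2}
# Assumption: ordering used only to compare old and new states.
DIFF_RANK = {"OK": 0, "UNKNOWN": 1, "WARNING": 2, "FAILURE": 3}

def overall_status(statuses):
    """Worst non-unknown status of any checked service."""
    known = [s for s in statuses.values() if s in SEVERITY]
    if not known:
        return "UNKNOWN"  # assumption: nothing known means UNKNOWN
    return max(known, key=SEVERITY.get)

def diff_statuses(old, new):
    """Group service names by how they changed between two runs."""
    changes = {"added": [], "removed": [], "improved": [], "worsened": []}
    for name in sorted(set(old) | set(new)):
        if name not in old:
            changes["added"].append(name)
        elif name not in new:
            changes["removed"].append(name)
        elif DIFF_RANK[new[name]] < DIFF_RANK[old[name]]:
            changes["improved"].append(name)
        elif DIFF_RANK[new[name]] > DIFF_RANK[old[name]]:
            changes["worsened"].append(name)
    return changes
```

If every list in the result is empty there is nothing to report, which corresponds to step 6 only firing when there are differences.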

The status database (which is written in YAML, so easily accessible to user scripts) can then be used to generate an HTML status report (see [status.cgi][status-screenshot]).

INSTALLATION

Requires [PyYAML](http://pyyaml.org/wiki/PyYAML)

Copy and edit `example.conf` to suit your setup. Perhaps edit `types.conf` to add extra service types, if required, or change the commands to work on your systems.

Write your own change notification script(s) that accept a human-readable summary of the changes on stdin and do something useful with it, like emailing or SMSing it on, then reference them in the `notify-commands` field of the configuration file.
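
As a starting point, a notification script could wrap the summary in an email. The sketch below is hypothetical throughout: the addresses, the subject format, and the function names are my own, and the exact text Horus writes to stdin is not specified here, so the script just treats it as opaque text.

```python
import sys
from email.message import EmailMessage

def build_message(summary, sender, recipient):
    """Wrap the change summary read from stdin in an email message."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    lines = summary.strip().splitlines()
    # Use the first line of the summary as the subject.
    msg["Subject"] = "Horus: " + (lines[0] if lines else "(no changes)")
    msg.set_content(summary)
    return msg

def main():
    # Placeholder addresses; replace with your own.
    message = build_message(sys.stdin.read(),
                            "horus@example.com", "ops@example.com")
    # This sketch only prints the message; to actually send it, use
    # smtplib.SMTP("localhost").send_message(message) instead.
    sys.stdout.write(message.as_string())
```

Saved as a script with a call to `main()` at the end and made executable, it could be listed under `notify-commands`.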

Write your own parameter change notification script(s) that accept command line arguments like the supplied sample `log.sh` and do something useful with them, like updating an RRDTool log, then reference them in the `param-log-commands` field of the configuration file.

Write your own scripts that parse the file specified in the `status-database` field of the configuration and produce funky system status displays. Try `status.cgi` as a starting point.

Run `python horus-check.py <myconfig>` at regular intervals, perhaps every five minutes from cron.

Set up `status.cgi` somewhere Apache will find it (edit it to point to the correct location of your `status.db` file) and you'll have a status report accessible via the Web. You can give GET parameters on the URL to filter the results:

  • `host=`hostname (to only show services on that host)
  • `type=`type (to only show services of that type)
  • `status=`OWUF (to only show services in a given set of statuses, eg WUF to only show warning, unknown, or failed services)
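
If you write your own status display, the same filtering can be done in a few lines. This is a sketch, not the code in `status.cgi`: it assumes each service is a dict with `host`, `type` and `status` keys (the real status database layout may differ), and the name `filter_services` is hypothetical.

```python
from urllib.parse import parse_qs

# One letter per status, as in the status= parameter.
STATUS_LETTERS = {"O": "OK", "W": "WARNING", "U": "UNKNOWN", "F": "FAILURE"}

def filter_services(services, query_string):
    """Return the services matching the host=, type= and status= filters."""
    params = parse_qs(query_string)
    host = params.get("host", [None])[0]
    svc_type = params.get("type", [None])[0]
    letters = params.get("status", [None])[0]
    wanted = None
    if letters is not None:
        wanted = {STATUS_LETTERS[c] for c in letters.upper()
                  if c in STATUS_LETTERS}
    shown = []
    for svc in services:
        if host is not None and svc.get("host") != host:
            continue
        if svc_type is not None and svc.get("type") != svc_type:
            continue
        if wanted is not None and svc.get("status") not in wanted:
            continue
        shown.append(svc)
    return shown
```

In a CGI script the query string would come from the `QUERY_STRING` environment variable.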

All the files are in YAML format, and have fairly self-explanatory structures, although I shall document them when they stabilise...

CONFIGURATION

The configuration file is in [YAML](http://www.yaml.org).

It has two top-level headings, `services` and `config`. Under `services` should be a list of services to check, and under `config` the paths to various other files are specified:

    services:
      - type: load
        params-ok: { load1: [0,2] }
      - type: zombies
    config:
      status-database: status.db
      status-conf:
        status.cgi:
          rrdbase = rrd
      param-log-commands: [./rrdlog.py]
      param-log-conf:
        rrdlog.py:
          rrdbase = rrd
          step = 300
      notify-commands: [./smsnotify.py]
      notify-conf:
        smsnotify.py:
          to = 44555123456
      type-database: types.conf
      log: horus.log

Translated, that says to check the local load (counting the one-minute load average parameter returned by the load service checker as 'OK' when it is within the range 0..2 inclusive, overriding the default set in the service types file), and to check the local zombie count. Various file names are then specified in the `config` section, along with configuration for other components of the system: `status-conf` is copied verbatim into the status database file, to configure tools that process it; `param-log-conf` is passed to the parameter logging commands; and `notify-conf` is passed to the notify commands.

Note that parameter ranges may be made open-ended, by using `-inf` as the minimum or `+inf` as the maximum.
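
A range check that copes with open-ended bounds is short in Python, because `float()` accepts the strings `-inf` and `+inf` directly. The sketch below is mine, not taken from Horus; it assumes a range arrives as a two-element list whose bounds may be numbers or strings, depending on how the YAML loader treats them.

```python
def in_range(value, bounds):
    """True if value lies within [low, high], both ends inclusive.

    Bounds may be numbers or the strings '-inf' / '+inf'.
    """
    low, high = (float(b) for b in bounds)
    return low <= float(value) <= high

print(in_range(1.5, [0, 2]))        # inside a closed range
print(in_range(1e9, [0, "+inf"]))   # no upper limit
```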

Every service declaration must specify a type, but all other fields are optional. A full list of fields used by Horus itself is given in this example:

However, any other fields mentioned are passed to the service checker itself.

The `status-conf` parameter of a service is copied verbatim to the status database file; it should be a list of tools that view the status database, with per-service configuration for each beneath. In this case, we ask the `status.cgi` Web reporting module to link this service to a specified URL.

Also, services may have child services. The child services are those that depend on the 'parent'; if a service's status is no worse than that of the services it depends upon, then it is not considered worth notifying about, and is automatically marked as 'quiet'. Eg, a database server might have a dynamic Web site as a child service. If the database enters a WARNING state due to overload, and the web site then goes into WARNING because its response time is worsening, no alert is raised for the web site; however, if the web site went to FAILURE while the database was still just WARNING, this would generate a notification, since the web site is in a worse state than would be expected from the state of the database.

This is specified like so:

    - type: pgsql
      host: db.example.com
      user: horus
      pass: fnargle
      children:
        - type: http
          host: www.example.com
          url: /test-db.php
          success-regex: "Database is OK"
        - type: http
          host: internate.example.com
          error-regex: "Database error"
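
The 'quiet' rule can be sketched as a walk down the service tree. This is an illustration under assumptions, not the Horus source: each service is a dict with a `status` and an optional `children` list, top-level services are never quiet, and a service whose status is UNKNOWN is neither marked quiet nor used to excuse its children.

```python
SEVERITY = {"OK": 0, "WARNING": 1, "FAILURE": 2}

def mark_quiet(service, worst_parent=None):
    """Set service['quiet'] for a service and all of its descendants.

    worst_parent is the severity of the worst service this one
    depends upon, or None at the top of the tree.
    """
    rank = SEVERITY.get(service["status"])  # None for UNKNOWN
    service["quiet"] = (
        rank is not None
        and worst_parent is not None
        and rank <= worst_parent
    )
    # Work out the worst state seen so far to pass to the children.
    if rank is None:
        passed_down = worst_parent
    elif worst_parent is None:
        passed_down = rank
    else:
        passed_down = max(rank, worst_parent)
    for child in service.get("children", []):
        mark_quiet(child, passed_down)
```

With the database at WARNING, a web site child at WARNING comes out quiet, while one at FAILURE does not.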

The service types file is much simpler. See the supplied `types.conf` for an example. Each service type has just three properties:

    zombies:
      command: |
        ps -ax | awk -- " BEGIN { count = 0} { if (\$3==\"Z\") count = count +1; } END { print \"[OK, { zombies: \" count \" }]\" }"
      params-ok: { zombies: [0,5] }
      params-warn: { zombies: [0,20] }

The `command` property gives the shell command to run. `params-ok` lists the ranges of parameter values which are considered OK, and `params-warn` lists the ranges of parameter values which, if not already considered OK, are considered worthy of a warning. Any parameter value outside both ranges is considered a FAILURE.

Note that the zombies service always inherently reports 'OK', but that this may be overridden by the system if the parameters are out of range. This is in contrast to the system Nagios uses, where each service checker plugin is responsible for having allowed ranges specified to it as command line parameters, and for computing its own resulting status by doing the range checks itself. Horus avoids duplicating effort by keeping the service checker simple, and having the system worry about acceptable ranges.
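
Here is roughly how that override could work, as a sketch rather than the actual Horus logic. It assumes the final status is the worse of what the checker reported and what the ranges imply, that parameters with no declared range are ignored, and that an UNKNOWN report is passed through untouched; `classify` is a hypothetical name.

```python
SEVERITY = {"OK": 0, "WARNING": 1, "FAILURE": 2}

def within(value, bounds):
    """True if value lies in the inclusive range [low, high]."""
    low, high = bounds
    return low <= value <= high

def classify(reported, params, params_ok, params_warn):
    """Combine the checker's own status with the range checks."""
    if reported not in SEVERITY:
        return reported  # assumption: UNKNOWN is left alone
    worst = reported
    for name, value in params.items():
        if name not in params_ok and name not in params_warn:
            continue  # no range declared for this parameter
        if name in params_ok and within(value, params_ok[name]):
            verdict = "OK"
        elif name in params_warn and within(value, params_warn[name]):
            verdict = "WARNING"
        else:
            verdict = "FAILURE"
        if SEVERITY[verdict] > SEVERITY[worst]:
            worst = verdict
    return worst
```

With the zombies ranges above, 3 zombies stays OK, 12 becomes WARNING and 50 becomes FAILURE, even though the checker itself always says OK.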

[status-screenshot]: http://www.kitten-technologies.co.uk/projects/horus/website/status.html

[rrdtool]: http://oss.oetiker.ch/rrdtool/

