SourceFiles.org - Use the Source, Luke

Presenting version 4.0 of urlmon, a URL monitor.

PLEASE read this file, at least until the line that says: "1. Description". There is one WARNING and one NOTE that you must read. Versions 3.x and higher are not completely compatible with earlier versions, so you need to know what to do.

Written by Jeremy Impson <jdimpson@acm.org>

with suggestions, bug reports/fixes, help, and/or code from:

  Jeff Lightfoot        <jeffml@pobox.com>      
  Markus Kohler         <kohlerm@betze.bbn.hp.com>
  Fred Maciel           <fred-m@crl.hitachi.co.jp>
  Jauder Ho             <jauderho@transmeta.com>
  Michael Wiedmann      <mw@miwie.in-berlin.de>
  Ted Serreyn           <tserreyn@pop.globaldialog.com>
  Eric Raymond          <esr@thyrsus.com>
  Bill Dyess            <bill@dyess.com>
  Robin Houston         <robin@oneworld.org>
  Peter Mardahl         <peterm@langmuir.EECS.Berkeley.EDU>
  Robert Richard George 'reptile' Wal
                        <reptile@reptile.eu.org>
  • See the file CHANGELOG.txt for version history and a list of new features for this release.
  • See the file FILTERS.txt to see how to use the filtering capability of urlmon.
  • See the file URLMONRC.txt for a full description of the urlmonrc format.
  • See the file MODULES.txt for information on getting the right Perl Modules in order to run urlmon.
  • See section 'i. Parallel URL monitoring' of this file for information on the optional SystemV IPC for intercommunication between the parallel urlmon children (as opposed to the default method, writing out temporary files), among other things.

*-*->>WARNING<<-*-*

For versions 3.0 and higher, the format of the urlmonrc (aka last modified database) file has changed. Recent versions WILL convert it for you, but previous versions of urlmon WILL CHOKE VIOLENTLY on the new format, very likely destroying all your data.

BACK UP YOUR urlmonrc FILE IF YOU ARE UPGRADING FROM ANYTHING PRIOR TO VERSION 3.0 BEFORE RUNNING VERSION 3.0 !!!!!!!!!!!!!!!!!!!!!!!!!!!

NOTE:NOTE:NOTE:NOTE:NOTE:NOTE:NOTE:NOTE:NOTE:NOTE:NOTE:NOTE:NOTE:NOTE

Also, run the old version of urlmon once before running the new version. I have tried my hardest to avoid any false positives during the upgrade process, but I can't be sure. By running the old version immediately before the new, any modifications reported by the old version are real ones, and any reported by the new version are false ones (unless a URL got changed in the time it takes for the new one to run after the old one. Unlikely, but possible.)

If you are using urlmon for the first time, please pardon my shouting. None of this applies to you :)

1. Description

urlmon was written by yours truly after I became interested in certain episodic web sites (kind of like the old radio serials that played back in the day...)

I got tired of constantly checking up on them, to see if the next issue had come out. I don't like to do too much web-surfing, as there are more important things in my life (like writing neat perl scripts). So I wrote urlmon.

urlmon makes a connection to a web site and records the last_modified time for that URL. Upon subsequent calls, it will check the URL again, this time comparing the information to the previously recorded time. (Note that if the subsequent time is older than (less than) the first, urlmon will still assume that the URL has been updated. I figured I'd play it safe.) Since the last_modified data is not required by HTTP (it's optional), urlmon will take an MD5 checksum of the content when that data is missing. This is actually more accurate, as time stamps can be faked or inaccurate. According to the mathematicians, changing a string without changing its MD5 signature is HARD, and to do so would require drastic change to the string. Therefore, there is an option to force checksum information to be used instead of time stamps. But this is computationally more intensive.
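As a rough illustration of the checksum idea (not urlmon's own code, which is Perl and uses the MD5 module), here is a minimal shell sketch. It assumes the md5sum utility from GNU coreutils is installed; the function name content_checksum is made up for this example.

```shell
#!/bin/sh
# Illustrative only: urlmon does this in Perl with the MD5 module.
# Assumes md5sum (GNU coreutils) is on the PATH.

# content_checksum: read data on stdin, print its MD5 digest as hex.
content_checksum() {
    md5sum | cut -d ' ' -f 1
}

old=$(printf 'Episode 12 is out\n' | content_checksum)
new=$(printf 'Episode 13 is out\n' | content_checksum)

if [ "$old" != "$new" ]; then
    echo "content changed"
else
    echo "content unchanged"
fi
```

Two fetches of the same content give the same digest, so any difference between the stored and the fresh digest is taken to mean that the content changed.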

New with version 3.0 and up, you can specify filters which can be applied to the contents of the URLs, so that a certain amount of formatting (a lot of formatting, if you are crafty) can be done to it before the checksum is made. In this way you can remove rotating advertisements and other dynamic data that would otherwise cause the checksum to report that the URL has changed. (Because it has, although it is likely nothing significant has been added to the page, especially if it's an advertisement.) See the file FILTERS.txt for further explanation.
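The effect of filtering before checksumming can be sketched in shell. This is only an analogy: the real filters are the ones described in FILTERS.txt, the "<!-- ad -->" marker and the function name filtered_checksum are invented for this example, and md5sum is assumed to be available.

```shell
#!/bin/sh
# Illustrative only: the real filters are described in FILTERS.txt.
# Assumes md5sum is available; the "<!-- ad -->" marker is made up.

# filtered_checksum: drop lines containing the marker, then checksum the rest.
filtered_checksum() {
    grep -v '<!-- ad -->' | md5sum | cut -d ' ' -f 1
}

monday=$(printf '<p>Issue 7</p>\n<!-- ad --> Buy widgets\n' | filtered_checksum)
tuesday=$(printf '<p>Issue 7</p>\n<!-- ad --> Buy gadgets\n' | filtered_checksum)

if [ "$monday" = "$tuesday" ]; then
    echo "only the advertisement changed"
fi
```

Because the advertisement line is removed before the digest is computed, the two versions of the page compare as equal.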

If possible, urlmon will request only the headers (HEAD) from each URL and use them to decide if the URL has changed. It will look for last_modified info. If it is there, urlmon can tell if the URL has changed. If there is no last_modified info in the headers (or if '-s' is supplied on the command line, which forces the use of checksums, or if you are using a content-filter), it will request all the data (GET) and compute a checksum. This saves bandwidth by requesting as little as possible. However, if urlmon has a checksum in its database for that URL, which indicates that previously no last_modified info was available (or that it uses a content-filter), it will request all the data (GET) at the outset, to minimize the number of network connections.
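One reading of that decision, written as a small shell function. The function name and its yes/no argument convention are hypothetical, and it only picks the first request: a HEAD that comes back without last_modified info would still have to be followed by a GET. Whether urlmon really skips the HEAD when '-s' or a filter is in effect is an assumption here, not something the text above settles.

```shell
#!/bin/sh
# Illustrative only; names and argument conventions are invented here.
# choose_request FORCE_CHECKSUM HAS_FILTER STORED_KIND
#   FORCE_CHECKSUM  yes if -s was given, else no
#   HAS_FILTER      yes if a content filter applies to the URL, else no
#   STORED_KIND     what the database holds: timestamp, checksum, or none
# Prints GET when the body is needed for a checksum, otherwise HEAD.
choose_request() {
    if [ "$1" = yes ] || [ "$2" = yes ] || [ "$3" = checksum ]; then
        echo GET
    else
        echo HEAD
    fi
}

choose_request no no timestamp    # prints HEAD
choose_request yes no timestamp   # prints GET
choose_request no no checksum     # prints GET
```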

urlmon is able to do similar monitoring of FTP sites, because it uses the LWP module, which provides a more or less uniform interface to HTTP, FTP, and NNTP. (urlmon won't do NNTP yet, and probably won't, unless someone sends me a patch :) urlmon can also check local files by using the file:/path/to/file or file://localhost/path/to/file URL format.

2. Requirements

urlmon requires perl5 or greater (but what could be greater than Perl?), the MD5 perl module, and the LWP (Library for WWW access in Perl, aka libwww), which in turn requires the MIME-Base64 (in the MIME directory), HTML-Parser (in the HTML directory), MD5 (in the MD5 directory), and libnet (in the NET directory) modules. These can all be found at any CPAN archive (check out the modules section of http://www.perl.com/perl/CPAN/ for the mirror nearest you!). It also needs Getopt.pm and ctime.pl, but those should have come with your perl distribution.

See the file MODULES.txt for an in-depth discussion of how to get and install the necessary Perl Modules.

You'll need to install these somewhere so that your script can 'use' or 'require' them (see the perl man pages for '@INC' for more info). However, the installation scripts that accompany the perl modules should take care of this for you.

And of course, you'll need some type of networking set up so that you can access the web servers you want to keep tabs on.

Finally, you'll need some method to invoke urlmon. It can be at the command line, but it is useful to make crontab entries (do a 'man crontab' on most Unix variants, or 'man cron' if that doesn't work) for urlmon, and have the results sent to you via email. See some sample scripts in section 5.

3. Usage

Like any good UNIX command line utility, urlmon has quite a few options. First, here's a summary of its options. What follows is a better explanation of the uses of the command-line options.

Invoking 'urlmon -h' yields this:

    Usage: urlmon [-Cdcso] [-P http://my.proxy.org] [-f urlmonrcfile]
                  [-t timeout] [-F procs] [ <URL> | -l | -h | -p | -r <URL> ]

      <URL>      is one or more space-delimited URL's to monitor (or remove).
      -l         use the URLs in last_modified database (.urlmonrc) as
                 targets to check. (Also uses URLs on command line.)
      -h         print this stuff.
      -r         remove the URLs from database. Silent if URL is not in
                 database.
      -s         use only checksums to determine if URLs have been modified.
      -d         run in debug mode.
      -c         curt output. Print nothing but the changed or new URLs.
                 Shuts off -d. Prints errors on STDERR.
      -p         print out contents of the last_modified database (.urlmonrc).
      -f <file>  use supplied file as last_modified database.
      -C         assume that argument(s) will be url/comment pair(s).
      -P         specifies a proxy server to use for ftp, gopher, and http
                 connections.
      -t         make 'timeout' the timeout length for all network
                 connections. Defaults to 180 seconds.
      -F         use 'procs' number of processes to monitor in parallel.
                 Defaults to 1.
      -o         save all new files to the current directory.

    Notes:
      For ftp and http connections, the following syntax is allowed for
      authenticated sessions:

          ftp://username:passwd@host.domain.org/
          http://username:passwd@host.domain.org/

      The following environmental variables will be searched for proxy
      servers:

          http_proxy, gopher_proxy, wais_proxy, and no_proxy

a. Basic usage.

The normal invocation is along the lines of

urlmon http://url.one.foo http://url.two.foo ...

or

echo http://url.one.foo | urlmon

You should only have one URL per line that goes into the standard input of urlmon.

These will cause urlmon to look up each URL and generate checkstamps. A checkstamp is either a timestamp or a checksum, both of which can be used to determine when a file was last updated. It will then look in the database file and see if the URLs are there. If not (i.e. urlmon has never been invoked on the URLs before), it will report that they are new (unless -c is given, see below) and add them to the database. If they are in the database, the newly generated checkstamp and the checkstamp recorded in the database will be compared. If they are different, urlmon reports that the URL has changed. urlmon will also report any errors it encounters.
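The new/changed/unchanged decision can be sketched in shell as follows. The "database" here is a hypothetical plain file of "URL checkstamp" lines, which is not the real urlmonrc format (see URLMONRC.txt for that), and classify_url is a made-up name. The mktemp utility is assumed to be available.

```shell
#!/bin/sh
# Illustrative only: the real urlmonrc format is described in URLMONRC.txt.
# Here the "database" is a plain file of "URL checkstamp" lines.

# classify_url DBFILE URL STAMP
# Prints "new", "changed", or "unchanged" for the URL.
classify_url() {
    stored=$(awk -v url="$2" '$1 == url { print $2; exit }' "$1")
    if [ -z "$stored" ]; then
        echo new
    elif [ "$stored" != "$3" ]; then
        echo changed        # any difference counts, even an older timestamp
    else
        echo unchanged
    fi
}

db=$(mktemp)
printf '%s\n' 'http://url.one.foo 901500000' > "$db"

classify_url "$db" http://url.one.foo 901500000   # prints unchanged
classify_url "$db" http://url.one.foo 901400000   # prints changed
classify_url "$db" http://url.two.foo 901500000   # prints new
rm -f "$db"
```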

(Throughout this document, I use the word "checksum" to mean the result of applying any content-based filter, not just the checksum filter. See FILTERS.txt for more information.)

Invoking 'urlmon -r URLs' will remove the URL's from the database (and will be silent if any are not present in the database. I know, I'm just lazy!). And 'urlmon -rl' WILL delete the database, so don't do it unless you mean it.

b. Dealing with the URL database.

See the file URLMONRC.txt for a full description of the URL database.

(The following also appears in the file URLMONRC.txt, but since it's pretty important, it should be in this, the official "README".)

Invoking urlmon on the urlmonrc database

        Invoking urlmon as listed in the section entitled 'Basic usage'
        of the README.txt file is a good way (well, the only way :) to
        add entries to the database.  But the power comes with the '-l'
        switch. This will cause urlmon to read in the URLs in its
        database file and treat them as though they were listed on the
        command line a la 'Basic usage'. In its simplest, the script

                urlmon -l | mail userid@host.foo

        executed regularly will keep 'userid' aware of the status of the
        URLs in his or her URL database.  However, it will also cause
        mail to be sent even if there are no URLs changed, so something
        slightly more complicated is needed. (See section 5 for such a
        script.) 

c. Scripting

I debated putting an 'executable' flag in urlmon, which would be set to some application that would be executed for every URL that was updated or added. But then I realized that it would be more useful to make urlmon conducive to scripting, rather than making it try to do everything itself. Therefore, giving it a '-c' option will cause it to print nothing but the URLs that have changed or are new, one URL per line. Any errors are printed on STDERR. This option unsets '-d', described next.
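Since '-c' prints one URL per line, a script can consume the output with an ordinary read loop. The sketch below is only an illustration: report_changes is a made-up name, and a printf stands in for the real "urlmon -cl" so that the example runs without urlmon installed.

```shell
#!/bin/sh
# Illustrative only: in real use the input would come from "urlmon -cl".
# report_changes: read one URL per line on stdin, print a line for each,
# and return 1 if there was nothing to report.
report_changes() {
    count=0
    while IFS= read -r url; do
        [ -n "$url" ] || continue
        echo "updated: $url"
        count=$((count + 1))
    done
    [ "$count" -gt 0 ]
}

# Stand-in for: urlmon -cl | report_changes
printf '%s\n' http://url.one.foo http://url.two.foo | report_changes
```

The non-zero return status on empty input lets a calling script skip sending mail when nothing has changed.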

d. Debug info

An argument of '-d' will cause urlmon to print out what it is doing, every step of the way. This is useful if it's not behaving the way you expect. The messages it prints are (in my subjective opinion) fairly descriptive, but a perusal of the source code will bring further enlightenment (as perusal of source code usually does).

If you have any bugs to report, send along an example of the bug with the debugging turned on.

e. Fallibility

As explained before, timestamp information provided by web servers could be inaccurate or, worse, faked (although why anyone would want to I don't know). An option of '-s' will cause urlmon to ignore timestamp information and generate and store checksums. When urlmon compares its database info to the new information from the web server, if the web server passed last_modified info but the database has a checksum, urlmon will ignore the last_modified info and do a checksum comparison. This way someone who wants to always be accurate can specify '-s' once for every URL, and always get accurate information.

I've had one weird case where, when doing HEAD requests, the web server (one of Netscape's, I think) wouldn't send the last_modified time, but it would when it received a GET request. In this case, urlmon will decide that since there is no last_modified time when it does its initial (HEAD) request, it must use checksums (and do GET requests). I'm not going to adapt urlmon for this case as I think it is just too stupid.

Using filters can cause a bit of confusion at first. When applying a filter to a URL for the first time, even if the file hasn't changed, urlmon will report that it has. Also, if you change the type of filter you use (see FILTERS.txt), it is very likely that urlmon will think there is a change.

f. Comments in urlmonrc file

See the file URLMONRC.txt

g. Proxying

urlmon now utilizes proxying. It does so in two ways. You can either set the following environmental variables like so (from the LWP::UserAgent man page, which is utilized by urlmon):

         gopher_proxy=http://proxy.my.place/
         wais_proxy=http://proxy.my.place/
         no_proxy="my.place"
         export gopher_proxy wais_proxy no_proxy

       Csh or tcsh users should use the setenv command to define
       these environment variables.

(The man page doesn't mention this, but http_proxy seems to work, too.)

Note that if these environmental shell variables are set, they will be used automatically.

The other way is to explicitly specify the proxy server with the '-P' option, like this:

urlmon -P http://proxy.my.place/ http://url.to.monitor ...

This will override the variables set in the environment. Note also that this proxy will be used for all ftp, http, and gopher connections.

h. Timeouts

The default amount of time urlmon will wait on a connection is 3 minutes. You can change this by giving a '-t n' command-line switch, where 'n' is the time in seconds.

i. Parallel URL monitoring

urlmon can now monitor URLs in parallel by, on start up, forking off a certain number of copies of itself. Each copy will monitor a specific portion of all the URLs specified for monitoring. The number of copies to be made is specified by the '-F n' flag, where n will be the number of copies. Each copy will receive approximately 1/n of the URLs to be monitored. The default value if no '-F' is given is to have one copy running. (Actually, no copies are forked, and the original instance does the monitoring. In this way, the behaviour is exactly the same as for previous versions.)
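To picture what "approximately 1/n of the URLs" means, here is a shell sketch that deals URLs out round-robin. This is an assumption made for illustration: urlmon's own way of dividing the work may differ, and split_urls is a made-up name.

```shell
#!/bin/sh
# Illustrative only: urlmon's own way of dividing the work may differ.
# split_urls N K: read URLs on stdin and print the share for copy K
# (numbered 0 .. N-1), dealing them out round-robin.
split_urls() {
    awk -v n="$1" -v k="$2" '(NR - 1) % n == k'
}

urls='http://a.foo
http://b.foo
http://c.foo
http://d.foo
http://e.foo'

# With '-F 2' in mind: copy 0 gets three URLs, copy 1 gets two.
printf '%s\n' "$urls" | split_urls 2 0
printf '%s\n' "$urls" | split_urls 2 1
```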

Experimentation is needed to see what number of parallel copies is best. The more copies, the faster the entire database will be processed. However, you don't want to occupy all the resources of the system (do you?), and having a copy for each URL to be monitored may be less efficient than you might think: that many processes take time and resources, and since some URLs may time out, you'll end up waiting for the timeouts anyway.

New with version 4.0, I have optional support for SysV IPC (shared memory). This avoids having to have tmp files, which was ugly and a potential security problem. It is brand new and there are likely some bugs in it, especially in the code that sends the modification data from the child processes to the parent. To use it, search the body of urlmon for the line that reads:

$sysv = 0;

and change it to

$sysv = 1;

You also need the IPC::SysV perl module, available on CPAN (http://www.perl.com/CPAN/).

With SysV IPC off, urlmon will behave just as it used to, as described in the following paragraph.

Each copy of the child monitoring process writes out a file with the results of the monitoring it did. The original parent process then collects the new data, reports, deletes these temporary files, and then writes out an updated urlmonrc file. The temp filename is defined in the variable "$fork_file" (actually, there is one file for each copy, and the files are named by taking the string in "$fork_file" and tacking on a dot (".") and then the process id number of that copy). If anyone wants this settable on the command line, let me know.
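The temporary-file scheme can be imitated in shell. This is a stand-in only: the children here just echo a line per URL instead of checking anything, run_shares is a made-up name, the shell variable fork_file merely mirrors the role of "$fork_file" in urlmon, and mktemp is assumed to be available.

```shell
#!/bin/sh
# Illustrative only: a shell stand-in for the fork-and-collect scheme.
# run_shares SHARE...: start one child per SHARE (a space-separated URL list),
# let each write its results to "$fork_file.<child pid>", then gather them.
run_shares() {
    workdir=$(mktemp -d) || return 1
    fork_file="$workdir/fork"
    for share in "$@"; do
        # A separate sh process, so $$ inside it is the child's own pid.
        sh -c 'for u in $2; do echo "checked $u"; done > "$1.$$"' \
            sh "$fork_file" "$share" &
    done
    wait                        # parent waits for every child
    cat "$fork_file".*          # parent collects the results...
    rm -rf "$workdir"           # ...and deletes the temporary files
}

run_shares "http://a.foo http://b.foo" "http://c.foo"
```

The order in which the children's results appear depends on their process ids, so nothing should rely on it.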

When perl gets threads I'll definitely take advantage of them here!

There is an experimental package that works with LWP to enable Parallel HTTP connections. If it also does FTP, I will rewrite urlmon to use it. This will make urlmon simpler (of course, LWP will get correspondingly more complicated). This new module is called ParallelUserAgent, and it can be found in the LWP portion of CPAN.

j. Authentication: Username/Passwd pairs

Although never mentioned, urlmon could always handle URLs of the form:

ftp://username:passwd@host.domain.org/pub/dir/file

to allow you to check non-anonymous sites. The LWP (libwww-perl) library did all this. However, the analogous form for http:

http://username:passwd@host.domain.org/pub/dir/file.html

didn't work, until now. Actually, there is partial support in LWP, in that this URL will be correctly parsed, and an HTTP connection to host.domain.org will be made, but the username and password won't be used. So, I hacked at it enough to ensure that the username and password will be. I wonder if LWP behaves in this way because the HTTP protocol prohibits (or at least doesn't define) this behaviour.

(I believe LWP supports this correctly now, but I have not yet looked in to it. If so, urlmon will become even simpler (from my point of view, anyway).)
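For readers unfamiliar with this URL form, the following shell sketch pulls the pieces apart. It is not how urlmon or LWP does it; split_auth_url is a made-up name, and the sketch ignores corner cases such as passwords containing '@' or '/'.

```shell
#!/bin/sh
# Illustrative only: urlmon leaves URL handling to LWP.
# split_auth_url URL: print "user password host" for a URL of the form
# scheme://user:password@host/path, or return 1 if it carries no credentials.
split_auth_url() {
    rest=${1#*://}          # drop the scheme
    authority=${rest%%/*}   # keep everything before the first slash
    case $authority in
        *:*@*) ;;
        *) return 1 ;;
    esac
    userinfo=${authority%@*}
    host=${authority##*@}
    echo "${userinfo%%:*} ${userinfo#*:} $host"
}

split_auth_url http://username:passwd@host.domain.org/pub/dir/file.html
# prints: username passwd host.domain.org
```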

k. Saving changed files

The '-o' option...

I'm still debating removing this option. With it, all new/updated URLs will be saved to the current directory. The same thing could be done by having the output of urlmon used as input to some automated web program (like wget) to download. The problem is that this requires two network requests (and potentially two transfers of the same data) for each URL, which isn't efficient.

The reasons I'm not comfortable with it are: 1) this is an archival function, where urlmon is for monitoring, 2) I've got a habit of putting too many features in, to the point where things get really bogged down and confusing, and 3) it's not clear what urlmon should do with the downloaded files, i.e. where should they go and what should they get named. And I'm too lazy to write the sanity checking code (make sure the directory exists that they are to be written to, make sure the file names are legal (or at least convenient) and unique, etc.)

The new format of the urlmonrc file is easily extendible (by a perl programmer, anyways), so specifying a directory for the file to be written to shouldn't be too hard. Hmm. I kind of like that idea. Maybe I'll keep this functionality after all, just to show off the benefit of the new database format. Does someone want to write the save code?
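The sanity checks mentioned above (directory exists, names legal and unique) might look roughly like this in shell. Both function names are hypothetical, and nothing here reflects how '-o' actually names its files.

```shell
#!/bin/sh
# Illustrative only: '-o' may name its files quite differently.
# safe_name URL: turn a URL into a file name using only letters, digits,
# dots, underscores, and hyphens.
safe_name() {
    printf '%s\n' "$1" | sed -e 's|^[a-z]*://||' -e 's|[^A-Za-z0-9._-]|_|g'
}

# unique_path DIR NAME: print DIR/NAME, adding .1, .2, ... if it is taken.
unique_path() {
    [ -d "$1" ] || return 1          # the target directory must exist
    candidate=$1/$2
    n=0
    while [ -e "$candidate" ]; do
        n=$((n + 1))
        candidate=$1/$2.$n
    done
    echo "$candidate"
}

safe_name 'http://url.one.foo/serial/issue 7.html'
# prints: url.one.foo_serial_issue_7.html
```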

4. Idiosyncrasies

All the things that I originally considered idiosyncrasies back in the original release of urlmon have been removed. Anything else that could be considered an idiosyncrasy has either been documented elsewhere or is a bug :)

5. Miscellaneous

MAIL ME! If you use urlmon, PLEASE PLEASE PLEASE PLEASE let me know. Send email to jdimpson@acm.org to let me know. This is the first program I've released to the world at large, and I'd be thrilled to know if anyone finds it remotely useful. Please, send me questions, comments, and suggestions.

Oh, here's the sample cron script that I run twice a day, to alert me if any URLs have changed. After it is another script sent in by Peter Mardahl. It formats the output into an HTML file so that you can browse new URLs right on the web. Thanks Peter!

#!/bin/sh

/usr/local/bin/urlmon -cl -F 5 > /tmp/urlmon.$$ # generate a curt listing

if [ -s /tmp/urlmon.$$ ]      # check to see if we have any changed URLs
then                          # send mail if we do

(echo "To: your@email.address";\
echo "From: another@email.address";\
echo "Subject: New or updated URLs";\
echo "";\
cat /tmp/urlmon.$$) | /usr/lib/sendmail -t
fi
rm /tmp/urlmon.$$

Date: Mon, 3 Aug 1998 10:51:55 -0700
From: Peter Mardahl
To: jdimpson@acm.org
Subject: I like your program, urlmon, thanks for writing it.

It works just fine for me.

The following shellscript will put the links into a web page instead of mailing it, a thing I find convenient:

(The web page is /accounts/peterm/public_html/changed_urls.html)

I run the shellscript in the crontab... (Actually, I run a different shellscript, one which won't splatter an existing changed_urls.html: I splatter that myself so that I don't miss changes if I don't look for n days.)

#!/bin/csh
/bin/rm -f /accounts/peterm/public_html/changed_urls.html
touch /accounts/peterm/public_html/changed_urls.html
/usr/local/bin/urlmon -cl -F 5 > /tmp/urlmon.$$
if ( ! -z /tmp/urlmon.$$ ) then
awk '{ printf("<A HREF=\"%s\">%s</A>\n<br>\n",$1,$1);}' /tmp/urlmon.$$ >! /accounts/peterm/public_html/changed_urls.html
endif
chmod a+r /accounts/peterm/public_html/changed_urls.html
/bin/rm -f /tmp/urlmon.$$


Copyright 2006 Sourcefiles.org All rights reserved.