esky - A slightly portable checkpointer
David Gibson
esky is a system for checkpointing and resuming Unix processes. That is, esky can take a running process and "freeze" its state to a checkpoint file. Later it can "thaw" the checkpoint, resuming the process from the point at which it was frozen. The resume can be performed after a reboot, or on another (sufficiently similar) machine.
esky works under either Linux 2.2 + glibc 2.1 or Solaris 2.6 (with its standard libc). gcc is required to compile esky. You will also need the GNU linker to build esky, but gcc need not be set up to use GNU ld by default. Most of esky is linked with the normal system linker, but GNU ld is invoked explicitly to link the static agent.
esky also requires glib-1.2, which can be obtained from ftp://ftp.gtk.org/pub/gtk/v1.2/glib-1.2.5.tar.gz (it's also included in many Linux distributions).
WARNING: Consider this to be experimental code, alpha level at best.
The directories are as follows:
doc/ Various documents about esky. These may or may not be up-to-date and/or accurate.
src/ The source code. The interesting bit.
test/ Various scraps of experimental code. Probably not very interesting.
Capabilities and limitations
esky can only freeze and thaw sufficiently well-behaved programs. Some of the major current limitations of esky are the following:
- Programs must be dynamically linked
- Programs may not use fork() or exec(), which also rules out things like system() and popen(); i.e. esky will only freeze a single process.
- Programs must be single threaded
- Files used by the checkpointed program must not be altered between checkpoint and resume time (it might be OK to modify files in some cases, see below)
This list is not exhaustive.
Some things that esky can handle (i.e. will save and restore) are the following:
- Shared libraries, either loaded automatically or explicitly with dlopen().
- mmap()s, both shared and private.
- Opened regular files and directories
- Opened devices that have no state
- Opened devices that have no state except for file pointer position.
- Opened doors (under Solaris) that have no state. The name service door seems to be an example.
- Current working directory
- Signal handlers
For more information see the files in the 'doc' subdirectory. Those documents are not guaranteed to be up-to-date or very accurate. For authoritative information, see the source code.
Building esky
Starting in the directory in which you found this README, use the following commands to build esky:
mkdir obj
cd obj
../src/configure
make
Any directory name of your choice can be substituted for 'obj'.
There are some options you might need to pass to configure:
--with-agent-ld=/path/to/gnu/ld must be used when GNU ld is not the normal system linker.
--with-glib-prefix=/path/to/glib is used when glib is not installed in the standard library search paths.
Also note that if you are using a dynamic version of glib, the glib shared libraries will need to be on your library search path when the configure script runs.
For example, to build esky for Solaris I use the commands:
mkdir obj-solaris
cd obj-solaris
export LD_LIBRARY_PATH=/data/dgibson/gnu-root/lib
../src/configure --with-agent-ld=/data/dgibson/gnu-root/bin/gnu-ld \
--with-glib-prefix=/data/dgibson/gnu-root
make
Installing esky
There is no 'make install' target. Given its experimental nature, I wouldn't recommend installing esky at all. Just manually place the binaries wherever is convenient. There are three files you need to worry about:
'esky' - The monitor program. Just put it somewhere it's convenient to run it from.
'esky_lib.so' - the shared library. This must be in your dynamic library path. Set the LD_LIBRARY_PATH environment variable to include the directory it's in.
'esky_agent.bin' - the static agent image. By default, esky looks for this file in the current directory. You can override this by giving its full pathname in the environment variable ESKY_AGENT.
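The environment setup described above can be sketched as follows. This is only an illustration: ESKY_DIR is a hypothetical shell variable introduced here, and its default value is an assumed location, not something esky itself knows about. Only LD_LIBRARY_PATH and ESKY_AGENT are names taken from the description above.

```shell
# ESKY_DIR is a hypothetical variable naming the directory that holds
# 'esky', 'esky_lib.so' and 'esky_agent.bin'. The default is an assumption.
ESKY_DIR=${ESKY_DIR:-$HOME/esky/obj}

# Let the dynamic linker find esky_lib.so, keeping any existing search path.
LD_LIBRARY_PATH=$ESKY_DIR${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
export LD_LIBRARY_PATH

# Tell esky where the static agent image is, so it does not have to be
# in the current directory.
ESKY_AGENT=$ESKY_DIR/esky_agent.bin
export ESKY_AGENT
```

With these set, the monitor can be run from any directory by giving its path, for instance "$ESKY_DIR/esky".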
Using esky
To checkpoint a process it must be run under the control of the esky monitor. To start a program under esky's supervision use the command:
esky start program-name [arguments]
To cause the process to be checkpointed, use 'kill' to send a SIGUSR1 to the 'esky' process (not the process to be checkpointed itself). A (large) file called 'checkpoint' will be created in the current directory.
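As a rough sketch, the checkpoint request can be wrapped in a small shell function. The function name esky_checkpoint is made up for this example and is not part of esky. It does nothing more than send SIGUSR1 to the PID it is given, which must be the PID of the 'esky' monitor.

```shell
# Hypothetical helper, not part of esky: ask the monitor to write a checkpoint.
# $1 is the PID of the 'esky' monitor process, NOT of the supervised program.
esky_checkpoint() {
    if [ -z "$1" ]; then
        echo "usage: esky_checkpoint <esky-monitor-pid>" >&2
        return 2
    fi
    kill -USR1 "$1"
}

# Example session ('./myprog' is a placeholder program name):
#   esky start ./myprog &
#   esky_checkpoint $!
```

In the example session, $! is the PID of the backgrounded 'esky start' command. That is the monitor's PID only on the assumption that the monitor keeps running as the process the shell launched. If in doubt, look up the PID of 'esky' with ps instead.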
To resume the process, run:
esky resume <program> <arguments>
The <arguments> to the program are not important here, but should make the process appear as expected in a ps listing.
Contact information
esky was written by David Gibson while working at the Australian National University in Canberra, Australia.
email: esky@gibson.dropbear.id.au
esky can be obtained from
ftp://cap.anu.edu.au/pub/dgibson/esky/
