Inspired by the documentation legacy of Dr. Desai, I've created some
examples for how to use condor with IDL and Pyraf/IRAF.
Introduction
The concept behind condor is simple: take the processors that are
currently sitting unused and assign them jobs from a queue. In
practice, condor is like all free Linux tools--unintuitive syntax with
lousy documentation that makes it a pain to learn. But the
ability to run on a large number of processors at once makes it
worth learning. Hopefully these examples will
let people dive right into the condor pool without drowning.
The code
Everything you need is in condorexamples.tar.
Download and untar it somewhere and you should have working examples
of IDL and Pyraf condor jobs.
IDL example
The IDL_exmpls directory contains several files:
1) happy.pro
a very simple IDL procedure that will get called
with different arguments
2) con_script
an executable file that sets up the environment
variables that IDL needs, writes an IDL batch file, then executes the
batch file. I'm sure there is a more elegant way to do this. When
you figure it out, let me know (or just make a nice new example folder
to share).
3) example.cfg
the cfg file that you submit to condor. Note that
you need to modify this file to reflect the path where you put the
files. (If you submit it without modification, it will try to run in MY
directory, and I'm setting permissions so that won't happen again, Nate.)
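The three steps con_script is described as doing (set the environment, write a batch file, run IDL on it) can be sketched in Python. This is not the contents of con_script, just an illustration of the idea. The IDL install path, the batch file name pattern, and the assumption that happy takes a single integer argument are all hypothetical; adjust them to your setup.

```python
import os
import subprocess

# Assumption: where IDL lives on the pool machines. Change this for your site.
IDL_DIR = "/usr/local/rsi/idl"


def build_env(idl_dir=IDL_DIR):
    """Return a minimal environment for the job, since Condor supplies almost none."""
    return {
        "IDL_DIR": idl_dir,
        "PATH": os.pathsep.join([os.path.join(idl_dir, "bin"), "/usr/bin", "/bin"]),
        "HOME": os.environ.get("HOME", "/tmp"),
    }


def write_batch(job_id, batch_dir="."):
    """Write the IDL batch file for one job and return its path."""
    # Hypothetical file name pattern, chosen to match the *.idl.batch outputs.
    path = os.path.join(batch_dir, "job%d.idl.batch" % job_id)
    with open(path, "w") as f:
        # Assumption: happy takes one integer argument.
        f.write("happy, %d\n" % job_id)
        # Leave IDL when done so the condor job can finish.
        f.write("exit\n")
    return path


def run_job(job_id):
    """Write the batch file, then run IDL on it with the environment set by hand."""
    batch = write_batch(job_id)
    # Giving idl a file name on the command line runs that file as a batch file.
    return subprocess.call(["idl", batch], env=build_env())
```

Calling `run_job(3)` from the directory that holds happy.pro would write job3.idl.batch and run it; IDL searches the current directory for procedures, so no extra search path is set here.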
In order to try it out, modify the .cfg file, then at the command line:
>condor_submit example.cfg
Check on the progress with:
>condor_q
(try the -global flag for extra fun)
When things go bad, you can kill all of your condor jobs with:
>condor_rm -all
After condor runs the jobs, you should have files *.out, *.err, *.log,
*.idl.batch, and happy*.sav. Look at the .log, .err, and .out
files; they should be pretty easy to understand. The
happy*.sav files are IDL save files, which you can open in IDL with
the restore procedure.
Pyraf example
The Pyraf_exmpl directory has an example of a python executable that
loads up the Pyraf library to execute an IRAF routine. The files
are as follows:
1) *.fits
Two longslit spectra files I had lying around that
will be sky-subtracted with condor.
2) con_back.py
Python script that takes 2 arguments (input filename
and output filename), loads the Pyraf packages, sets the task
parameters, then executes the background sky subtraction.
3) back_wrap
the executable that sets up all the environment
variables. Note that it needs to be adjusted to point to YOUR
home directory instead of mine.
4) back.cfg
config file for condor. Again, this file needs
to be modified so that the paths are to your directories and not
mine. There is a line that forces condor to run these jobs
on good machines (memory >= 750). I have tested running on the
undergrad machines and it kind of works--the problem is they cannot yet
see /net/mega-1, and net/grads-1 is down right now. But in
general, I THINK it's better to run with as few requirements as
possible. There are some jobs that truly do need the fast P4s
with lots of memory to run. If you have a lot of jobs, and they
can run on the undergrad machines, do not set any memory requirements
and let them run over there. Some people have complained that if
they do not have much memory on their machine, it takes a long time for
condor jobs to clear off and let them start working. Screw
'em--if they want to be in the condor pool, they should go write a
grant and get a modern machine. The rest of us don't have time to dink
around in our code to exclude their one lousy machine so they can surf
for porn faster.
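To make the memory requirement concrete, here is a minimal Python sketch that writes out a submit file of the general kind described above, with the requirement line optional. It is not the actual back.cfg. The directory, the .fits names, and the output file names are made up for illustration; the submit commands themselves (universe, executable, initialdir, requirements, arguments, output, error, log, queue) are standard condor_submit commands, and Memory is the machine's memory in megabytes.

```python
import posixpath


def submit_description(workdir, jobs, min_memory_mb=None):
    """Build the text of a condor submit file for a list of (input, output) pairs."""
    lines = [
        "universe = vanilla",
        # Absolute path, so it does not depend on where condor_submit is run.
        "executable = " + posixpath.join(workdir, "back_wrap"),
        "initialdir = " + workdir,
        "log = back.log",
    ]
    # Only restrict the machines when the jobs truly need the memory.
    if min_memory_mb is not None:
        lines.append("requirements = (Memory >= %d)" % min_memory_mb)
    for i, (infile, outfile) in enumerate(jobs):
        lines += [
            "",
            "arguments = %s %s" % (infile, outfile),
            "output = back%d.out" % i,
            "error = back%d.err" % i,
            "queue",
        ]
    return "\n".join(lines) + "\n"


# Hypothetical directory and file names, for illustration only.
text = submit_description(
    "/home/you/Pyraf_exmpl",
    [("spec1.fits", "spec1_bk.fits"), ("spec2.fits", "spec2_bk.fits")],
    min_memory_mb=750,
)
```

Leaving out `min_memory_mb` drops the requirements line entirely, which is the "as few requirements as possible" case.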
To run, fix the paths in back_wrap and back.cfg and:
>condor_submit back.cfg
These jobs will take a while to run, but you can open the resulting
FITS files with ds9 at any point to view the progress. Note these
are close to the limit of what you would want to run without
checkpointing. I have no idea how you could make an IRAF job
checkpoint, but it's a good thing to keep in mind.
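For reference, a script of the kind con_back.py is described as (two arguments, load Pyraf, set parameters, run the sky subtraction) might look roughly like the sketch below. This is a guess at the shape, not the real file. It assumes the IRAF task is background in noao.twodspec.longslit, and every parameter value shown (axis, function, order) is a placeholder you would need to choose for your own data.

```python
import sys


def parse_args(argv):
    """Return (input, output) filenames from the command-line arguments."""
    if len(argv) != 2:
        raise ValueError("expected INPUT and OUTPUT filenames")
    return argv[0], argv[1]


def subtract_sky(infile, outfile):
    """Run the IRAF background task on one longslit image via Pyraf."""
    # Imported here so the file still loads on machines without Pyraf.
    from pyraf import iraf

    # Load the package chain that holds the task, without printing task lists.
    iraf.noao(_doprint=0)
    iraf.twodspec(_doprint=0)
    iraf.longslit(_doprint=0)

    # Placeholder parameter values; set these for your own spectra.
    iraf.background(
        input=infile,
        output=outfile,
        axis=2,
        interactive="no",  # no graphics terminal on a condor machine
        function="chebyshev",
        order=2,
    )


if __name__ == "__main__":
    if len(sys.argv) == 3:
        subtract_sky(*parse_args(sys.argv[1:]))
    else:
        print("usage: con_back_sketch.py INPUT.fits OUTPUT.fits")
```

Setting interactive to "no" matters here, since a condor job has nobody sitting at a graphics window to answer prompts.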
Common Pitfalls
Here are some problems I've encountered in the past. Of course I
don't expect this to prevent other people from making the same
mistakes, but it might help speed along some debugging.
- The executable file has to have its permissions set to be
executable.
- All the files you're working with need to be visible to all the
machines on the network (in the olden days, home directories were not
always visible everywhere). It would be good if all the code and
everything it needed was in one place--but I have IDL and IRAF code in
my home directory, net/grads-1, mega, etc. So do as I say, yadda yadda.
- The undergrad machines are in the condor pool, but do not yet
recognize paths that start with /astro. The easy way around this is to
just lop off the /astro from all your paths (as done in the example code
here). That way, your code can run on either an undergrad machine or a
grad student machine. They also don't seem to see mega-1/ yet;
hopefully they will soon.
- Set up environment variables. Condor doesn't have any environment
variables set by default, so for IDL and Pyraf they need to be set
manually.
- Be careful with running lots of IDL jobs--the department has a
limited number of IDL licenses. If you tell condor to run 100 IDL
processes, it will try and fail because we don't have 100 licenses.
Look at the Pyraf example for a nice memory requirement line that
should keep the number of machines you use at once low.
- Always check the progress of your jobs with condor_q and look at the
error and log files. You don't want to waste cycles or lock up other
resources.
- Try to make sure the jobs can complete in a reasonable time. If a job
takes 10 hours to run, odds are good it will get booted off the machine
before it completes. Then it will start over on another processor, run
for a few hours, get booted again, and you end up using a lot of
processor time without getting any results. OR
- Make code that can checkpoint. I can think of a few ways to set up IDL
code such that it essentially checkpoints (just use IDL's save and
restore procedures in clever ways)--but it would be nice if someone
wrote a robust procedure that used Condor's own checkpointing features
as well. I have no idea how one would checkpoint with Pyraf, but most
IRAF tasks should not take more than an hour anyway.
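The do-it-yourself checkpointing idea in the last item can be sketched as follows. This is a Python analogue of the IDL save/restore trick, not IDL code and not Condor's own checkpointing: the job periodically saves its loop index and state to a file, and on startup it restores them if the file exists. The function and file names are hypothetical.

```python
import os
import pickle


def run_with_checkpoints(n_steps, step, state, path, every=10):
    """Run step(i, state) for i < n_steps, saving progress to path as it goes."""
    start = 0
    # If an earlier run was booted off a machine, pick up where it stopped.
    if os.path.exists(path):
        with open(path, "rb") as f:
            start, state = pickle.load(f)
    for i in range(start, n_steps):
        state = step(i, state)
        if (i + 1) % every == 0 or i + 1 == n_steps:
            # Write to a temporary file first, then rename, so getting
            # booted mid-write cannot leave a corrupt checkpoint behind.
            tmp = path + ".tmp"
            with open(tmp, "wb") as f:
                pickle.dump((i + 1, state), f)
            os.replace(tmp, path)
    return state
```

A job restarted on another processor then loses at most `every` steps of work instead of starting from zero. The checkpoint file has to live somewhere every machine in the pool can see, for the same reason as the rest of your files.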
PY, 7/05