This is a test version of an SGE and LAM MPI integration package. It has only been minimally tested with MPI codes, and full LAM functionality had not yet been tested as of 7/22/03. This release is intended to get the code into the hands of users for testing and early feedback.
This code was tested against SGE 5.3p2 and LAM 6.5.9 on Solaris 8.
Download Here: sge-lam.tar
For updates and info regarding this code, email: christopher.duncan@sun.com
This code is provided AS-IS with no implied warranty or support.
README.sge-lam | Directions and info
qrsh-lam | qrsh wrapper used for remote lamboot and for local lamd
sge-lam | SGE-compatible lamboot and lamhalt for use in start_proc_args and stop_proc_args for an SGE PE
Install LAM MPI and SGE. This code was tested against SGE 5.3p2 and LAM 6.5.9 and should work with later releases. It may work with earlier versions of SGE and LAM.
NOTE: make sure your shell startup environment has both the LAM and SGE bin dirs in your path.
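One quick way to check the note above is to ask the shell where it finds one command from each bin dir. The following is a minimal sketch for a POSIX sh rather than csh. check_on_path is a hypothetical helper name, and lamboot and qrsh are used only as representative commands from the LAM and SGE bin dirs.

```shell
# Hypothetical helper: report whether each named command is found on PATH.
# Returns non-zero if any of them is missing.
check_on_path() {
    missing=0
    for cmd in "$@"; do
        if command -v "$cmd" >/dev/null 2>&1; then
            echo "found:   $cmd"
        else
            echo "missing: $cmd"
            missing=1
        fi
    done
    return "$missing"
}

# lamboot lives in the LAM bin dir, qrsh in the SGE bin dir.
check_on_path lamboot qrsh || echo "Fix your shell startup files before continuing."
```

Running the same check from a non-interactive shell on a remote node is the more telling test, since that is the environment the remote lamboot will see.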
Install the 2 Perl executables, qrsh-lam and sge-lam, inside the LAM installation bin dir. Make sure they are executable.
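As a rough illustration of this step, the copy and the permission change can be done together. This is only a sketch in POSIX sh: install_sge_lam is a hypothetical helper, and the source directory is assumed to be wherever sge-lam.tar was unpacked.

```shell
# Hypothetical helper: copy the two Perl scripts into the LAM bin dir
# and mark them executable.
#   $1 = directory where sge-lam.tar was unpacked (assumed)
#   $2 = LAM installation bin dir
install_sge_lam() {
    src=$1
    dest=$2
    for f in qrsh-lam sge-lam; do
        cp "$src/$f" "$dest/$f" || return 1
        chmod 755 "$dest/$f" || return 1
    done
}

# Example call, assuming the tarball was unpacked in the current directory:
# install_sge_lam . /usr/local/lam/bin
```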
Modify the variables LAMBINDIR and SGEBINDIR in sge-lam and qrsh-lam to fit your site setup. The values will depend on your installation of SGE and LAM.
Create an SGE PE that can be used to submit LAM jobs. The following is an example assuming the scripts exist in /usr/local/lam/bin. You should replace the queue_list and slots with your site-specific values.
% qconf -sp lammpi
pe_name           lammpi
queue_list        hpc-v880.q polarbear.q
slots             6
user_lists        NONE
xuser_lists       NONE
start_proc_args   /usr/local/lam/bin/sge-lam start
stop_proc_args    /usr/local/lam/bin/sge-lam stop
allocation_rule   $fill_up
control_slaves    TRUE
job_is_first_task TRUE
NOTE: It is probably easiest to use the qmon GUI to create the PE.
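For sites that prefer the command line to qmon, one possible approach is to write the PE definition to a file and load it with qconf -Ap, which adds a PE from a file. The sketch below only generates the file. write_lammpi_pe and the file name lammpi.pe are hypothetical, and the queue names and slot count are the example values from above, not recommendations.

```shell
# Hypothetical helper: write a PE definition like the example above to a file.
#   $1 = output file, $2 = LAM bin dir, $3 = queue list, $4 = slots
write_lammpi_pe() {
    cat > "$1" <<EOF
pe_name           lammpi
queue_list        $3
slots             $4
user_lists        NONE
xuser_lists       NONE
start_proc_args   $2/sge-lam start
stop_proc_args    $2/sge-lam stop
allocation_rule   \$fill_up
control_slaves    TRUE
job_is_first_task TRUE
EOF
}

write_lammpi_pe lammpi.pe /usr/local/lam/bin "hpc-v880.q polarbear.q" 6
# Then, as an SGE manager (not run here):
# qconf -Ap lammpi.pe
```

Running qconf -sp lammpi afterwards should show the same fields as the listing above.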
Modify your LAM boot schema to use qrsh-lam. This is normally in the file $LAMHOME/etc/lam-conf.lam. You need to give a path to qrsh-lam and lamd for the boot schema. Normally this would be something like:
lamd $inet_topo $debug
Instead, change this to (assuming your LAMBINDIR is /usr/local/lam/bin):
/usr/local/lam/bin/qrsh-lam local /usr/local/lam/bin/lamd $inet_topo $debug
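The boot schema change can also be previewed with sed before touching the real file. This is a sketch under the assumption that the lamd line starts at the beginning of the line, exactly as in the example above. patch_boot_schema is a hypothetical helper, and it prints the result rather than editing in place.

```shell
# Hypothetical helper: print a boot schema with the lamd line rewritten
# to go through qrsh-lam. The original file is not modified.
#   $1 = boot schema file, $2 = LAM bin dir
patch_boot_schema() {
    sed "s|^lamd |$2/qrsh-lam local $2/lamd |" "$1"
}

# Example: review the output before replacing the real file.
# patch_boot_schema "$LAMHOME/etc/lam-conf.lam" /usr/local/lam/bin > lam-conf.lam.new
```

Keeping a copy of the original lam-conf.lam makes it easy to revert if lamboot stops working outside SGE.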
With this PE setup users can submit jobs as normal and do not need to lamboot on their own. Users need only call mpirun for their MPI programs. Here is an example job:
% cat lamjob.csh
#$ -cwd
set path=(/usr/local/lam/bin $path)
echo "Starting my LAM MPI job"
mpirun C conn-60
echo "LAM MPI job done"
Using the C arg to mpirun is the easiest way to create a spanning MPI job that uses all the allocated slots for MPI.
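Submitting such a job means requesting the PE with qsub's -pe option. The sketch below just builds the command line so it can be inspected first. lam_qsub_cmd is a hypothetical helper, lammpi is the example PE name from above, and the slot count of 6 is only illustrative.

```shell
# Hypothetical helper: build the qsub command line for a LAM job.
#   $1 = number of slots to request, $2 = job script
lam_qsub_cmd() {
    echo "qsub -pe lammpi $1 $2"
}

lam_qsub_cmd 6 lamjob.csh
# To actually submit on a host with SGE installed:
# qsub -pe lammpi 6 lamjob.csh
```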
A single user running multiple LAM jobs at once on the same nodes will, at a minimum, have problems with accounting, and a job may be improperly halted when the first job exits. This limitation may be removed in future revisions by using LAM_MPI_SOCKET_SUFFIX.
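To illustrate the idea only (this is not something the released scripts do), a per-job suffix could be derived from the JOB_ID variable that SGE sets in the job environment. The sketch assumes LAM honors LAM_MPI_SOCKET_SUFFIX as suggested above, and that the same value would be visible to lamboot, mpirun and lamhalt. lam_socket_suffix and the sge- prefix are hypothetical.

```shell
# Hypothetical helper: derive a per-job suffix from SGE's JOB_ID.
# Falls back to the shell's PID when run outside SGE.
lam_socket_suffix() {
    echo "sge-${JOB_ID:-$$}"
}

LAM_MPI_SOCKET_SUFFIX=$(lam_socket_suffix)
export LAM_MPI_SOCKET_SUFFIX
echo "LAM_MPI_SOCKET_SUFFIX=$LAM_MPI_SOCKET_SUFFIX"
```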
There have been path issues with some shell setups that prevent lamboot from operating properly when qrsh'ing to remote nodes. This can normally be alleviated by ensuring your shell path includes the LAM bin dir.
Can we get rid of the lam-conf.lam changes by directly using hboot or something similar?
Modify qrsh-lam to be able to find its qrsh from SGE_ROOT (for use in heterogeneous clusters).
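One way this lookup might work is sketched below in POSIX sh, although qrsh-lam itself is Perl. It assumes the usual SGE layout, in which $SGE_ROOT/util/arch prints the local architecture name and binaries live under $SGE_ROOT/bin/<arch>/. find_qrsh is a hypothetical helper, not part of the package.

```shell
# Hypothetical helper: locate qrsh under SGE_ROOT for the local architecture.
# Assumes $SGE_ROOT/util/arch prints the architecture name and that binaries
# live in $SGE_ROOT/bin/<arch>/.
find_qrsh() {
    arch=$("$SGE_ROOT/util/arch") || return 1
    qrsh="$SGE_ROOT/bin/$arch/qrsh"
    [ -x "$qrsh" ] || return 1
    echo "$qrsh"
}

# Example:
# QRSH=$(find_qrsh) || echo "qrsh not found under $SGE_ROOT"
```

Because the architecture is resolved on the node where the script runs, each host in a mixed cluster would pick its own qrsh binary.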
Installer script to automate steps 1-6 after asking for LAMHOME and SGE_ROOT? Also ask for a PE.