Topic:
Tight Integration of the MPICH2 library into SGE.
Author:
Reuti, reuti__at__staff.uni-marburg.de; Philipps-University of Marburg, Germany
Version:
1.0 -- 2005-04-24 Initial release, comments and corrections are welcome
Configuration of SGE with qconf or the GUI
You should already know how to change settings in SGE, such as setting up or changing a queue definition or the entries in a PE configuration. Additional information about queues and parallel environments is available in the SGE man pages "queue_conf" and "sge_pe" (make sure the SGE man pages are included in your $MANPATH).
Target platform
This Howto targets MPICH2 version 1.0.1 with SGE 6.0 on Linux. It will most likely work in the same way on other operating systems, although some of the commands will then need slight modifications. It will not work this way for MPICH2 version 1.0, as some things were only adjusted in 1.0.1 to allow an easy Tight Integration.
MPICH2
MPICH2 is a library from the Argonne National Laboratory (http://www.anl.gov) implementing the MPI-2 standard. Before you start with the integration of MPICH2 into SGE, you should already be familiar with the operation of MPICH2 outside of SGE and know how to compile a parallel program using MPICH2.
Included setups and scripts
The archive supplied in [1] contains the necessary scripts for the smpd startup methods; for the gforker method no scripts are needed at all. It contains scripts and programs similar to those in the PVM and MPICH integration packages distributed with SGE. To install it for common use in the whole cluster, you may untar it in $SGE_ROOT, which creates the new directories $SGE_ROOT/mpich2_smpd and $SGE_ROOT/mpich2_smpd_rsh.
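For example, assuming the archive from [1] was downloaded to /tmp as mpich2-60.tgz (the download location is just an example; on Linux with GNU tar):

$ cd $SGE_ROOT
$ tar xzf /tmp/mpich2-60.tgz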
A short program is provided in [2], which will allow you to observe the correct distribution of the spawned tasks.
MPICH2 is a new implementation of the MPI-2 standard, created to supersede the widely used MPICH implementation. Besides implementing the MPI-2 standard, another goal was a faster startup. To give the user greater flexibility, there are (for now) three startup methods implemented:
- mpd: mpd is introduced by MPICH2 as the primary startup method. It is based on the scripting language Python and starts up a so-called ring of machines. Given a list of nodes, mpd will start daemons on the requested machines, which can then be used immediately for the execution of parallel programs inside this ring. This is convenient for interactive use of a parallel program, as the only thing that must be prepared is a list of the nodes to be used.
There is no way to convince MPICH2 not to create many new process groups on each machine. As a limitation to one process group per machine is essential for achieving a Tight Integration into SGE on all platforms, this startup method is not discussed in this Howto (although a Tight Integration is possible on platforms where the additional group ID is used to take control of all slave tasks).
- smpd: This startup method can be used in a daemon-based or a daemonless mode. In the daemon-based mode the daemons are not created on the nodes according to a node list automatically (as is done by the mpd startup method); instead, they have to be started before the execution of the main program, e.g. by a script. In this Howto, the daemons will be started by the procedure given in start_proc_args.
The daemonless startup is very similar to the task startup of the former MPICH. Although it uses the same scripts as the original $SGE_ROOT/mpi, a copy is included here (with some edits to the templates) so that it can easily be used alongside a still installed $SGE_ROOT/mpi without any side effects.
- gforker: Programs started under gforker are limited to one machine, and additional processes are created only by forks.
Be aware that for each startup method and each chosen way to compile MPICH2, you will get a separate set of mpirun and/or mpiexec commands. They are not interchangeable! Hence, once you have installed mpd and compiled a program to run in the ring, you cannot switch to smpd simply by using a different mpirun or mpiexec. Instead you have to recompile (or at least relink) your program against the libraries intended for that specific startup method. This means that you have to plan your $PATH carefully during compilation and execution of the parallel program to get correct behavior. Not doing so will result in strange error messages which do not point directly to the cause of the trouble. After compiling your application software, it may be advisable not to rely on the $PATH set in your interactive shell at submission time, but to set it explicitly in the script submitted to SGE, as we will do in this Howto for demonstration purposes. Also note that the preferred startup command in MPICH2 is mpiexec, not mpirun.
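A quick way to check which mpiexec will actually be picked up is the shell builtin type (the installation path below is just the example prefix used later in this Howto):

$ export PATH=/home/reuti/local/mpich2_smpd/bin:$PATH
$ type mpiexec     # should point into the intended MPICH2 installation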
First we discuss the integration of the gforker startup method, which is limited to one machine and hence needs no network communication at all. The command line to compile MPICH2 this way is:

$ ./configure --prefix=/home/reuti/local/mpich2_gforker --with-pm=gforker

After the usual make and make install, we can compile the short program supplied in [2] with:
$ mpicc -o mpihello mpihello.c

Although we will run on one machine only, we will use a parallel environment (PE) inside SGE, to conform with the SGE convention that more than one slot is requested by specifying a parallel environment in the submit command. This PE may look like:
$ qconf -sp mpich2_gforker
pe_name           mpich2_gforker
slots             999
user_lists        NONE
xuser_lists       NONE
start_proc_args   /bin/true
stop_proc_args    /bin/true
allocation_rule   $pe_slots
control_slaves    FALSE
job_is_first_task TRUE
urgency_slots     min

Remember to add this PE to a cluster queue of your choice.
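A job script mpich2_gforker.sh for this setup can be very simple; a minimal sketch, assuming the installation prefix from the configure call above, might look like this ($NSLOTS is set by SGE to the number of granted slots):

#!/bin/sh
# Sketch of mpich2_gforker.sh: set the PATH explicitly inside the job
# script (as recommended above) and start one process per granted slot.
export PATH=/home/reuti/local/mpich2_gforker/bin:$PATH
mpiexec -n $NSLOTS ./mpihello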
$ qsub -pe mpich2_gforker 2 mpich2_gforker.sh

And with:
$ rsh node11 ps -e f -o pid,ppid,pgrp,command
  PID  PPID  PGRP COMMAND
...
  787     1   787 /usr/sge/bin/glinux/sge_commd
  789     1   789 /usr/sge/bin/glinux/sge_execd
32146   789 32146  \_ sge_shepherd-11679 -bg
32148 32146 32148      \_ /bin/sh /var/spool/sge/node11/job_scripts/11679
32149 32148 32148          \_ mpiexec -n 2 ./mpihello
32150 32149 32148              \_ ./mpihello
32151 32149 32148              \_ ./mpihello

we already got the proper startup and Tight Integration of all started processes.
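Because the mpihello processes stay in the process chain below the sge_shepherd, their resource usage is included in the accounting of the job. Once the job has finished, this can be checked with qacct (using the job id from the listing above):

$ qacct -j 11679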
To compile MPICH2 for an smpd-based startup, it must first be configured accordingly (after a "make distclean", in case you just walked through the gforker setup in the same source tree):

$ ./configure --prefix=/home/reuti/local/mpich2_smpd --with-pm=smpd --with-pmi=smpd

To get a Tight Integration we need a PE like the following (note the -catch_rsh argument to the start script of the PE):
$ qconf -sp mpich2_smpd_rsh
pe_name           mpich2_smpd_rsh
slots             999
user_lists        NONE
xuser_lists       NONE
start_proc_args   /usr/sge/mpich2_smpd_rsh/startmpich2.sh -catch_rsh $pe_hostfile
stop_proc_args    /usr/sge/mpich2_smpd_rsh/stopmpich2.sh
allocation_rule   $round_robin
control_slaves    TRUE
job_is_first_task FALSE
urgency_slots     min

Please look up in the MPICH2 documentation how to create a .smpd file with a "phrase" in it. After submitting the job in exactly the same way as before (but this time passing the script mpich2_smpd_rsh.sh to qsub):
$ qsub -pe mpich2_smpd_rsh 4 mpich2_smpd_rsh.sh

you should see a distribution on the head node of your job like:
$ rsh node00 ps -e f -o pid,ppid,pgrp,command --cols=80
  PID  PPID  PGRP COMMAND
...
  790     1   790 /usr/sge/bin/glinux/sge_commd
  793     1   793 /usr/sge/bin/glinux/sge_execd
12198   793 12198  \_ sge_shepherd-11691 -bg
12230 12198 12230  |   \_ /bin/sh /var/spool/sge/node00/job_scripts/11691
12231 12230 12230  |       \_ mpiexec -rsh -nopm -n 4 -machinefile /tmp/11691.1.
12232 12231 12230  |           \_ mpiexec -rsh -nopm -n 4 -machinefile /tmp/1169
12233 12231 12230  |           \_ /usr/sge/bin/glinux/qrsh -inherit node00 env P
12265 12233 12230  |           |   \_ /usr/sge/utilbin/glinux/rsh -p 58120 node0
12270 12265 12230  |           |       \_ /usr/sge/utilbin/glinux/rsh -p 58120 n
12234 12231 12230  |           \_ /usr/sge/bin/glinux/qrsh -inherit node00 env P
12266 12234 12230  |           |   \_ /usr/sge/utilbin/glinux/rsh -p 58122 node0
12273 12266 12230  |           |       \_ /usr/sge/utilbin/glinux/rsh -p 58122 n
12235 12231 12230  |           \_ /usr/sge/bin/glinux/qrsh -inherit node01 env P
12267 12235 12230  |           |   \_ /usr/sge/utilbin/glinux/rsh -p 42956 node0
12276 12267 12230  |           |       \_ /usr/sge/utilbin/glinux/rsh -p 42956 n
12236 12231 12230  |           \_ /usr/sge/bin/glinux/qrsh -inherit node04 env P
12268 12236 12230  |               \_ /usr/sge/utilbin/glinux/rsh -p 59836 node0
12275 12268 12230  |                   \_ /usr/sge/utilbin/glinux/rsh -p 59836 n
12261   793 12261  \_ sge_shepherd-11691 -bg
12262 12261 12262  |   \_ /usr/sge/utilbin/glinux/rshd -l
12269 12262 12269  |       \_ /usr/sge/utilbin/glinux/qrsh_starter /var/spool/sg
12283 12269 12283  |           \_ /home/reuti/mpihello
12263   793 12263  \_ sge_shepherd-11691 -bg
12264 12263 12264      \_ /usr/sge/utilbin/glinux/rshd -l
12271 12264 12271          \_ /usr/sge/utilbin/glinux/qrsh_starter /var/spool/sg
12284 12271 12284              \_ /home/reuti/mpihello

The important thing is that the started script, including mpiexec and the program mpihello, is under full SGE control.
(Side note: the default remote shell command compiled into MPICH2 this way is "ssh -x". You may replace it by changing the default value "ssh -x" to a plain "rsh" in the routine mpiexec_rsh() in $MPICH2_ROOT/src/pm/smpd/mpiexec_rsh.c of the MPICH2 source, or change it for each run of your application program by setting the environment variable "MPIEXEC_RSH=rsh; export MPIEXEC_RSH" to get access to the rsh-wrapper, like in the original MPICH integration.)
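A submit script for this daemonless startup might then look roughly like the following sketch. It assumes that the -catch_rsh start procedure has placed an rsh wrapper and a machines file into $TMPDIR, as the scripts derived from the original $SGE_ROOT/mpi setup do; the installation path is again only an example.

#!/bin/sh
# Sketch of a job script for the daemonless smpd startup (mpich2_smpd_rsh).
# $TMPDIR comes first in the PATH so that the rsh wrapper created by
# startmpich2.sh -catch_rsh is found before the system rsh.
export PATH=$TMPDIR:/home/reuti/local/mpich2_smpd/bin:$PATH
# Use rsh (i.e. the wrapper) instead of the compiled-in "ssh -x", see the side note above.
MPIEXEC_RSH=rsh; export MPIEXEC_RSH
# The file name "machines" is assumed here, following the original mpi scripts.
mpiexec -rsh -nopm -n $NSLOTS -machinefile $TMPDIR/machines ./mpihello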
Similar to the PVM integration, we need a small helping program to start the daemons as a task on the slave nodes using the qrsh command. In some way, this start_mpich2 program can be seen as a generic extension of SGE with the ability to run a qrsh command in the background, and it can easily be modified for similar startup methods. If you installed the whole package as suggested in $SGE_ROOT/mpich2_smpd, set the working directory to $SGE_ROOT/mpich2_smpd/src and compile the included program with:
$ ./aimk
$ ./install.sh

The installation process will put the helping program start_mpich2 into a newly created directory $SGE_ROOT/mpich2_smpd/bin, which is the default location where the included script startmpich2.sh looks for this program. A parallel environment for this startup method may look like:
$ qconf -sp mpich2_smpd
pe_name           mpich2_smpd
slots             999
user_lists        NONE
xuser_lists       NONE
start_proc_args   /usr/sge/mpich2_smpd/startmpich2.sh -catch_rsh $pe_hostfile /home/reuti/local/mpich2_smpd
stop_proc_args    /usr/sge/mpich2_smpd/stopmpich2.sh -catch_rsh /home/reuti/local/mpich2_smpd
allocation_rule   $round_robin
control_slaves    TRUE
job_is_first_task FALSE
urgency_slots     min

If we start the daemons on our own, we have to select a free port. Although it may not be safe in all cluster setups, the formula included in startmpich2.sh, stopmpich2.sh and the demonstration submit script mpich2_smpd.sh uses "$JOB_ID MOD 5000 + 20000" for the port. Depending on the job turnaround in your cluster, you may modify it in all locations where it is defined. To force the smpds not to fork themselves into daemon land, they are started with the additional parameter "-d 0". According to the MPICH2 team, this has no speed impact (because the level of debugging is set to 0), but only prevents the daemons from forking.
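Expressed in shell syntax, the port formula corresponds to something like the following sketch (the variable name is illustrative). For the example job 11804 further below this yields 11804 mod 5000 + 20000 = 21804, which matches the port visible in the qrsh commands of the *.po file:

# Derive a per-job port in the range 20000-24999 from the SGE job id.
port=`expr $JOB_ID % 5000 + 20000`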
Having this set up properly, we can submit the demonstration job:

$ qsub -pe mpich2_smpd 2 mpich2_smpd.sh

and observe the distributed tasks on the nodes, after looking at the selected nodes:
$ qstat
job-ID  prior name       user  state submit/start at     queue  master ja-task-ID
---------------------------------------------------------------------------------------------
 11804      0 mpich2_smp reuti r     04/22/2005 01:23:29 para03 MASTER
            0 mpich2_smp reuti r     04/22/2005 01:23:29 para03 SLAVE
 11804      0 mpich2_smp reuti r     04/22/2005 01:23:29 para04 SLAVE

On the head node of the MPICH2 job, a process distribution like the following can be observed:
$ rsh node03 ps -e f -o pid,ppid,pgrp,command --cols=80
  PID  PPID  PGRP COMMAND
...
  789     1   789 /usr/sge/bin/glinux/sge_commd
  792     1   792 /usr/sge/bin/glinux/sge_execd
 4125   792  4125  \_ sge_shepherd-11804 -bg
 4206  4125  4206  |   \_ /bin/sh /var/spool/sge/node03/job_scripts/11804
 4207  4206  4206  |       \_ mpiexec -n 2 -machinefile /tmp/11804.1.para03/mach
 4174   792  4174  \_ sge_shepherd-11804 -bg
 4175  4174  4175      \_ /usr/sge/utilbin/glinux/rshd -l
 4178  4175  4178          \_ /usr/sge/utilbin/glinux/qrsh_starter /var/spool/sg
 4190  4178  4190              \_ /home/reuti/local/mpich2_smpd/bin/smpd -port 2
 4208  4190  4190                  \_ /home/reuti/local/mpich2_smpd/bin/smpd -po
 4209  4208  4190                      \_ /home/reuti/mpihello
 4152     1  4126 /usr/sge/bin/glinux/qrsh -inherit node03 /home/reuti/local/mpi
 4177  4152  4126  \_ /usr/sge/utilbin/glinux/rsh -p 35024 node03 exec '/usr/sge
 4179  4177  4126      \_ [rsh <defunct>]
 4154     1  4126 /usr/sge/bin/glinux/qrsh -inherit node04 /home/reuti/local/mpi
 4185  4154  4126  \_ /usr/sge/utilbin/glinux/rsh -p 35871 node04 exec '/usr/sge
 4187  4185  4126      \_ [rsh <defunct>]

The qrsh commands forked off by startmpich2.sh (and the start_mpich2 program) are no longer bound to the starting script given in start_proc_args, but they neither consume any CPU time nor need to be shut down during a qdel (they are just waiting for the shutdown of the spawned daemons on the slave nodes). The important thing is that the working tasks of mpihello are bound to the process chain, so that the accounting will be correct and a controlled shutdown of the daemons is possible. To give the user some feedback about the started tasks, the *.po<jobid> file will contain the check of the started MPICH2 universe:
$ cat mpich2_smpd.sh.po11804
-catch_rsh /var/spool/sge/node03/active_jobs/11804.1/pe_hostfile /home/reuti/local/mpich2_smpd
node03
node04
startmpich2.sh: check for smpd daemons (1 of 10)
/usr/sge/bin/glinux/qrsh -inherit node04 /home/reuti/local/mpich2_smpd/bin/smpd -port 21804 -d 0
/usr/sge/bin/glinux/qrsh -inherit node03 /home/reuti/local/mpich2_smpd/bin/smpd -port 21804 -d 0
startmpich2.sh: missing smpd on node03
startmpich2.sh: missing smpd on node04
startmpich2.sh: check for smpd daemons (2 of 10)
startmpich2.sh: found running smpd on node03
startmpich2.sh: found running smpd on node04
startmpich2.sh: got all 2 of 2 nodes
-catch_rsh /home/reuti/local/mpich2_smpd

If everything runs fine, you may comment out these lines in the scripts to shorten the output a little and avoid confusing the user. Depending on your personal taste, you may also put the definition of your MPICH2 path in a file like .bashrc, which is sourced during a non-interactive login.
For now, there is no way to direct MPICH2 to use a secondary network interface (except when running under Windows). So it is advisable to prepare the network setup of the cluster such that the primary interface is the one to be used for the MPICH2 communication.
SGE-MPICH2 Integration
[1] Archive with all the scripts used in this Howto: mpich2-60.tgz [for older installations using SGE 5.3: mpich2-53.tgz].
[2] Archive with a small MPICH2 program to check the correct distribution of all the tasks: mpihello.tgz.
MPICH2
The latest version of MPICH2 and build instructions can be downloaded from http://www-unix.mcs.anl.gov/mpi/mpich2.
MPI documentation in general and tutorials
For some general introduction to MPI and MPI-Programming, you can study the following documents:
- http://www.mpi-forum.org/docs
- http://www.netlib.org/utk/papers/mpi-book/mpi-book.html
- http://www-unix.mcs.anl.gov/mpi/usingmpi/index.html
- http://www-unix.mcs.anl.gov/mpi/usingmpi2
- ftp://math.usfca.edu/pub/MPI/mpi.guide.ps
- http://www.science.uva.nl/research/scs/edu/pscs/guide.pdf
- http://www.science.uva.nl/research/scs/edu/distr/guide_to_the_practical_work.pdf