Topic:

Tight Integration of the MPICH2 library into SGE.

Author:

Reuti, reuti__at__staff.uni-marburg.de; Philipps-University of Marburg, Germany

Version:

1.0 -- 2005-04-24 Initial release, comments and corrections are welcome

Contents:

    Prerequisites
    Introduction to the MPICH2 family
    Tight Integration of the gforker startup method
    Tight Integration of the daemonless smpd startup method
    Tight Integration of the daemon-based smpd startup method
    Nodes with more than one network interface
    References and Documents

Prerequisites

Configuration of SGE with qconf or the GUI

You should already know how to change settings in SGE, such as setting up and changing a queue definition or the entries in a PE (parallel environment) configuration. Additional information about queues and parallel environments can be found in the SGE man pages "queue_conf" and "sge_pe" (make sure the SGE man pages are included in your $MANPATH).
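
For example, parallel environments and their attachment to a cluster queue can be inspected and changed with commands like the following (mpich2_gforker and all.q are just placeholder names for this illustration):

$ qconf -spl                 # list all defined parallel environments
$ qconf -sp mpich2_gforker   # show one PE definition
$ qconf -ap mpich2_gforker   # add a new PE (opens an editor)
$ qconf -mp mpich2_gforker   # modify an existing PE
$ qconf -mq all.q            # edit a cluster queue, e.g. its pe_list entry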

Target platform

This Howto targets MPICH2 version 1.0.1 on SGE 6.0 under Linux. Most likely it will work in the same way under other operating systems, although some of the commands may then need slight modifications. It will not work this way with MPICH2 version 1.0, as some things that allow an easy Tight Integration were only adjusted in 1.0.1.

MPICH2

MPICH2 is a library from the Argonne National Laboratory (http://www.anl.gov) which implements the MPI-2 standard. Before you start with the integration of MPICH2 into SGE, you should already be familiar with the operation of MPICH2 outside of SGE and know how to compile a parallel program using MPICH2.

Included setups and scripts

The archive supplied in [1] contains the necessary scripts for the smpd startup methods; for the gforker method no scripts are needed at all. It contains scripts and programs similar to those in the PVM and MPICH integration packages distributed with SGE. To install it for common usage in the whole cluster, you may want to untar it in $SGE_ROOT, which creates the new directories $SGE_ROOT/mpich2_smpd and $SGE_ROOT/mpich2_smpd_rsh.
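
Installing it cluster-wide might, for example, look like this (assuming the archive from [1] was downloaded to a location of your choice and unpacks relative to the current directory):

$ cd $SGE_ROOT
$ tar xzf /path/to/mpich2-60.tgz    # creates mpich2_smpd and mpich2_smpd_rsh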

A short program is provided in [2], which will allow you to observe the correct distribution of the spawned tasks.



Introduction to the MPICH2 family

This new MPICH2 implementation of the MPI-2 standard was created to supersede the widely used MPICH implementation. Besides implementing the MPI-2 standard, another goal was a faster startup. To give the user greater flexibility, there are (for now) three startup methods implemented:

  - mpd: a ring of daemons which must already be running on the machines before any job is started
  - smpd: startup either daemon-based or daemonless via rsh/ssh
  - gforker: starts all processes on a single machine by fork and exec

Be aware that for each startup method and each chosen way to compile MPICH2, you will get a separate set of mpirun and/or mpiexec commands. They are not interchangeable! Hence, once you have installed mpd and compiled a program to run in the ring, you can't switch to smpd simply by using a different mpirun or mpiexec. Instead you have to recompile (or at least relink) your program with the libraries intended for this specific startup method. This means that you have to plan your $PATH carefully during compilation and execution of the parallel program to get correct behavior. Not doing so will result in strange error messages which do not point directly to the cause of the trouble. After compiling your application software, it may be advisable not to rely on the $PATH set in your interactive shell for the submission, but to set it explicitly in the script submitted to SGE, as we will do in this Howto for demonstration purposes. Also note that the preferred startup command in MPICH2 is mpiexec, not mpirun.
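
To illustrate this, with the installation prefixes used later in this Howto, selecting the MPICH2 set to work with is just a matter of prepending the matching bin directory to $PATH (the prefixes below are only the ones chosen in this Howto and will differ on your system):

$ PATH=/home/reuti/local/mpich2_gforker/bin:$PATH; export PATH   # gforker-compiled MPICH2
$ which mpicc mpiexec                                            # verify which installation is found first

or, for the smpd-compiled MPICH2:

$ PATH=/home/reuti/local/mpich2_smpd/bin:$PATH; export PATH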



Tight Integration of the gforker startup method

First we discuss the integration of a startup method which is limited to one machine and hence needs no network communication at all. The command line to compile MPICH2 this way is:
$ ./configure --prefix=/home/reuti/local/mpich2_gforker --with-pm=gforker

After the usual make and make install we can compile the short program which is supplied in [2] with:

$ mpicc -o mpihello mpihello.c

Although we will run on only one machine, we will use a parallel environment (PE) inside SGE, to conform with the SGE convention that more than one slot is requested by specifying a parallel environment in the submit command. This PE may look like:

$ qconf -sp mpich2_gforker 
pe_name           mpich2_gforker
slots             999
user_lists        NONE
xuser_lists       NONE
start_proc_args   /bin/true
stop_proc_args    /bin/true
allocation_rule   $pe_slots
control_slaves    FALSE
job_is_first_task TRUE
urgency_slots     min

Remember to add this PE to a cluster queue of your choice.
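
The submitted script mpich2_gforker.sh essentially only needs to set the $PATH and call mpiexec; a minimal sketch (assuming the installation prefix configured above and mpihello sitting in the job's working directory, by default your home directory) could be:

#!/bin/sh
# mpich2_gforker.sh -- sketch of a demonstration submit script
PATH=/home/reuti/local/mpich2_gforker/bin:$PATH
export PATH
# gforker starts all $NSLOTS processes by fork/exec on the local machine
mpiexec -n $NSLOTS ./mpihello

The job can then be submitted with: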

$ qsub -pe mpich2_gforker 2 mpich2_gforker.sh

And with:

$ rsh node11 ps -e f -o pid,ppid,pgrp,command 
  PID  PPID  PGRP COMMAND
...
  787     1   787 /usr/sge/bin/glinux/sge_commd
  789     1   789 /usr/sge/bin/glinux/sge_execd
32146   789 32146  \_ sge_shepherd-11679 -bg
32148 32146 32148      \_ /bin/sh /var/spool/sge/node11/job_scripts/11679
32149 32148 32148          \_ mpiexec -n 2 ./mpihello
32150 32149 32148              \_ ./mpihello
32151 32149 32148              \_ ./mpihello

we can see that we already get the proper startup and Tight Integration of all started processes.



Tight Integration of the daemonless smpd startup method

To compile MPICH2 for an smpd-based startup, it must first be configured appropriately (after a "make distclean", in case you just walked through the gforker setup in the same source tree):
$ ./configure --prefix=/home/reuti/local/mpich2_smpd --with-pm=smpd --with-pmi=smpd
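
After configure, the build and installation follow the usual pattern, and (as explained in the introduction) the example program must be recompiled with the mpicc belonging to this smpd installation, e.g.:

$ make
$ make install
$ PATH=/home/reuti/local/mpich2_smpd/bin:$PATH; export PATH
$ mpicc -o mpihello mpihello.c    # relink against the smpd-based libraries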

To get a Tight Integration, we need a PE like the following (note the -catch_rsh argument to the start script of the PE):

$ qconf -sp mpich2_smpd_rsh
pe_name           mpich2_smpd_rsh
slots             999
user_lists        NONE
xuser_lists       NONE
start_proc_args   /usr/sge/mpich2_smpd_rsh/startmpich2.sh -catch_rsh $pe_hostfile
stop_proc_args    /usr/sge/mpich2_smpd_rsh/stopmpich2.sh
allocation_rule   $round_robin
control_slaves    TRUE
job_is_first_task FALSE
urgency_slots     min
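
The demonstration submit script mpich2_smpd_rsh.sh boils down to a daemonless mpiexec call; the following is only a sketch under the assumptions of this Howto (installation prefix as configured above, rsh wrapper and machine file placed in $TMPDIR by the -catch_rsh option of startmpich2.sh, mpihello in the home directory; the name "machines" for the machine file is an assumption, the actual script is in [1]):

#!/bin/sh
# mpich2_smpd_rsh.sh -- sketch of a demonstration submit script
PATH=$TMPDIR:/home/reuti/local/mpich2_smpd/bin:$PATH
export PATH
# use the rsh wrapper created in $TMPDIR instead of the compiled-in "ssh -x"
MPIEXEC_RSH=rsh; export MPIEXEC_RSH
# -rsh -nopm: start the tasks via rsh without any smpd daemons
mpiexec -rsh -nopm -n $NSLOTS -machinefile $TMPDIR/machines /home/reuti/mpihello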

Please look up in the MPICH2 documentation how to create a .smpd file with a "phrase" in it.
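
Typically this amounts to something like the following (the exact file format is described in the MPICH2 documentation; the passphrase is just a placeholder):

$ echo "phrase=MySecretPhrase" > ~/.smpd
$ chmod 600 ~/.smpd    # keep the passphrase readable only by you

After submitting the job in exactly the same way as before (but this time using the script mpich2_smpd_rsh.sh in the qsub command):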

$ qsub -pe mpich2_smpd_rsh 4 mpich2_smpd_rsh.sh

you should see a distribution on the head node of your job like:

$ rsh node00 ps -e f -o pid,ppid,pgrp,command --cols=80
  PID  PPID  PGRP COMMAND
...
  790     1   790 /usr/sge/bin/glinux/sge_commd
  793     1   793 /usr/sge/bin/glinux/sge_execd
12198   793 12198  \_ sge_shepherd-11691 -bg
12230 12198 12230  |   \_ /bin/sh /var/spool/sge/node00/job_scripts/11691
12231 12230 12230  |       \_ mpiexec -rsh -nopm -n 4 -machinefile /tmp/11691.1.
12232 12231 12230  |           \_ mpiexec -rsh -nopm -n 4 -machinefile /tmp/1169
12233 12231 12230  |           \_ /usr/sge/bin/glinux/qrsh -inherit node00 env P
12265 12233 12230  |           |   \_ /usr/sge/utilbin/glinux/rsh -p 58120 node0
12270 12265 12230  |           |       \_ /usr/sge/utilbin/glinux/rsh -p 58120 n
12234 12231 12230  |           \_ /usr/sge/bin/glinux/qrsh -inherit node00 env P
12266 12234 12230  |           |   \_ /usr/sge/utilbin/glinux/rsh -p 58122 node0
12273 12266 12230  |           |       \_ /usr/sge/utilbin/glinux/rsh -p 58122 n
12235 12231 12230  |           \_ /usr/sge/bin/glinux/qrsh -inherit node01 env P
12267 12235 12230  |           |   \_ /usr/sge/utilbin/glinux/rsh -p 42956 node0
12276 12267 12230  |           |       \_ /usr/sge/utilbin/glinux/rsh -p 42956 n
12236 12231 12230  |           \_ /usr/sge/bin/glinux/qrsh -inherit node04 env P
12268 12236 12230  |               \_ /usr/sge/utilbin/glinux/rsh -p 59836 node0
12275 12268 12230  |                   \_ /usr/sge/utilbin/glinux/rsh -p 59836 n
12261   793 12261  \_ sge_shepherd-11691 -bg
12262 12261 12262  |   \_ /usr/sge/utilbin/glinux/rshd -l
12269 12262 12269  |       \_ /usr/sge/utilbin/glinux/qrsh_starter /var/spool/sg
12283 12269 12283  |           \_ /home/reuti/mpihello
12263   793 12263  \_ sge_shepherd-11691 -bg
12264 12263 12264      \_ /usr/sge/utilbin/glinux/rshd -l
12271 12264 12271          \_ /usr/sge/utilbin/glinux/qrsh_starter /var/spool/sg
12284 12271 12284              \_ /home/reuti/mpihello

The important thing is that the started script, including mpiexec and the mpihello processes, is under full SGE control.

(Side note: the default remote-shell command compiled into MPICH2 this way is "ssh -x". You may change this either by editing the MPICH2 source and replacing the default value "ssh -x" with a plain "rsh" in the routine mpiexec_rsh() in $MPICH2_ROOT/src/pm/smpd/mpiexec_rsh.c, or at each execution of your application program by setting the environment variable MPIEXEC_RSH ("MPIEXEC_RSH=rsh; export MPIEXEC_RSH"), so that the rsh wrapper is used, as in the original MPICH implementation.)



Tight Integration of the daemon-based smpd startup method

Similar to the PVM integration, we need a small helper program to start the daemons as tasks on the slave nodes using the qrsh command. In some sense, this start_mpich2 can be seen as a generic program that extends SGE with the ability to run a qrsh command in the background, and it can easily be modified for similar startup methods.

If you installed the whole package as suggested in $SGE_ROOT/mpich2_smpd, set the working directory to $SGE_ROOT/mpich2_smpd/src and compile the included program with:

$ ./aimk
$ ./install.sh

The installation process will put the helper program start_mpich2 into a newly created directory $SGE_ROOT/mpich2_smpd/bin, which is the default location where the included script startmpich2.sh looks for this program. A parallel environment for this startup method may look like:

$ qconf -sp mpich2_smpd
pe_name           mpich2_smpd
slots             999
user_lists        NONE
xuser_lists       NONE
start_proc_args   /usr/sge/mpich2_smpd/startmpich2.sh -catch_rsh $pe_hostfile /home/reuti/local/mpich2_smpd
stop_proc_args    /usr/sge/mpich2_smpd/stopmpich2.sh -catch_rsh /home/reuti/local/mpich2_smpd
allocation_rule   $round_robin
control_slaves    TRUE
job_is_first_task FALSE
urgency_slots     min

If we start the daemons on our own, we have to select a free port. Although it may not be safe in all cluster setups, the formula included in startmpich2.sh, stopmpich2.sh and the demonstration submit script mpich2_smpd.sh uses "$JOB_ID MOD 5000 + 20000" for the port. Depending on the job turnaround in your cluster, you may modify it in all locations where it is defined. To force the smpds not to fork themselves into daemon land, they are started with the additional parameter "-d 0". According to the MPICH2 team, this has no speed impact (because the debugging level is just set to 0), but only prevents the daemons from forking.
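
A sketch of what the demonstration submit script mpich2_smpd.sh might contain, using the same port formula (installation prefix and location of mpihello as assumed before; for the exact mpiexec options used to contact the smpds on the chosen port, please refer to the actual script in [1]):

#!/bin/sh
# mpich2_smpd.sh -- sketch of a demonstration submit script
PATH=/home/reuti/local/mpich2_smpd/bin:$PATH
export PATH
# same port formula as in startmpich2.sh and stopmpich2.sh
port=$(($JOB_ID % 5000 + 20000))
mpiexec -n $NSLOTS -machinefile $TMPDIR/machines -port $port /home/reuti/mpihello

With this set up properly, we can submit the demonstration job: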

$ qsub -pe mpich2_smpd 2 mpich2_smpd.sh

and observe the distributed tasks on the nodes, after first checking which nodes were selected:

$ qstat
job-ID  prior name       user         state submit/start at     queue      master  ja-task-ID 
---------------------------------------------------------------------------------------------
  11804     0 mpich2_smp reuti        r     04/22/2005 01:23:29 para03     MASTER         
            0 mpich2_smp reuti        r     04/22/2005 01:23:29 para03     SLAVE          
  11804     0 mpich2_smp reuti        r     04/22/2005 01:23:29 para04     SLAVE 

On the head node of the MPICH2 job, a process distribution like the following can be observed:

$ rsh node03 ps -e f -o pid,ppid,pgrp,command --cols=80
  PID  PPID  PGRP COMMAND
...
  789     1   789 /usr/sge/bin/glinux/sge_commd
  792     1   792 /usr/sge/bin/glinux/sge_execd
 4125   792  4125  \_ sge_shepherd-11804 -bg
 4206  4125  4206  |   \_ /bin/sh /var/spool/sge/node03/job_scripts/11804
 4207  4206  4206  |       \_ mpiexec -n 2 -machinefile /tmp/11804.1.para03/mach
 4174   792  4174  \_ sge_shepherd-11804 -bg
 4175  4174  4175      \_ /usr/sge/utilbin/glinux/rshd -l
 4178  4175  4178          \_ /usr/sge/utilbin/glinux/qrsh_starter /var/spool/sg
 4190  4178  4190              \_ /home/reuti/local/mpich2_smpd/bin/smpd -port 2
 4208  4190  4190                  \_ /home/reuti/local/mpich2_smpd/bin/smpd -po
 4209  4208  4190                      \_ /home/reuti/mpihello
 4152     1  4126 /usr/sge/bin/glinux/qrsh -inherit node03 /home/reuti/local/mpi
 4177  4152  4126  \_ /usr/sge/utilbin/glinux/rsh -p 35024 node03 exec '/usr/sge
 4179  4177  4126      \_ [rsh <defunct>]
 4154     1  4126 /usr/sge/bin/glinux/qrsh -inherit node04 /home/reuti/local/mpi
 4185  4154  4126  \_ /usr/sge/utilbin/glinux/rsh -p 35871 node04 exec '/usr/sge
 4187  4185  4126      \_ [rsh <defunct>]

The qrsh commands forked off by startmpich2.sh (and the start_mpich2 program) are no longer bound to the starting script given in start_proc_args, but they neither consume any CPU time nor need to be shut down during a qdel (they are just waiting for the shutdown of the spawned daemons on the slave nodes). The important point is that the working tasks of mpihello are bound to the process chain, so that the accounting will be correct and a controlled shutdown of the daemons is possible. To give the user some feedback about the started tasks, the *.po<jobid> file will contain the check of the started MPICH2 universe:

$ cat mpich2_smpd.sh.po11804
-catch_rsh /var/spool/sge/node03/active_jobs/11804.1/pe_hostfile /home/reuti/local/mpich2_smpd
node03
node04
startmpich2.sh: check for smpd daemons (1 of 10)
/usr/sge/bin/glinux/qrsh -inherit node04 /home/reuti/local/mpich2_smpd/bin/smpd -port 21804 -d 0
/usr/sge/bin/glinux/qrsh -inherit node03 /home/reuti/local/mpich2_smpd/bin/smpd -port 21804 -d 0
startmpich2.sh: missing smpd on node03
startmpich2.sh: missing smpd on node04
startmpich2.sh: check for smpd daemons (2 of 10)
startmpich2.sh: found running smpd on node03
startmpich2.sh: found running smpd on node04
startmpich2.sh: got all 2 of 2 nodes
-catch_rsh /home/reuti/local/mpich2_smpd

If all is running fine, you may comment out the corresponding echo lines in the scripts to shorten this output a little and avoid any confusion for the user. Depending on your personal taste, you may put the definition of your MPICH2 path into a file like .bashrc, which is also sourced during a non-interactive login.
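
For example (assuming the bash shell and the smpd installation prefix used throughout this Howto):

# excerpt from ~/.bashrc
PATH=/home/reuti/local/mpich2_smpd/bin:$PATH
export PATH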



Nodes with more than one network interface

For now, there is no way to direct MPICH2 to use a secondary network interface (except when running under Windows). So it is advisable to prepare the network setup of the cluster in such a way that the primary interface is the one to be used for the MPICH2 communication.


References and Documents

SGE-MPICH2 Integration

[1] Archive with all the scripts used in this Howto: mpich2-60.tgz [for older installations using SGE 5.3: mpich2-53.tgz].

[2] Archive with a small MPICH2 program to check the correct distribution of all the tasks: mpihello.tgz.

MPICH2

The latest version of MPICH2 and build instructions can be downloaded from http://www-unix.mcs.anl.gov/mpi/mpich2.

MPI documentation in general and tutorials

For a general introduction to MPI and MPI programming, you can study the following documents: