You must use the batch system to run computations on the cluster. Here you will find tutorials and information on using the batch system.
In brief, the batch software is the TORQUE resource manager running in conjunction with the Maui scheduler. There are three queues configured to handle jobs on either the standard or big-memory compute nodes. Unless you understand the differences between the queues, you should in general stick to submitting jobs to the “normal” queue.
| Batch Name | Notes |
|---|---|
| normal | for jobs on regular computing nodes (n1-n29); job resources limited to 8 nodes and 1 week of wall time; default wall time is 2 days |
| scavenger | for big jobs on regular computing nodes (n1-n29); job resources limited to 1 week of wall time (same as default); low priority compared to normal queue |
| bigmem | for memory- or communication-intensive jobs on bn1 and bn2; job resources limited to 1 week of wall time; default wall time is 2 days |
The resource limits on the queues are intended to be generous. In general, under crowded conditions on the cluster, courtesy dictates that you use much less than the allowed resource limits. For a very long computation, this could mean, for example, having your code checkpoint and terminate itself after a few hours, then resubmit itself to the batch system as a new job to continue the calculation. If you think you need to run a job that exceeds the allowed resource limits, you should consult management.
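As a rough illustration of the checkpoint-and-resubmit idea, here is a minimal sketch in sh. It is not a script provided on the cluster: the names STATE_FILE, TOTAL_CHUNKS, SUBMIT, and run_one_chunk are hypothetical, and the "work" is a placeholder for a few hours of real computation that ends by saving its state.

```shell
#!/bin/sh
# Sketch of a self-resubmitting job. STATE_FILE, TOTAL_CHUNKS, SUBMIT and
# run_one_chunk are hypothetical names chosen for this illustration.
STATE_FILE=${STATE_FILE:-checkpoint.state}   # records the last finished chunk
TOTAL_CHUNKS=${TOTAL_CHUNKS:-3}              # number of chunks in the full run
SUBMIT=${SUBMIT:-qsub}                       # set SUBMIT=echo for a dry run

# Stand-in for a few hours of real work that ends by writing a checkpoint.
run_one_chunk() {
    echo "working on chunk $1"
}

# run_and_resubmit SCRIPT: do the next chunk, record progress, and queue
# SCRIPT again unless the last chunk has just finished.
run_and_resubmit() {
    done_chunks=0
    if [ -s "$STATE_FILE" ]; then
        done_chunks=$(cat "$STATE_FILE")
    fi
    next=$(( done_chunks + 1 ))
    run_one_chunk "$next"
    echo "$next" > "$STATE_FILE"
    if [ "$next" -lt "$TOTAL_CHUNKS" ]; then
        $SUBMIT "$1"
    else
        echo "all $TOTAL_CHUNKS chunks finished"
    fi
}
```

In a batch script, the last line would then be something like run_and_resubmit "$PBS_O_WORKDIR/myjob", where myjob stands for the name of the script file itself. Running with SUBMIT=echo first lets you check the logic without actually queueing anything.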
Below you'll find tutorials and examples to cover most useful situations you'll encounter on the cluster. If you are new to batch systems and you need to run parallel jobs, you should read thoroughly all these sections except maybe for the last one.
A cluster is a lot like a crowded restaurant. To do what you need to do, you have to find a suitably big spot, and then claim it for as long as you need it. In either the cluster or the restaurant, this can become a difficult task if the situation is crowded and the available spots are widely scattered, particularly if you're arriving with a big group of parallel threads/people. On the cluster, for example, you can use the ruptime command to check the cpu load on each node, and find some nodes that are running at less than full capacity. However, this strategy is far from foolproof: the cpu load could be artificially depressed while someone's job is exiting and respawning, or while it is simply bogged down in disk-intensive activities.
The ideal thing is to have a sort of maitre d' on hand, who keeps track of where people are running jobs and what nodes are available. Better yet is if the maitre d' kept a waiting list in the case where the cluster is running at full capacity, so when some nodes free up your calculation could automatically start. This is precisely the function of the batch system. The batch system is one of the main elements that ties the cluster together into a single, coherent entity, rather than just a bunch of computers.
The simplest way to gain access to the compute nodes is to spawn an interactive job, which is essentially the same thing as logging in to a compute node via ssh. (Direct logins to compute nodes are not permitted on the cluster except via the batch system.) This is appropriate for, say, short debugging and testing runs.
The basic command for requests to the batch system is qsub. The simplest request for an interactive batch session is
qsub -I
After executing this command on qc.uoregon.edu, you should get a response like:
qsub: waiting for job 75.qcprivate.cluster to start
qsub: job 75.qcprivate.cluster ready
n29 1 [Thu 09 8:13pm] >
This means the batch system has assigned the number 75 to this batch request, and started a session on one of the cluster nodes (in this case, n29). In the absence of a more specific request, the batch system has assigned a single cpu on a single node, and a time limit of 2 days to this session. Now you can run codes, compile things, etc. When you type “exit” you should get a message such as:
qsub: job 75.qcprivate.cluster completed
This tells you that the job has finished, and you are back on qc.uoregon.edu.
To reserve all 8 cpu cores on a single node, you could use:
qsub -I -l nodes=1:ppn=8
To reserve 4 cpu cores on one of the big-memory nodes, you could say:
qsub -I -q bigmem -l nodes=1:ppn=4
The real power of the batch system lies in the ability to submit a job to launch unattended, when sufficient resources become available (especially for larger, parallel codes). Doing this effectively requires a working knowledge of a scripting language such as bash or Perl, which you will need in any case to do serious computation.
In brief, to submit a job using a batch script, you would use
qsub <script_name>
where <script_name> is the name of the script file. If this means nothing to you, you should go through the rest of this section carefully.
As a basic example, take a look at the following batch script (available on the cluster at ~dsteck/batchexamples/basic).
#!/bin/sh
#PBS -N basic
#PBS -j oe
#PBS -o basic.log
#PBS -m abe
#PBS -q normal
#PBS -l nodes=1:ppn=1
#PBS -l walltime=00:10:00
# go to proper location
cd $PBS_O_WORKDIR
echo beginning job id $PBS_JOBID '('$PBS_JOBNAME')' on host `hostname`
# get processor info
NPROCS=`wc -l < $PBS_NODEFILE`
echo running on $NPROCS processors
# run some jobs (in this case, just sleep for 60 seconds)
sleep 60
This will run a single command (in this case, sleep 60, which is not necessarily the most useful thing to submit to the cluster), but it also sets up a number of other useful things, so this script is a good basic template for you to use. The first line declares this file to be a script to be executed by the sh shell. The next few lines starting with #PBS are ignored by the shell, but they're picked up by the batch system, and allow you to set batch options without using flags to qsub on the command line as in the interactive example above. In the order they appear, these lines do the following: set the name of the job to “basic” (-N); merge the standard output and error streams into a single file (-j); direct standard output (and thus also standard error) to the file “basic.log” (-o); send you email when the job aborts, begins, or ends (-m); request the normal queue (-q); request 1 cpu core on 1 node (-l nodes=<# of nodes>:ppn=<# cpus/node>); and request 10 minutes of run time (-l walltime=hh:mm:ss).
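To make the correspondence between #PBS lines and qsub flags concrete, here is a small sketch that reads a batch script and prints the qsub command line carrying the same options. The helper name pbs_flags is hypothetical, and the demonstration script is a shortened stand-in for the example above.

```shell
# pbs_flags SCRIPT: print a qsub command line carrying the same options
# as the "#PBS" lines in SCRIPT (pbs_flags is a hypothetical helper name).
pbs_flags() {
    # keep only the #PBS lines, strip the prefix, and join them with spaces
    flags=$(sed -n 's/^#PBS  *//p' "$1" | tr '\n' ' ')
    echo "qsub ${flags}$1"
}

# Demonstration on a shortened copy of the basic script
demo=$(mktemp)
printf '#!/bin/sh\n#PBS -N basic\n#PBS -q normal\n#PBS -l walltime=00:10:00\nsleep 60\n' > "$demo"
pbs_flags "$demo"
rm -f "$demo"
```

The printed line has the form "qsub -N basic -q normal -l walltime=00:10:00" followed by the script name. Keeping the options in the script is usually more convenient, since the script then documents exactly how the job was submitted.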
Just a quick note about the “walltime” option: in general, you should have a decent estimate of the run time of your code from testing and debugging runs. When submitting a job, it's best to explicitly specify the run time, including a bit of padding in case the run takes a bit longer than anticipated. Otherwise, the batch system will use the default values (see above for the defaults by queue). Jobs with shorter run times will generally be launched much sooner, as the batch system tries to fill “gaps” where processors are idling while large queued jobs wait for processors to free up to launch. Of course, you should not underestimate the run time, as the batch system will kill your job at the end of the requested wall time.
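As a small worked example of padding an estimate, the sketch below converts a measured run time in seconds into a walltime string in hh:mm:ss form. The helper name pad_walltime and the default padding of 20% are assumptions made for illustration, not a cluster policy.

```shell
# pad_walltime SECONDS [PERCENT]: print SECONDS plus PERCENT padding
# (default 20, rounded up) in the hh:mm:ss form used by -l walltime=...
pad_walltime() {
    est=$1
    pct=${2:-20}
    total=$(( est + (est * pct + 99) / 100 ))
    printf '%02d:%02d:%02d\n' $(( total / 3600 )) $(( (total % 3600) / 60 )) $(( total % 60 ))
}

pad_walltime 500        # 500 s + 20% = 600 s, prints 00:10:00
pad_walltime 3600 50    # 1 h + 50% = 5400 s, prints 01:30:00
```

The result can be pasted into the batch script, or passed directly as in qsub -l walltime=00:10:00 basic.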
The rest of the script makes use of some useful variables set by the batch system. For example, $PBS_O_WORKDIR is the directory from which you issued the qsub command, which is presumably where the executable that you want to launch is located, and thus we cd to that location. The $PBS_NODEFILE variable is the name of a file containing a list of all the nodes that are assigned to this job. If multiple processors on a single node are assigned to the job, the name of the node will be duplicated in this file an appropriate number of times. We are only running a single-processor job here, and so we only use this file to get a count of the number of assigned processors. However, we will use this below when we get to parallel jobs.
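Because node names are repeated in the node file, it is easy to see how the assigned processors are distributed. The sketch below counts the entries per node; the helper name count_per_node is hypothetical, and the mock node file (used only when $PBS_NODEFILE is not set, i.e., outside a batch job) has made-up contents.

```shell
# count_per_node FILE: print each node name in FILE with the number of
# times it appears, i.e. the number of processors assigned on that node.
count_per_node() {
    sort "$1" | uniq -c | awk '{ print $2, $1 }'
}

# Outside a batch job PBS_NODEFILE is unset, so build a small mock file
# (made-up contents, for demonstration only).
nodefile=${PBS_NODEFILE:-}
if [ -z "$nodefile" ]; then
    nodefile=$(mktemp)
    printf 'n29\nn29\nn29\nn29\nn12\nn12\n' > "$nodefile"
fi
count_per_node "$nodefile"
```

With the mock file this prints "n12 2" and "n29 4", one node per line.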
Now let's try it. Assuming you have this script in a file named basic, you would just use
qsub basic
and you'd get a response like
81.qcprivate.cluster
which means your job has the label 81. You can check on the status of your job by running
qstat
As an example, the response might look like this:
qc 44 [Thu 09 8:59pm] > qstat
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
80.qcprivate              STDIN            dsteck          00:00:00 R scavenger
81.qcprivate              basic            dsteck                 0 Q normal
There is already a job running in the scavenger queue (see the “R” status flag?), which is apparently blocking our job, since our job is still queued (see the “Q” status flag?). Once that job finishes, our job launches, as we can see by running qstat again:
qc 45 [Thu 09 8:59pm] > qstat
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
81.qcprivate              basic            dsteck                 0 R normal
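If you want to check a job's status from a script, the state column can be picked out of the qstat listing with awk. This is only a sketch: the helper name job_state is hypothetical, and it assumes the default six-column qstat layout shown in the listings here. On the cluster you would use it as qstat | job_state 81; the example feeds it a copy of the sample output instead.

```shell
# job_state JOBNUMBER: read qstat output on stdin and print the state
# letter (R, Q, ...) of the job with that number.
job_state() {
    # field 1 is "<number>.<server>", field 5 is the state column "S"
    awk -v id="$1" '{ split($1, a, "."); if (a[1] == id) print $5 }'
}

job_state 81 <<'EOF'
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
80.qcprivate              STDIN            dsteck          00:00:00 R scavenger
81.qcprivate              basic            dsteck                 0 Q normal
EOF
# prints Q
```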
More detailed information is available by using some flags to qstat:
qc 46 [Thu 09 8:59pm] > qstat -fans

qcprivate.cluster:
                                                                   Req'd  Req'd   Elap
Job ID               Username Queue    Jobname          SessID NDS   TSK Memory Time  S Time
-------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - -----
81.qcprivate.clu     dsteck   normal   basic             11317     1   1    --  48:00 R   --
   n29/0
   --
In particular, this shows that the job has been assigned to processor 0 on node n29. Yet more information about a particular job is available via the checkjob utility, e.g., checkjob 81.
To remove a job from a queue, or to kill a running job, use “qdel #”, where “#” is the job number (e.g., qdel 81).
Information on what nodes are up and/or allocated can be obtained using “pbsnodes -a”.
HPF (High Performance Fortran) allows you to easily convert a well-written Fortran 90 code into a parallel code simply by inserting a few directives (ignored by F90 compilers) to tell HPF how to parallelize things. HPF is available on the cluster via the Portland Group CDK.
As an example, we will run the “matmul.F” that comes with the PG CDK (a slightly tweaked version is at ~dsteck/batchexamples/matmul.F). Copy this to your working location, then compile it using PG's suggested options (the -DPGF90 is specific to this example):
pghpf -fast -Mautopar -Minfo -DPGF90 matmul.F -o matmul_hpf -V
This generates the matmul_hpf executable. To run this code on multiple processors, codes compiled by pghpf need to know how many processors are available and how to access them. In the batch script below, we communicate these by setting some environment variables, running the code on 4 processors for up to 15 minutes. This script is located at ~dsteck/batchexamples/run_matmul.
Note: you should always request the minimum number of nodes for the number of processors you want. That is, if you need 8 processors, request 8 cpus on 1 node, rather than 4 cpus on each of 2 nodes.
#!/bin/sh
#PBS -N matmul
#PBS -j oe
#PBS -o matmul.log
#PBS -m abe
#PBS -q normal
#PBS -l nodes=1:ppn=4
#PBS -l walltime=00:15:00
# go to proper location
cd $PBS_O_WORKDIR
echo beginning job id $PBS_JOBID '('$PBS_JOBNAME')' on host `hostname`
# get processor info
NPROCS=`wc -l < $PBS_NODEFILE`
echo running on $NPROCS processors
# run parallel job
export PGHPF_NP=$NPROCS
export PGHPF_STAT=all
export PGHPF_HOST=-file=$PBS_NODEFILE
./matmul_hpf
MPI (Message-Passing Interface) is a lower-level method for handling communications between different processes. This can be more efficient than HPF, but requires more effort and planning to code. MPI is supported through MPICH libraries via either the Intel or PG compilers.
As an example, we will run a modified version of the “pi3f90.f90” program that comes with the MPICH distribution (at ~dsteck/batchexamples/pi3f90.f90), which computes an integral as a numerical approximation to pi. Copy this to your working location, then compile it with the mpich interface to the ifort (Intel F90) compiler, with some options to tune for qc's processors:
/opt/mpich/intel/bin/mpif90 -O3 -static-intel -tune pn4 -xSP pi3f90.f90 -o pi3
A script to run this on 4 processors is below. Note that we use the full path to mpirun to avoid confusion with other versions designed to work with other compilers. (Find this script at ~dsteck/batchexamples/run_pi3.)
#!/bin/sh
#PBS -N pi3
#PBS -j oe
#PBS -o pi3.log
#PBS -m abe
#PBS -q normal
#PBS -l nodes=1:ppn=4
#PBS -l walltime=00:15:00
# go to proper location
cd $PBS_O_WORKDIR
echo beginning job id $PBS_JOBID '('$PBS_JOBNAME')' on host `hostname`
# get processor info
NPROCS=`wc -l < $PBS_NODEFILE`;
echo running on $NPROCS processors
# run parallel job (with timing info)
time \
/opt/mpich/intel/bin/mpirun -v -machinefile $PBS_NODEFILE -np $NPROCS pi3
To instead use the Portland Group compilers, the options are a bit different for compilation:
/opt/pgi/linux86-64/7.2/mpi/mpich/bin/mpif90 -fast -Minfo pi3f90.f90 -o pi3
The script to run this on 4 processors is similar; just notice again the full path to mpirun (and to the compiler in the step above). (Find this script at ~dsteck/batchexamples/run_pi3_pg.)
#!/bin/sh
#PBS -N pi3
#PBS -j oe
#PBS -o pi3.log
#PBS -m abe
#PBS -q normal
#PBS -l nodes=1:ppn=4
#PBS -l walltime=00:15:00
# go to proper location
cd $PBS_O_WORKDIR
echo beginning job id $PBS_JOBID '('$PBS_JOBNAME')' on host `hostname`
# get processor info
NPROCS=`wc -l < $PBS_NODEFILE`;
echo running on $NPROCS processors
# run parallel job (with timing info), this time compiled with PG compilers
time \
/opt/pgi/linux86-64/7.2/mpi/mpich/bin/mpirun -v -machinefile $PBS_NODEFILE -np $NPROCS pi3
Note: again, you should always request the minimum number of nodes for the number of processors you want. That is, if you need 8 processors, request 8 cpus on 1 node, rather than 4 cpus on each of 2 nodes.
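The rule above amounts to a rounding-up division. The sketch below prints a resource request with the fewest nodes for a given processor count. It assumes 8 cores per node, as in the ppn=8 example for the regular nodes; the helper name min_nodes_request and the CORES_PER_NODE variable are hypothetical.

```shell
# Assumption: 8 cores per node (as in the nodes=1:ppn=8 example);
# override CORES_PER_NODE if the nodes you are targeting differ.
CORES_PER_NODE=${CORES_PER_NODE:-8}

# min_nodes_request NPROCS: print the -l request that uses the fewest nodes
min_nodes_request() {
    nprocs=$1
    nodes=$(( (nprocs + CORES_PER_NODE - 1) / CORES_PER_NODE ))  # round up
    ppn=$(( (nprocs + nodes - 1) / nodes ))                      # spread evenly
    echo "nodes=$nodes:ppn=$ppn"
}

min_nodes_request 4     # prints nodes=1:ppn=4
min_nodes_request 8     # prints nodes=1:ppn=8
min_nodes_request 16    # prints nodes=2:ppn=8
```

When the processor count does not divide evenly among the nodes, this sketch rounds ppn up, so the request may cover slightly more processors than asked for.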
In some cases you have a serial code that you want to run many times. Doing so requires a bit more sophistication, and accordingly we will switch to using Perl as our scripting language.
Below is a script implementing the simplest example, where you request two processors on a single node to run two independent jobs in parallel. To do this we use fork to spawn child processes that launch the executables. The script then monitors the processes using the waitpid command with flags of 0 for blocking waits (i.e., the waitpid command does not return until the process exits). This last bit is important, since otherwise the batch system will return the cpus to the pool of idle processors before the programs have finished executing!
To try this example out, get the files code1.f90 and code2.f90, both in ~dsteck/batchexamples/. These just implement 10-second sleeps in Fortran and report that they're done. Compile them using the commands:
ifort -O code1.f90 -o code1
ifort -O code2.f90 -o code2
The following batch script (see ~dsteck/batchexamples/run2jobs) will launch them together, and then wait for both to finish before exiting. We are assuming that all the processes run on a single node; otherwise we would need to use ssh, as in the next example below.
#!/usr/bin/perl
#PBS -N twojobs
#PBS -e twojobs.err
#PBS -o twojobs.out
#PBS -m abe
#PBS -q normal
#PBS -l nodes=1:ppn=2
#PBS -l walltime=00:10:00
### above PBS settings are: name = twojobs, std error -> twojobs.err,
### std out -> twojobs.out, mail notification to the job owner,
### queue = normal, request for 2 processors on 1 node (i.e.
### 2 jobs per node)
use POSIX ":sys_wait_h";
### set locations
# assume the executable names are code1, code2
# assume executables are in the submission directory
$workdir_base = $ENV{'PBS_O_WORKDIR'};
$executable1 = "$workdir_base/code1";
$executable2 = "$workdir_base/code2";
### get node file name from shell
$pbs_nodefile = $ENV{'PBS_NODEFILE'};
### log stuff to standard error
chomp($hostname = `hostname`);
chomp($numprocs = `wc -l <$pbs_nodefile`); $numprocs =~ s/\s//g;
print STDERR "Working base directory is $workdir_base\n";
print STDERR "Running on host $hostname\n";
print STDERR "Time is " . localtime(time) . "\n";
print STDERR "Allocated $numprocs processors\n";
### launch the first executable on the local machine
# launch executable by forking child processes
# keep pid as $child1
print STDERR "running command: $executable1\n";
$child1 = fork();
if ($child1 == 0) { exec($executable1); }
### launch second executable
print STDERR "running command: $executable2\n";
$child2 = fork();
if ($child2 == 0) { exec($executable2); }
### monitor jobs until they finish (using blocking waits)
waitpid($child1, 0);
waitpid($child2, 0);
Here is a more sophisticated example, where n jobs run on a fixed number (≤n) of processors. This is also appropriate where you need to run the same code many times with different parameters (say 128 different parameter sets), but don't have that many processors (say, only 16). The script parses the node file and maintains a job table to keep track of running jobs. In this example, it would keep 16 jobs running at the same time. As soon as a job exits, the script launches the next job on the corresponding node until there are no more jobs. Notice the use of nonblocking waits (using the WNOHANG flag) to monitor the running processes. The script uses ssh to spawn processes on the proper node. (See ~dsteck/batchexamples/runmanyjobs.)
#!/usr/bin/perl
#PBS -N manyjobs
#PBS -e manyjobs.err
#PBS -o manyjobs.out
#PBS -m abe
#PBS -q normal
#PBS -l nodes=2:ppn=8
#PBS -l walltime=00:10:00
### above PBS settings are: name = manyjobs, std error -> manyjobs.err,
### std out -> manyjobs.out, mail notification to the job owner,
### queue = normal, request for 16 processors on 2 nodes (i.e.
### 8 jobs per node)
use POSIX ":sys_wait_h";
use strict;
### set locations
# assume the executable names are of the form
# a.out.1, a.out.2, ..., a.out.n
# or
# executable file.1, executable file.2, ...
my $workdir_base = $ENV{'PBS_O_WORKDIR'};
# on the next line: put everything in the command,
# the dot and last number will be appended to this string
# (this could also be a.out for the other example above)
my $job_base = "$workdir_base/serialexample inputfile";
my $num_jobs = 128; # number of jobs to run
### get node file name from shell
my $pbs_nodefile = $ENV{'PBS_NODEFILE'};
### log stuff to standard error
my ($hostname, $number_of_procs);
chomp($hostname = `hostname`);
chomp($number_of_procs = `wc -l <$pbs_nodefile`); $number_of_procs =~ s/\s//g;
print STDERR "Working base directory is $workdir_base\n";
print STDERR "Running on host $hostname\n";
print STDERR "Time is " . localtime(time) . "\n";
print STDERR "Allocated $number_of_procs processors\n";
### make list of jobs to do, and set up other lists
my $j; my @job_array; my @child_array; my @running_job_array;
for ($j=1; $j<=$num_jobs; $j++) {
push @job_array, $j
}
### make a list of available processors
my @proc_array = ();
open(procfile, "<$pbs_nodefile");
while(<procfile>) { chomp; push @proc_array, $_; }
close(procfile);
### make a job table to keep track of which processors are working;
# a value of 1 means busy, a value of 0 means idle
my @busy_table = ();
for ($j=0; $j<$number_of_procs; $j++) { push @busy_table, 0; }
### run the jobs!
my $running_jobs = 0;
while ( $#job_array >= 0 or $running_jobs > 0 ) {
# spawn new jobs, if needed
while ( $running_jobs < $number_of_procs and $#job_array >= 0 ) {
my $job = shift @job_array;
my $job_command = "$job_base.$job";
# find first free job slot and launch
$j = 0; while($busy_table[$j] == 1) {$j++;}
if ( $j >= $number_of_procs ) {
die "Internal error with allocation table";
}
$busy_table[$j] = 1;
print STDERR "on $proc_array[$j]:\n";
$job_command = "ssh -n $proc_array[$j] 'cd $workdir_base; $job_command'";
print STDERR "Launching job #$job: \n$job_command\n";
print STDERR "\n";
my $child = fork();
if ($child == 0) {
exec($job_command);
} else {
$child_array[$j] = $child;
$running_job_array[$j] = $job;
}
$running_jobs++;
}
# check for finished jobs
for ($j=0; $j<=$#child_array; $j++) {
if ( $busy_table[$j] == 1 ) {
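my $child = waitpid($child_array[$j], WNOHANG);
# with WNOHANG, waitpid returns 0 while the child is still running, the
# child's pid once it has exited, and -1 if the child is already gone
if ( $child == $child_array[$j] or $child == -1 ) {
my $exitstatus = ($child == -1) ? 0 : ($? >> 8);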
print STDERR "Finished job #$running_job_array[$j]";
if ( $exitstatus != 0 ) {print STDERR ", exit status = $exitstatus";}
print STDERR "\n\n";
$child_array[$j] = -1;
$running_job_array[$j] = -1;
$running_jobs--;
$busy_table[$j] = 0;
}
}
}
# wait a bit so we don't overload the processor with admin stuff
sleep 1;
}
To try out this example, we can use the sample code (another Perl script), which sleeps for a few seconds and spews the contents of the input file. (See ~dsteck/batchexamples/serialexample.)
#!/usr/bin/perl
die 'one (filename) argument expected' unless $#ARGV==0;
die 'input argument is not a file' unless -e $ARGV[0];
sleep 5;
chomp($contents = `cat $ARGV[0]`);
print "slept for 5 seconds; contents of file '$ARGV[0]' are '$contents'.\n";
Then, to make the appropriate input files, you would type the following command in bash (type bash first if you aren't using it or aren't sure).
for i in `seq 1 128`; do echo $i > inputfile.$i; done
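For jobs that all fit on a single node, the same "keep a fixed number of slots busy" pattern can also be sketched in shell with xargs -P. This is a swapped-in alternative, not the Perl script above: it does not use ssh, so everything runs on the local node, and -P is a common GNU/BSD extension to xargs rather than part of POSIX. The function name run_pool is hypothetical; as in the Perl script, a dot and the job number are appended to the command string.

```shell
# run_pool NUM_JOBS MAX_PARALLEL JOB_BASE
# Run "JOB_BASE.1" ... "JOB_BASE.NUM_JOBS" on the local node, keeping at
# most MAX_PARALLEL of them running at once (run_pool is a hypothetical name).
run_pool() {
    # xargs hands one job number to each sh invocation: $0 is JOB_BASE and
    # $1 is the number; $0 is left unquoted on purpose so that a command
    # with an argument ("./serialexample inputfile") is split into words.
    seq 1 "$1" | xargs -n 1 -P "$2" sh -c '$0.$1' "$3"
}

# Harmless demonstration: 6 "jobs", 3 at a time, each just echoing its name
run_pool 6 3 "echo job"
```

Inside a batch job requesting nodes=1:ppn=8, the call corresponding to the example above would be run_pool 128 8 "./serialexample inputfile". The order in which the jobs print their output is not fixed, since they run concurrently.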