The Libra Cluster

Submitting Batch Jobs to LoadLeveler

To optimize job throughput and minimize paging on the Libra Cluster, all CPU-intensive jobs (i.e., jobs requiring more than 20 minutes of CPU time) must be submitted as batch jobs to a batch scheduler. The batch scheduling software installed on the Libra Cluster is IBM's LoadLeveler, which uses a load-balancing algorithm to dispatch jobs to the least busy host. In addition, LoadLeveler provides such functions as job prioritizing, hold/release and an easy-to-use graphical user interface for creating, submitting and manipulating jobs. LoadLeveler's functionality is extended via its interface to an external scheduler, the Maui Scheduler, which offers superior backfill scheduling, quality of service control, advanced reservations, and more.

Your home directory resides on the UITS EMC storage system, and is accessible from all nodes. You may submit jobs to LoadLeveler from any of the interactive nodes, libra00, libra01 or libra02, and may read/write your input and output files from/to your home directory or from/to the local /tmp and /scr directories, or to the shared scratch filesystems, /scratch1 and /scratch2. When your job is started, your default login shell is invoked, along with your .login and .cshrc files (for C shell users) or your .profile and .kshrc files (for Korn and Bourne shell users). Be sure that these files do not contain any commands which would not work in a batch environment (e.g., the stty command, which expects stdout to be a terminal).

Before you can submit your job to LoadLeveler, you must first determine what its requirements are in terms of CPU time, memory, and software. Fortran and C jobs and other serial applications such as SAS, SPSS, Matlab, Mathematica, etc. will use class serial. Jobs requiring more than 2GB but less than 4GB of memory should be submitted to class nonshared. Jobs requiring multiple cpus and/or more than 4GB of memory should be submitted to class smp. You must also specify a feature code if the software you require is not available on all nodes which run the desired job class. A description of the Libra Cluster LoadLeveler batch classes and node configuration follows:

LoadLeveler Batch Job Classes


 Class        Description


 serial       1 cpu, < 2GB memory, 14-day wallclock time 
 nonshared    1-2 cpus, < 4GB memory, 14-day wallclock time 
 smp          1-16 cpus, > 4GB memory, 14-day wallclock time
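
The class rules above can be sketched as a small shell function. This is only an illustration: choose_class is a hypothetical helper, not a LoadLeveler command, and it takes memory in whole gigabytes. How it treats jobs needing exactly 2GB or 4GB (both go to nonshared) is an assumption, since the class table does not cover those boundary cases.

```shell
#!/bin/sh
# choose_class CPUS MEM_GB
# Hypothetical helper (not part of LoadLeveler): suggest a job class
# from the number of cpus and the memory, in whole GB, that a job needs.
# Assumption: exactly 2GB or exactly 4GB is treated as nonshared.
choose_class() {
    cpus=$1
    mem_gb=$2
    if [ "$cpus" -gt 16 ]; then
        # The smp class tops out at 16 cpus.
        echo "no class offers more than 16 cpus" >&2
        return 1
    elif [ "$cpus" -gt 2 ] || [ "$mem_gb" -gt 4 ]; then
        echo smp
    elif [ "$cpus" -eq 2 ] || [ "$mem_gb" -ge 2 ]; then
        echo nonshared
    else
        echo serial
    fi
}

choose_class 1 1     # prints: serial
choose_class 2 3     # prints: nonshared
choose_class 16 8    # prints: smp
```

The value printed is what you would place after "#@ class =" in the job script.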

LoadLeveler Node Configuration

See LoadLeveler Configuration.

Sample LoadLeveler Scripts

To communicate with LoadLeveler, you can use the Xwindows graphical user interface, xloadl, or LoadLeveler commands (e.g., llsubmit, llq, llcancel). The xloadl interface allows you to create, submit, query and manipulate your batch jobs. If you do not have an X-capable display or you prefer LoadLeveler command mode, then to submit jobs you must first create a shell script which contains all the necessary information to set up and run the desired tasks. Within the LoadLeveler job script you specify job options using keywords on a line beginning with the special characters #@. Job options include job class, working directory, executable name, software requirements, where to redirect stdin, stdout and stderr, and notification type.

Serial Job Examples

 
#@ class = serial
#@ initialdir = /N/u/joedoe/libra
#@ input = myprogram.input
#@ output = myprogram.output
#@ error = myprogram.error
#@ arguments = arg1 arg2 arg3
#@ executable = myprogram
#@ queue

Note: It is very important that you assign values to output and error if your program writes to stdout and/or stderr. If not specified, these default to /dev/null.
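
Because a missing output or error keyword silently discards what your program prints, it can help to check a job script before submitting it. The following is a minimal sketch of such a check; warn_unset_streams is a hypothetical helper written for this illustration, not a LoadLeveler command, and it only looks for the keywords at the start of a "#@" line.

```shell
#!/bin/sh
# warn_unset_streams JOBFILE
# Hypothetical helper (not part of LoadLeveler): report which of the
# output and error keywords are missing from a job script.
# Returns 0 if both are set, 1 otherwise.
warn_unset_streams() {
    jobfile=$1
    status=0
    for kw in output error; do
        # Look for a line such as "#@ output = somefile".
        if ! grep -q "^#@[[:space:]]*$kw[[:space:]]*=" "$jobfile"; then
            echo "warning: $kw is not set and will default to /dev/null"
            status=1
        fi
    done
    return $status
}
```

For example, running warn_unset_streams on a script that sets output but not error prints a single warning about error.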

If you want to ensure that the output from each job goes to a unique file, you can use the values assigned by LoadLeveler to the Executable, Cluster, and Process values. The Cluster value is a unique jobid; a Process value is assigned to each process queued within a script. Here is an example of a script which executes the myprogram program twice, using different arguments and creating two unique output files:

#@ class = serial
#@ initialdir = /N/u/joedoe/libra/bin
#@ executable = myprogram
#@ output = $(Executable).$(Cluster).$(Process).
#@ arguments = arg1 arg2 arg3
#@ queue
#@ arguments = arg4 arg5 arg6
#@ queue

If you do not specify an executable name, LoadLeveler assumes that anything following the #@ queue statement is the command(s) to be executed. Use this format if you want to run a shell script rather than a compiled program, or if there are several commands which need to be run in sequence. Here is an example:

#@ class = serial
#@ initialdir = /N/u/joedoe/libra
#@ output = myprog.$(Cluster).out
#@ error = myprog.$(Cluster).err
#@ queue
xlf -o myprog myprog.f            
myprog
rm /scr/joedoe/myprog.workfile

The following script submits a SAS job:

#@ class = serial
#@ initialdir = /N/u/joedoe/libra/sasjobs
#@ error = mysasjob.err
#@ output = mysasjob.out
#@ queue
sas mysasjob.sas

The following script submits a Mathematica job:


#@ class = serial
#@ initialdir = /N/u/joedoe/libra/math
#@ input = math.input
#@ output = math.output	
#@ error = math.err
#@ queue
math 

The following script submits an Splus job. Splus is a statistics package that has a node-locked license on libra05; you target your job to the correct node by requesting the defined feature code, in this case "splus":


#@ class = serial
#@ requirements = (Feature == "splus")
#@ initialdir = /N/u/joedoe/libra/Splus
#@ input = Splus.commandfile
#@ output = Splus.output 
#@ error = Splus.error
#@ queue
Splus 

Note: If you type in a requirement that is not available, you will NOT receive a message. LoadLeveler will hold your job in the queue until the resource becomes available. Be sure the resource is specified correctly.

Type llconfig to see the list of defined Feature codes.

Nonshared Job Example

Users of the Gaussian03 software package, which can use more than one cpu, and of other programs which require 1-2 cpus and > 2GB of memory will need to submit their jobs to nodes which are not shared with other users' programs. This is done by submitting these jobs to the nonshared class:

#@ class = nonshared
#@ initialdir = /N/u/joedoe/libra/gaussian03
#@ output = run2.output
#@ error = run2.error
#@ queue
setenv g03root /software/sciapps/g03
source $g03root/g03/bsd/g03.login
setenv GAUSS_SCRDIR /scr/$USER
/bin/mkdir $GAUSS_SCRDIR
g03 run2.com run2.log
/bin/rm -R $GAUSS_SCRDIR


Parallel Job Example

The following script submits a 16-cpu OpenMP parallel job. Note that although you specify the number of cpus to reserve using the tasks_per_node keyword, your OpenMP program will only spawn the number of threads specified by the value of the OMP_NUM_THREADS environment variable, so you must set the value in the batch script, as follows:

#@ class = smp
#@ job_type = parallel
#@ node = 1
#@ tasks_per_node = 16
#@ initialdir = /N/u/joedoe/libra/myprogs
#@ executable = myopenmpcode
#@ input = inputfile1
#@ output = $(Executable).$(Cluster).output
#@ error = $(Executable).$(Cluster).error 
#@ environment = COPY_ALL; OMP_NUM_THREADS=16
#@ queue

The Libra Cluster does not have a high-speed low-latency interconnect, and MPI (distributed memory parallel) jobs are not supported.
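
Since the cpus reserved with tasks_per_node and the threads started through OMP_NUM_THREADS are set independently, the two values can drift apart when a script is edited. The sketch below compares them in a job script before submission. check_omp_threads is a hypothetical helper written for this illustration, not a LoadLeveler command, and it assumes OMP_NUM_THREADS is set on the "#@ environment" line as in the OpenMP example above.

```shell
#!/bin/sh
# check_omp_threads JOBFILE
# Hypothetical helper (not part of LoadLeveler): compare the value of
# tasks_per_node with OMP_NUM_THREADS in a job script.
# Returns 0 on a match, 1 on a mismatch, 2 if either value is missing.
check_omp_threads() {
    jobfile=$1
    # N from a line such as "#@ tasks_per_node = N"
    tasks=$(sed -n 's/^#@[[:space:]]*tasks_per_node[[:space:]]*=[[:space:]]*\([0-9][0-9]*\).*/\1/p' "$jobfile" | head -n 1)
    # N from OMP_NUM_THREADS=N on the "#@ environment" line
    threads=$(sed -n '/^#@[[:space:]]*environment/s/.*OMP_NUM_THREADS=\([0-9][0-9]*\).*/\1/p' "$jobfile" | head -n 1)
    if [ -z "$tasks" ] || [ -z "$threads" ]; then
        echo "missing tasks_per_node or OMP_NUM_THREADS"
        return 2
    fi
    if [ "$tasks" -eq "$threads" ]; then
        echo "ok: $tasks cpus reserved, $threads threads"
    else
        echo "mismatch: $tasks cpus reserved, $threads threads"
        return 1
    fi
}
```

Run against the 16-cpu script above, the function would report a match; if tasks_per_node were changed to 8 without updating the environment line, it would report a mismatch.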

LoadLeveler Keywords

The following table lists applicable LoadLeveler keywords, their function, and allowed values.
 
 
+---------+-----------------------------------+-----------------+
|Keyword  |Description                        |Syntax           |
+---------+-----------------------------------+-----------------+
|argu-    |Specifies the list of arguments to |arguments = arg1 |
|ments    |pass to your program when your job |arg2 arg3 ...    |
|         |runs.                              |                 |
+---------+-----------------------------------+-----------------+
|check-   |Specifies whether you want to      |checkpoint = yes |
|point    |checkpoint your program.  If yes,  |  | no           |
|         |your program must first be linked  |                 |
|         |with the LoadLeveler C or FORTRAN  |Default is no.   |
|         |libraries via the llcc or llxlf    |                 |
|         |commands.  Checkpoints occur every |                 |
|         |2 hours and allow jobs to survive  |                 |
|         |machine failures.   Jobs which     |                 |
|         |fork or use signals, dynamic load- |                 |
|         |ing, shared memory, semaphores,    |                 |
|         |messages, internal timers, set     |                 |
|         |user/group id, or are not          |                 |
|         |idempotent (i.e., I/O operations,  |                 |
|         |when repeated, do not yield the    |                 |
|         |same result), should not be check- |                 |
|         |pointed.                           |                 |
+---------+-----------------------------------+-----------------+
|class    |Specifies the name of a job class. |class = serial | |
|         |                                   | nonshared | smp |
|         |The job class is the same as a job |                 |
|         |queue.  Each class has defined     |                 |
|         |CPU time and elapsed time limits.  |Default is serial|
+---------+-----------------------------------+-----------------+
|core_limi|Specifies the maximum size of a    |core_limit =     |
|t        |core file.  When a job exceeds the |hardlimit,softlim|
|         |softlimit, it receives a signal.   |it               |
|         |When a job reaches the hardlimit,  |For example,     |
|         |it is terminated.   Limits are ex- |core_limit =     |
|         |pressed as a number followed by    |1mb,0.8mb        |
|         |units, where units may be b        |                 |
|         |(bytes), w (words), kb (kilo-      |Defaults to the  |
|         |bytes), mb (megabytes), mw (mega-  |AIX user core    |
|         |words), or gb (gigabytes).         |limit of 1mb.    |
+---------+-----------------------------------+-----------------+
|cpu_limit|Specifies the maximum amount of    |cpu_limit =      |
|         |CPU time that a submitted job can  |hardlimit,softlim| 
|         |use.  Express the limit as         |it               |
|         |hours:minutes:seconds. Defaults    |For example,     |
|         |to the cpu limit for the job class.|cpu_limit =      |
|         |                                   |12:00:00,11:50:00|
+---------+-----------------------------------+-----------------+
|data_limi|Specifies the maximum size of the  |data_limit =     |
|t        |data segment to be used by the     |hardlimit,softlim| 
|         |submitted job.                     |it               |
|         |                                   |Defaults to      |
|         |                                   |unlimited.       |
+---------+-----------------------------------+-----------------+
|dependenc|Specifies dependencies between job |dependency =     |
|y        |steps.  Syntax is dependency =     |(step1==0) &&    | 
|         |step_name operator returncode      |(step2 > 0)      |
|         |where step_name must be a previous-|                 |
|         |ly defined job step and operator is|                 |
|         |==,!=,<=,>=,<,>,&&, or ||.         |                 |
+---------+-----------------------------------+-----------------+
|environ- |Specifies your initial environment |environment =    |
|ment     |variables when your job starts.    |env1 ; env2 ;    |
|         |Separate the variables by semico-  |...              |
|         |lons.  Specify COPY_ALL to copy    |                 |
|         |all the environment variables from |For example,     |
|         |your shell, $var to copy an in-    |                 |
|         |dividual variable, !var to prevent |environment =    |
|         |the copying of a variable,         |COPY_ALL ;       |
|         |and var=value to set the value     |!DISPLAY ;       |
|         |of a variable and then copy it.    |                 |
+---------+-----------------------------------+-----------------+
|error    |Specifies the name of the file to  |error = filename |
|         |use as standard error (stderr)     |                 |
|         |when your job runs.                |Defaults to      |
|         |                                   |/dev/null.       |
|         |                                   |                 |
+---------+-----------------------------------+-----------------+
|executa- |Identifies the name of the program |executable =     |
|ble      |to run.  If not specified,         |filename         |
|         |LoadLeveler uses the job script    |                 |
|         |file as the executable.            |                 |
+---------+-----------------------------------+-----------------+
|file_limi|Specifies the maximum size of      |file_limit =     |
|t        |files created by the job.          |hardlimit,softlim| 
|         |                                   |it               |
|         |                                   |Defaults to the  |
|         |                                   |AIX user file    |
|         |                                   |limit of 2GB.    |
+---------+-----------------------------------+-----------------+
|hold     |Specifies whether you want to      |hold = user |    |
|         |place a hold on your program when  |system | usersys |
|         |you submit it.  There are three    |                 |
|         |types of hold: user, system and    |                 |
|         |usersys.   Only a system adminis-  |                 |
|         |trator can release a job in system |                 |
|         |or combined usersys hold.  The     |                 |
|         |user releases the job from user    |                 |
|         |hold using the llhold -r command.  |                 |
+---------+-----------------------------------+-----------------+
|initialdi|The pathname of the directory to   |initialdir =     |
|r        |use as the initial working direc-  |pathname         |
|         |tory during execution of the job.  |                 |
|         |If none is specified, the initial  |                 |
|         |directory is the current working   |                 |
|         |directory at the time you submit-  |                 |
|         |ted the job.                       |                 |
+---------+-----------------------------------+-----------------+
|input    |Specifies the name of the file to  |input = filename |
|         |use as standard input (stdin) when |                 |
|         |your job runs.                     |Defaults to      |
|         |                                   |/dev/null.       |
+---------+-----------------------------------+-----------------+
|job_cpu_ |Specifies the maximum CPU time to  |job_cpu_limit =  |
| limit   |be used by all processes of a job  |  12:00:00.0     |
|         |step.  Syntax is                   |Defaults to class|
|         | hours:minutes:seconds.fraction    |  time limit.    |
+---------+-----------------------------------+-----------------+
|job_name |Specifies the name of the job.     |job_name =       |
|         |Used in long reports for llq and   |  my_awesome_job |
|         |llstatus, and in mail related to   |                 |
|         |the job.                           |                 |
+---------+-----------------------------------+-----------------+
|job_type |Specifies whether the job is       |job_type =       |
|         |a single-cpu job or can run on     |  serial |       |
|         |multiple processors.               |  parallel       |
|         |                                   |Default=serial   |
+---------+-----------------------------------+-----------------+
|node     |Specifies the number of nodes      |node = min,max   |
|         |required for a parallel job.       |   or            |
|         |                                   | node = n        |
+---------+-----------------------------------+-----------------+
|node_-   |Specifies whether the job can share|node_usage =     |
|usage    |the node with other jobs.          | shared | not_   |
|         |                                   | shared          |
|         |                                   | Default=shared  |
+---------+-----------------------------------+-----------------+
|notifi-  |Specifies when the user specified  |notification =   |
|cation   |in notify_user is sent mail.       |always | error | |
|         |                                   |start | never |  |
|         |                                   |complete         |
|         |                                   |                 |
|         |                                   |Default is com-  |
|         |                                   |plete.           |
+---------+-----------------------------------+-----------------+
|notify_us|Specifies the user to whom notifi- |notify_user =    |
|er       |cation mail is sent.               |username         |
|         |                                   |                 |
|         |                                   |Default is job   |
|         |                                   |owner.           |
+---------+-----------------------------------+-----------------+
|output   |Specifies the name of the file to  |output =         |
|         |use as standard output (stdout)    |filename         |
|         |when your job runs.                |                 |
|         |                                   |Defaults to      |
|         |                                   |/dev/null.       |
+---------+-----------------------------------+-----------------+
|prefer-  |List of characteristics that you   |preferences =    |
|ences    |prefer be available on the target  |Boolean ex-      |
|         |machine.   If a machine which      |pression         |
|         |meets the preferences is not       |                 |
|         |available, LoadLeveler will assign |                 |
|         |the job to a machine which meets   |                 |
|         |the requirements.  See the re-     |                 |
|         |quirements keyword, below.         |                 |
+---------+-----------------------------------+-----------------+
|queue    |Places one copy of the job in the  |queue            |
|         |queue.  If desired, you can spec-  |                 |
|         |ify input, output, error and argu- |                 |
|         |ment statements between queue      |                 |
|         |statements.                        |                 |
+---------+-----------------------------------+-----------------+
|require- |List of requirements the remote    |requirements =   |
|ments    |machine must meet to execute the   |Boolean ex-      |
|         |job script.  The requirements sup- |pression         |
|         |ported are:                        |                 |
|         |                                   |Examples:        |
|         |o   Memory  (The amount of phys-   |requirements =   |
|         |    ical memory required in mega-  |(Feature ==      |
|         |    bytes.)                        |     "sas")      |
|         |                                   |                 |
|         |                                   |requirements =   |
|         |o   Feature (Required software or  |((Memory >= 2048)|
|         |    some other locally defined     | && (Feature ==  |
|         |    feature.  See the Available    |       "math"))  |
|         |    Software table in Section 1    |                 |
|         |    for the Feature names associ-  |requirements =   |
|         |    ated with specific software    |(Machine ==      |
|         |    products.)                     |   "libra09")    |
|         |                                   |                 |
|         |o   Machine (hostname of the tar-  |                 |
|         |    get machine)                   |                 |
|         |                                   |                 |
|         |o   Disk (kilobytes of disk space  |                 |
|         |    available in LoadLeveler's     |                 |
|         |    working directory on the tar-  |                 |
|         |    get machine)                   |                 |
|         |                                   |                 |
|         |o   Arch (Target machine's archi-  |                 |
|         |    tecture.  Defaults to that of  |                 |
|         |    the submitting machine.  All   |                 |
|         |    AIX nodes are defined as       |                 |
|         |    "R6000".)                      |                 |
|         |                                   |                 |
|         |o   OpSys (Target machine's oper-  |                 |
|         |    ating system.  Defaults to     |                 |
|         |    that of the submitting ma-     |                 |
|         |    chine.   The Libra nodes are   |                 |
|         |    defined with "AIX53".)         |                 |
+---------+-----------------------------------+-----------------+
|restart  |Restart the job if LoadLeveler     |restart = no     |
|         |  abends or system crashes.        |Defaults to yes  |
+---------+-----------------------------------+-----------------+
|rss_limit|Specifies the maximum resident set |rss_limit =      |
|         |size.                              |hardlimit,softlim| 
|         |                                   |it               |
|         |                                   |Default is       |
|         |                                   |unlimited.       |
+---------+-----------------------------------+-----------------+
|shell    |Specifies the name of the shell to |shell = name     |
|         |use for the job.  If not speci-    |                 |
|         |fied, the shell specified in the   |                 |
|         |owner's password file entry is     |                 |
|         |used.                              |                 |
+---------+-----------------------------------+-----------------+
|stack_lim|Specifies the maximum size of the  |stack_limit =    |
|it       |stack.                             |hardlimit,softlim| 
|         |                                   |it               |
|         |                                   |Default is the   |
|         |                                   |AIX user stack   |
|         |                                   |limit of 2GB.    |
+---------+-----------------------------------+-----------------+
|startdate|Specifies when you want to run the |startdate = date |
|         |job.  Express startdate as         |time             |
|         |MM/DD/YY HH:MM(:SS).               |                 |
|         |                                   |Defaults to cur- |
|         |                                   |rent date and    |
|         |                                   |time.            |
+---------+-----------------------------------+-----------------+
|step_name|Specifies the name of the job step.|step_name =      |
|         |Used for dependencies between job  |   step_1        |
|         |steps.  Do not use T or F or start |                 |
|         |the name with a number.            |Defaults to "0", |
|         |Do not use if tasks_per_node is    | "1","2",...     |
|         |used.                              |                 |
+---------+-----------------------------------+-----------------+
|tasks_per|Specifies the number of tasks to   |tasks_per_node = |
|_node    |run on each node assigned to a     | nn              |
|         |parallel job.  Default is 1.       |                 |
+---------+-----------------------------------+-----------------+
|user_prio|Sets the initial user priority of  |user_priority =  |
|rity     |your job.  It orders jobs you sub- |number           |
|         |mitted with respect to other jobs  |                 |
|         |submitted by you.   Priority can   |Default is 50.   |
|         |be 0 to 100.                       |                 |
+---------+-----------------------------------+-----------------+
|wall_-   |Sets the hard and/or soft limit for|wall_clock_limit |
|clock_-  |elapsed time a job can run.        |  = hardlimit,   |
|limit    |                                   |    softlimit    |
|         |                                   |                 |
+---------+-----------------------------------+-----------------+
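
The step_name and dependency keywords have no example elsewhere in this section, so here is a hedged sketch of a two-step job built from the keywords in the table. The script name compile.sh, the step names, and the output file names are hypothetical. The second step is intended to run only if the first step exits with return code 0.

```shell
# Step 1: build the program (compile.sh is a hypothetical script name).
#@ step_name = compile
#@ class = serial
#@ initialdir = /N/u/joedoe/libra
#@ executable = compile.sh
#@ output = compile.$(Cluster).out
#@ error = compile.$(Cluster).err
#@ queue
# Step 2: run the program only if the compile step returned 0.
#@ step_name = run
#@ dependency = (compile == 0)
#@ class = serial
#@ initialdir = /N/u/joedoe/libra
#@ executable = myprog
#@ output = run.$(Cluster).out
#@ error = run.$(Cluster).err
#@ queue
```

The class and initialdir keywords are repeated in the second step so that the sketch does not rely on any keyword values carrying over from one step to the next.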

Submitting and Monitoring the Job

Once your job script has been created, type chmod u+x scriptname to make the script executable. Then submit the job script to LoadLeveler by typing
            llsubmit  scriptname

If the submission is successful, LoadLeveler returns a jobid. To determine the current status of all submitted jobs, type llq. To see the status of your jobs only, type llq -u youruserid. For a more detailed listing of your jobs' status, type llq -u youruserid -l. If your job seems to be stuck in Idle state, you can get some information about why the job has not been started by typing llq -l -s yourjobid. The Maui command checkjob -v yourjobid will provide further diagnostic information. For an explanation of the columns displayed by llq, type man llq.

Jobs which have been submitted to LoadLeveler, whether waiting to run or already running, may be cancelled via the llcancel command. The syntax is

           llcancel yourjobid

LoadLeveler has been set up on the Libra Cluster so that a user cannot run more than four jobs simultaneously. If you submit more than eight jobs to the queue at one time, LoadLeveler will not consider the additional jobs eligible for queueing until previous jobs have completed. The default LoadLeveler FIFO Scheduler has been replaced by the Maui Scheduler, which uses a fairshare algorithm to determine the dispatch order of submitted jobs. Thus Maui considers both the submit time and each user's recent use of batch cycles when prioritizing jobs. To display fairshare usage statistics, type showfairshare.

To determine the status of all machines running LoadLeveler, type llstatus. Type man llstatus for a description of the fields displayed.