Queues are used by the batch scheduler to aid in the organization of jobs. An individual user may have up to 5 jobs eligible to start at any one time (regardless of how many jobs may be already running), while an account may have a total of 10 jobs eligible to run across all the users charging against that account. Jobs in excess of these limits will not be considered for execution. Note that these limits apply to the number of jobs eligible to run, not the number of jobs running.
For example, if you submit 12 jobs, 5 would be eligible and 7 would be blocked (with an "Idle" state). If three of the jobs start running, some blocked jobs are released so that there are still 5 eligible jobs, leaving 4 blocked. This continues until all jobs have run. The limit makes scheduling easier (there are fewer jobs to consider) and prevents a single user from dominating the system with many small jobs.
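The arithmetic in this example can be sketched as a small shell function. This is only an illustration of the per-user limit described above, not something the scheduler exposes; the function name `eligible_and_blocked` and its arguments are hypothetical.

```shell
# Illustrative only: reproduces the arithmetic of the per-user limit.
# Usage: eligible_and_blocked WAITING_JOBS [LIMIT]
#   WAITING_JOBS = jobs submitted but not yet running
#   LIMIT        = eligible-job limit (5 per user, per the text above)
eligible_and_blocked() {
    eb_waiting=$1
    eb_limit=${2:-5}
    if [ "$eb_waiting" -lt "$eb_limit" ]; then
        eb_eligible=$eb_waiting
    else
        eb_eligible=$eb_limit
    fi
    echo "eligible=$eb_eligible blocked=$((eb_waiting - eb_eligible))"
}

eligible_and_blocked 12   # 12 submitted, none running yet
eligible_and_blocked 9    # 3 of the 12 have started
eligible_and_blocked 4    # fewer than 5 still waiting
```

The three calls print `eligible=5 blocked=7`, `eligible=5 blocked=4`, and `eligible=4 blocked=0`. Passing 10 as the second argument models the per-account limit instead.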
Job priority on Kraken is based on the number of cores and the wall clock time requested. Jobs with large core counts (over 32K processors) intentionally get the highest priority on Kraken. Many jobs with small core counts can run on Athena or other TeraGrid systems, so their priority is lower on Kraken. Jobs with smaller core counts do run effectively on Kraken as backfill: while the scheduler is collecting nodes for larger jobs, jobs with short wall clock limits and small core counts may use those nodes temporarily without delaying the start time of the larger job. For a fuller explanation of backfilling jobs and NICS scheduling policies, see NICS Scheduling Policies.
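The backfill idea can be illustrated with a deliberately simplified check. The sketch below assumes only two conditions, that the small job fits in the currently idle cores and that its wall clock limit expires before the larger job is due to start. The real scheduler weighs more factors, and the function name and arguments are hypothetical.

```shell
# Simplified illustration of the backfill condition (not the scheduler's logic).
# Usage: can_backfill JOB_CORES JOB_WALL_MIN IDLE_CORES MIN_UNTIL_LARGE_START
can_backfill() {
    # The job must fit in the idle cores AND finish before the large job starts.
    if [ "$1" -le "$3" ] && [ "$2" -le "$4" ]; then
        echo yes
    else
        echo no
    fi
}

can_backfill 120 60 600 90    # fits, and ends before the large job is due
can_backfill 120 180 600 90   # would still be running when the large job is due
```

The first call prints `yes` and the second prints `no`, which is why requesting a short, realistic wall clock limit improves the chance that a small job starts early.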
By default, jobs on Kraken are sorted into a number of queues based on their size and (for the longsmall queue) their walltime. Long jobs (i.e., those in the longsmall queue) can prevent the machine from being scheduled efficiently, so the longsmall queue is limited to 256 cores across all users. The wall clock limit for all queues on Athena is 72 hours. To get jobs to run more quickly on Kraken, it is highly recommended that you break your jobs into 24-hour segments instead.
| Kraken Queue | Min Size (cores) | Max Size (cores) | Max Wall | Athena Queue | Min Size (cores) | Max Size (cores) | Max Wall |
|---|---|---|---|---|---|---|---|
| small | 0 | 512 | 24:00:00 | small | 0 | 512 | 72:00:00 |
| *longsmall | 0 | 256 | 60:00:00 | | | | |
| medium | 513 | 8192 | 24:00:00 | medium | 513 | 2048 | 72:00:00 |
| large | 8193 | 49536 | 24:00:00 | large | 2049 | 8192 | 72:00:00 |
| capability | 49537 | 98352 | 24:00:00 | | | | |
| dedicated | 98353 | 99072 | 24:00:00 | | | | |
* Requests for jobs on Kraken must be multiples of 12 (4 for Athena). For example, the largest "small" job on Kraken would request 504 cores.
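The multiple-of-12 rule and the size ranges in the table can be checked with a few lines of shell arithmetic. This is a minimal sketch: the function names are hypothetical, and `kraken_queue` looks only at the requested size, so it ignores the walltime rule that places jobs in the longsmall queue.

```shell
# round_down_cores CORES MULTIPLE: largest valid request not above CORES
round_down_cores() { echo $(( $1 / $2 * $2 )); }

# round_up_cores CORES MULTIPLE: smallest valid request not below CORES
round_up_cores() { echo $(( ($1 + $2 - 1) / $2 * $2 )); }

# kraken_queue SIZE: queue name from the Kraken size ranges in the table
# (size only; longsmall also depends on walltime and is not handled here)
kraken_queue() {
    if   [ "$1" -le 512 ];   then echo small
    elif [ "$1" -le 8192 ];  then echo medium
    elif [ "$1" -le 49536 ]; then echo large
    elif [ "$1" -le 98352 ]; then echo capability
    elif [ "$1" -le 99072 ]; then echo dedicated
    else echo "size exceeds the largest queue" >&2; return 1
    fi
}

round_down_cores 512 12    # largest "small" request on Kraken
round_up_cores 1000 12     # 1000 cores rounded up for Kraken
round_up_cores 1000 4      # 1000 cores is already valid on Athena
kraken_queue 1008
```

These calls print `504`, `1008`, `1000`, and `medium`.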
HPSS Queue
The "hpss" queue can be used to transfer files or directories to HPSS using a batch file. Jobs running in this queue are not allocated compute nodes, so the aprun command will fail and should not be added to batch files submitted to this queue. You may only submit jobs to this queue if you logged into a node with your RSA SecurID OTP token. Using job dependencies, you can schedule an HPSS job to stage data before and/or after a normal production job.
The following is an example of a simple HPSS batch file with a dependency. It is submitted with qsub:

    username@kraken:/lustre/scratch/username> qsub filename

The batch file itself:

    #!/bin/bash
    #PBS -A TG-EXAMPLE
    #PBS -l size=0
    #PBS -l walltime=10:00:00
    #PBS -q hpss
    #PBS -W depend=afterok:123456

    cd $PBS_O_WORKDIR
    hsi put file
    htar cvf this_run.tar dir/

- #PBS -q hpss - submit the job to the hpss queue
- #PBS -l size=0 - must be included, as no compute nodes are needed
- #PBS -W depend=afterok:123456 - start the transfer only after job 123456 has completed successfully
- #PBS -l walltime=10:00:00 - the requested wall clock time; the limit for the hpss queue is 24 hours
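Instead of typing the job ID into the batch file by hand, the dependency can be passed on the qsub command line using the ID that qsub prints when the production job is submitted. The following is a sketch under stated assumptions: the function name, the `QSUB` override variable (used here so the logic can be tried without a scheduler), and the file names are all hypothetical.

```shell
# QSUB may be overridden for a dry run; by default the real qsub is used.
QSUB=${QSUB:-qsub}

# Usage: submit_with_staging PRODUCTION_SCRIPT HPSS_SCRIPT
# Submits the production job, then an HPSS job that starts only if the
# production job completes successfully.
submit_with_staging() {
    # qsub prints the new job's ID on standard output.
    sw_prod_id=$($QSUB "$1") || return 1
    # -W depend=afterok:<id> holds the HPSS job until the production job succeeds.
    sw_hpss_id=$($QSUB -W depend=afterok:"$sw_prod_id" "$2") || return 1
    echo "production=$sw_prod_id hpss=$sw_hpss_id"
}

# Example (file names are placeholders):
#   submit_with_staging run.pbs hpss_transfer.pbs
```

When the dependency is given on the command line this way, the `#PBS -W depend=...` line is not needed in the HPSS batch file. The other directives, including `-q hpss` and `-l size=0`, still apply.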

