Computing cluster

IRIC has a dedicated computing cluster that uses Slurm to manage jobs across its resources. To access it, you must have a Linux account on our servers and use an SSH client to connect to the master node of the cluster at the following address:

cluster.iric.ca

All user directories are exported to the cluster over NFS, so your data is accessible from any node under the same path (e.g. /u/username).

Submitting a job

Jobs are submitted with the salloc command for interactive jobs and the sbatch command for batch jobs, specifying the resources the job requires. For example, you can connect interactively to a compute node with the following command:

salloc --nodes=1 --ntasks=2 --mem=8gb --time=4:00:00

This command connects you to one of the available nodes, reserving 2 CPU cores and 8 GB of memory for a maximum of 4 hours, after which you will be disconnected. During this time, you can run your programs interactively on this node, which is useful when testing analyses. Once you are confident in your analysis, you will want to submit your jobs in batch mode by creating a script that defines the tasks to be executed, such as:

#!/bin/bash
#SBATCH --mem=2g           # memory
#SBATCH --nodes 1          # nodes
#SBATCH --ntasks=4         # cores
#SBATCH --time=0-04:00     # time (DD-HH:MM)

module load star
module load samtools

STAR ...
samtools ...

The actual submission of this job, for example named test.sh, will then be made with the following command:

sbatch test.sh
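
When submissions are scripted, it can be handy to keep the job ID that sbatch reports, for example to cancel or monitor the job later. The sketch below assumes the usual "Submitted batch job <id>" message that sbatch prints on success; parse_job_id is a hypothetical helper name, not part of Slurm, and the guard simply skips the submission when sbatch or test.sh is not available.

```shell
#!/bin/bash
# parse_job_id: hypothetical helper that extracts the job ID from the
# "Submitted batch job <id>" line printed by sbatch on success.
parse_job_id() {
    local line="$1"
    # Keep only the last space-separated field of the message.
    echo "${line##* }"
}

# Only attempt a real submission when sbatch and the script are available.
if command -v sbatch >/dev/null 2>&1 && [ -f test.sh ]; then
    job_id=$(parse_job_id "$(sbatch test.sh)")
    echo "Submitted job ${job_id}"
fi
```

As an alternative, sbatch also accepts a --parsable option that prints the job ID without the surrounding text.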

Note that job requirements can be specified either in the script or on the command line. The stdout and stderr outputs of your programs are written to the job's working directory. A job that tries to allocate more memory than it requested will be killed, whereas a job that tries to use more cores than it requested will be allowed to run but will remain restricted to the requested number of cores.
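
Because a job stays confined to the cores it requested, it is usually worth passing that number to multithreaded programs instead of hard-coding it. A minimal sketch, assuming the job was submitted with --ntasks so that Slurm sets the SLURM_NTASKS variable inside the job; thread_count is a hypothetical helper and the file names are placeholders.

```shell
#!/bin/bash
# thread_count: hypothetical helper that returns the number of tasks
# reserved for the job, falling back to 1 when run outside of Slurm.
thread_count() {
    echo "${SLURM_NTASKS:-1}"
}

threads=$(thread_count)
echo "Running with ${threads} thread(s)"

# Example use with samtools (placeholder file names), if it is loaded.
if command -v samtools >/dev/null 2>&1 && [ -f input.bam ]; then
    samtools sort -@ "${threads}" -o sorted.bam input.bam
fi
```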

For more details on the sbatch parameters, you can look through the official Slurm documentation.

Local scratch

For each job launched, a temporary folder is created on the compute node and can be referenced through the $TMPDIR variable. Local disk space varies from node to node, so an additional option (--tmp) can be specified to select a node with sufficient disk space for your needs.

sbatch --nodes 1 --tmp=100gb test.sh # To reserve a node with at least 100GB in $TMPDIR
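
Since this scratch folder is local to the compute node, a common pattern is to write intermediate files there and copy the final results back to a shared directory before the job ends. The following is a minimal sketch of that pattern, not a prescribed workflow: the file name result.txt and the echo command stand in for a real analysis, and the fallbacks to /tmp and $PWD are assumptions that only matter when the script is run outside of Slurm.

```shell
#!/bin/bash
# Node-local scratch provided for the job (falls back to /tmp outside Slurm).
scratch="${TMPDIR:-/tmp}"
# Private working folder inside the scratch space.
workdir=$(mktemp -d "${scratch}/job.XXXXXX")
# Directory the job was submitted from (shared over NFS on the cluster).
results_dir="${SLURM_SUBMIT_DIR:-$PWD}"

# Placeholder for the real analysis: write its output to local scratch.
echo "analysis output" > "${workdir}/result.txt"

# Copy the results back to the shared directory, then clean up.
cp "${workdir}/result.txt" "${results_dir}/result.txt"
rm -rf "${workdir}"
```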

Monitoring a job

You can display the status of your jobs with the squeue command. Jobs are cancelled with the scancel command, followed by the ID of the job.

squeue
scancel <jobid>

The general status of the computing cluster can be displayed using:

sinfo

or for more details on a given partition:

sinfo -o "%12N %.10c %8O %.8m %20f %20G %10T %10d %.4w %P" -N -p gpu

The cluster can also be monitored through the Ganglia monitoring tool: http://bioinfo.iric.ca/ganglia.

GPU reservations

Slurm allows you to reserve specific resources for GPU computation using the --gres option. For example:

sbatch --nodes 1 --gres=gpu:1            # Reserve a node and a single card, regardless of its type
sbatch --nodes 1 --gres=gpu:rtx2080:2    # Reserve a node with 2 RTX2080 cards

Available GPU cards are described in the GRES column of the sinfo command. The $CUDA_VISIBLE_DEVICES environment variable will contain the assigned GPU card ID when a job is launched on a node that has multiple GPUs available.
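
Inside a job script, it can be useful to check which cards were actually assigned before starting a long computation. The sketch below assumes $CUDA_VISIBLE_DEVICES holds a comma-separated list of device IDs (for example "0,1"); count_gpus is a hypothetical helper, not a Slurm or CUDA command.

```shell
#!/bin/bash
# count_gpus: hypothetical helper that counts the comma-separated device
# IDs in a CUDA_VISIBLE_DEVICES-style string.
count_gpus() {
    local devices="$1"
    if [ -z "${devices}" ]; then
        echo 0
        return
    fi
    local IFS=','
    local ids
    # Split the string on commas into an array of IDs.
    read -r -a ids <<< "${devices}"
    echo "${#ids[@]}"
}

echo "Assigned GPU IDs: ${CUDA_VISIBLE_DEVICES:-none}"
echo "Number of GPUs: $(count_gpus "${CUDA_VISIBLE_DEVICES:-}")"
```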