SLURM

Header Information

If you are running a simulation that might take more than five minutes to complete, do not run it on the login node. Login nodes are not meant for computing. Instead, write a SLURM script and submit it to the queue.

SLURM stands for Simple Linux Utility for Resource Management and is the software that manages the compute resources available on the cluster. The header of your SLURM script should contain the following lines:

#!/bin/sh 
#SBATCH --partition= 
#SBATCH --time= 
#SBATCH --nodes= 
#SBATCH --ntasks= 
#SBATCH --job-name= 
#SBATCH --output=

where:

  • #!/bin/sh indicates which shell will be used in the script.
  • #SBATCH --partition= specifies the partition that will receive your job, e.g., --partition=long-40core.
  • #SBATCH --time= specifies the total amount of time allocated for your job. For example, for a 48-hour time slot you would write --time=48:00:00.
  • #SBATCH --nodes= specifies the total number of nodes that you will allocate to your job. If you are running a molecular dynamics simulation or a large DOCK6 job, you might need more than one; in that case, remember that your job must be parallelized to take advantage of the extra nodes.
    • A node is a physical computer that handles computing tasks (processes) and runs jobs. It contains CPUs, memory, and devices, and is managed by an operating system.
  • #SBATCH --ntasks= specifies the number of processes to be run.
    • You can allocate more than one CPU per task using the additional header line #SBATCH --cpus-per-task=.
    • If you are running an MPI job, the -np flag passed to the MPI launcher should equal the total number of MPI processes. With the header above, that is simply the value of --ntasks; if you specify --ntasks-per-node instead, multiply it by the number of nodes (see the MPI sketch after this list).
  • #SBATCH --job-name= and #SBATCH --output= define the name of the job -- i.e., a job ID recognizable to you -- and the output file name. Good practice dictates that you build your output file name from the patterns below:
    • %A Job array's master job allocation number.
    • %a Job array ID (index) number.
    • %J jobid.stepid of the running job. (e.g. "128.0")
    • %j jobid of the running job.
    • %N Short hostname. This will create a separate IO file per node.
    • %n Node identifier relative to current job (e.g. "0" is the first node of the running job) This will create a separate IO file per node.
    • %s stepid of the running job.
    • %t Task identifier relative to current job. This will create a separate IO file per task.
    • %u Username.
    • %x Job name.
    • e.g., #SBATCH --output=%x-%j.o. This line will prompt SLURM to create an output file named after your job name and its ID number. A complete example header is shown below.
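
Putting these options together, a minimal example header is shown below. The partition name and the resource numbers are illustrative (sized here for one 40-core node on the long-40core partition); adjust them to your cluster and your job:

 #!/bin/sh
 #SBATCH --partition=long-40core
 #SBATCH --time=48:00:00
 #SBATCH --nodes=1
 #SBATCH --ntasks=40
 #SBATCH --cpus-per-task=1
 #SBATCH --job-name=dock_test
 #SBATCH --output=%x-%j.o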

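If you are running an MPI job, the launch line at the bottom of your script would look something like the sketch below. The executable name my_mpi_program is hypothetical, and the sketch assumes an mpirun launcher is available in your environment (some clusters use srun instead):

 # SLURM_NTASKS holds the value of --ntasks, i.e., the total number of MPI processes
 mpirun -np $SLURM_NTASKS ./my_mpi_program
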
Submitting and checking the status of your job

You submit your job by typing:

 sbatch your_slurm_script_name.sh

You can check the status of your jobs by typing:

 squeue -u your_username
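
For example, a submission might look like the session below. The script name run_md.sh, the job ID, the job name, and the node name are all made up for illustration, and the exact squeue columns can differ between clusters:

 $ sbatch run_md.sh
 Submitted batch job 1234567
 $ squeue -u your_username
    JOBID  PARTITION    NAME      USER ST   TIME  NODES NODELIST(REASON)
  1234567  long-40co  run_md   your_us  R   0:05      1 dn001

In the ST (state) column, PD means the job is still waiting in the queue and R means it is running.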


Troubleshooting Information

Simulation results sometimes depend on the conditions of the machine the job ran on. To avoid having to dig that information up later while troubleshooting your results, it is recommended to add the following lines to your SLURM script so that the run conditions are recorded directly in the job output file:

 echo "============================= SLURM JOB ================================="
 date
 echo 
 echo " The job will be started on the following node(s):"
 echo $SLURM_JOB_NODELIST
 echo
 echo "Slurm user:                   $SLURM_JOB_USER"
 echo "Run directory:                $(pwd)"
 echo "Job ID:                       $SLURM_JOB_ID"
 echo "Job name:                     $SLURM_JOB_NAME"
 echo "Partition:                    $SLURM_JOB_PARTITION"
 echo "Number of nodes:              $SLURM_JOB_NUM_NODES"
 echo "Number of tasks:              $SLURM_NTASKS"
 echo "Submitted from:               $SLURM_SUBMIT_HOST:$SLURM_SUBMIT_DIR"
 echo "========================================================================="

The names after the $ are called environment variables and you should keep track of them. Knowing them will make your life easier, believe me.
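
A quick way to see every SLURM environment variable defined inside a running job is to add a line like the one below to your script; its output will appear in your job output file:

 env | grep "^SLURM"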

Read the Linux wiki page for more information.