Useful Slurm Commands

 


Basic Slurm commands

  • squeue -u $(whoami) - displays your pending and running jobs along with their job ids
  • sinfo -l -a | grep idle - lists nodes that are currently idle (note: some nodes may be reserved for exclusive use by certain lab groups)
  • scancel <jobid> - removes the job with the given id from the queue, whether it is pending or running (see the example session below)
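
A quick example session tying the three commands together. The job id, job name, partition, and node names below are placeholders for illustration, not real output from any particular cluster:

  $ squeue -u $(whoami)
     JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
    123456 <partition>  myjob   <user>  R    1:23:45      1 <node>
  $ sinfo -l -a | grep idle
  <partition>   up   2-00:00:00   ...   idle   <node list>
  $ scancel 123456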



Advanced Tricks

Delete all your jobs (use with caution)

scancel -u $(whoami)

Delete all your queued jobs only. Leaves all running jobs alone.

scancel -u $(whoami) -t "PENDING" 
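
A few more scancel/squeue filters that can be useful (a sketch; substitute your own job name and partition). scancel can select jobs by name or partition, and squeue's output format options can feed job ids into other commands:

scancel -u $(whoami) --name=<your_job_name>         # cancel only your jobs with this name
scancel -u $(whoami) --partition=<partition_name>   # cancel only your jobs in this partition
squeue -u $(whoami) -h -t PENDING -o "%i" | xargs scancel   # equivalent to the -t "PENDING" form above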


Sample Slurm script for many independent jobs

#!/bin/bash
#SBATCH --time=2-00:00:00  #requested wall time; shorter time limits get higher priority in the queuing system
#SBATCH --nodes=3  #number of nodes to request
#SBATCH --ntasks=<(# cores per node) * (# nodes)>  #this should equal the number of jobs you want to run at the same time
#SBATCH --job-name=<your_job_name>
#SBATCH --output=<std_out_filename>
#SBATCH -p <partition_name>  
 
DOCK_PATH="<path to dock bin/executable>"
JOB_FILE="<your common input file>"
CDIR=$(pwd)
# Assuming your paths to experiments are listed line by line in a file named paths.txt, which should have # paths = # cores = # tasks
while IFS= read -r path; do
  cp "$JOB_FILE" -t "$path"
  cd "$path"
  base=$(basename -s .in "$JOB_FILE")
  # You can modify this srun command based on your requirements: -n1 requests 1 task (one core) for this step,
  # -N1 requests 1 node, --exclusive keeps this step's cores from being shared with other srun steps in the
  # allocation, and -W 0 (--wait=0) tells srun not to kill the remaining tasks after the first task exits
  srun --mem=6090 --exclusive -N1 -n1 -W 0 "$DOCK_PATH"/dock6.rdkit -i "$base.in" -o "$base.out" &
  cd "$CDIR"
done < paths.txt
wait  # necessary to prevent the batch script from exiting and terminating your background srun jobs
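
To run this, save the script (for example as run_many.slurm, a name chosen here just for illustration), list one experiment directory per line in paths.txt, and submit it with sbatch. A hypothetical session:

$ cat paths.txt
/path/to/experiments/run_001
/path/to/experiments/run_002
/path/to/experiments/run_003
$ sbatch run_many.slurm
Submitted batch job 123456
$ squeue -u $(whoami)   # the single batch job fans out into one srun step per line of paths.txt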