Useful Slurm Commands

==Basic Slurm commands==
  
* squeue -u $(whoami) - displays your queued and running jobs and their job IDs
* scancel <jobid> - removes the job from the queue
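
For example, the two can be combined to cancel a single job by name. This is a minimal sketch, where my_dock_run is a placeholder job name and squeue's -h (no header) and -o (output format) options are used to print just the ID and name columns:

  # Find the job ID of the job named "my_dock_run" (placeholder), then cancel it
  JOBID=$(squeue -u $(whoami) -h -o "%i %j" | awk '/my_dock_run/ {print $1}')
  scancel "$JOBID"
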
==Advanced Tricks==

Delete all your jobs (use with caution)

  scancel -u $(whoami)

Delete all your queued jobs only; running jobs are left alone

  scancel -u $(whoami) -t "PENDING"
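
The -u filter can be narrowed further with scancel's -n (job name) and -p (partition) options; a minimal sketch with placeholder names:

  scancel -u $(whoami) -n <your_job_name>
  scancel -u $(whoami) -p <partition_name>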

View only idle nodes - displays available nodes (note: some nodes may be reserved for exclusive use by certain lab groups)

  sinfo -l -a | grep idle
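
If you only care about one partition, sinfo can filter directly instead of grepping; a minimal sketch using the long-28core partition from the interactive example below as a placeholder:

  sinfo -p long-28core -t idle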
 
  
Check out a node to run jobs interactively (so that you don't use the login nodes!)

  srun -N 1 -n 28 -t 8:00:00 -p long-28core --pty bash
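
Once the prompt returns you are in a shell on the compute node. A quick sanity check, and remember to exit when you are finished so the allocation is released:

  hostname   # should print a compute node name, not the login node
  nproc      # number of cores visible to this shell
  exit       # release the node when you are done
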
==Sample slurm script for many independent jobs==

  #!/bin/bash
  #SBATCH --time=2-00:00:00                  # lower times get higher priority in the queuing system
  #SBATCH --nodes=3                          # number of nodes
  #SBATCH --ntasks=<(# cores per node) * (# nodes)>   # this should equal the number of jobs you want to run at the same time
  #SBATCH --job-name=<your_job_name>
  #SBATCH --output=<std_out_filename>
  #SBATCH -p <partition_name>
  
  DOCK_PATH="<path to dock bin/executable>"
  JOB_FILE="<your common input file>"
  CDIR=$(pwd)
  
  # Assuming your paths to experiments are listed line by line in a file named paths.txt,
  # which should have (# paths) = (# cores) = (# tasks)
  while IFS= read -r path; do
    cp "$JOB_FILE" -t "$path"
    cd "$path"
    base=$(basename -s .in "$JOB_FILE")
    # You can modify this srun command based on your requirements: -n1 requests 1 core
    # for this job step, -N1 requests 1 node, --exclusive keeps this step's cores from
    # being shared with other steps, and -W 0 tells srun not to impose its own time limit.
    srun --mem=6090 --exclusive -N1 -n1 -W 0 "$DOCK_PATH"/dock6.rdkit -i "$base.in" -o "$base.out" &
    cd "$CDIR"
  done < paths.txt
  wait   # necessary to prevent the script from ending and terminating your background jobs
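
A sketch of how this might be used end to end, assuming the script is saved as many_jobs.sh and each experiment lives in its own directory under experiments/ (both names are placeholders):

  ls -d $PWD/experiments/*/ > paths.txt   # one absolute experiment path per line
  sbatch many_jobs.sh                     # submit the batch script
  squeue -u $(whoami)                     # confirm the job is queued and running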