Useful Slurm Commands
==Basic Slurm commands==

* squeue -u $(whoami) - displays your running and pending jobs along with their job IDs
* scancel <jobid> - removes the job with the given ID from the queue
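For example, a typical workflow is to submit a batch script with sbatch, check on it with squeue, and cancel it with scancel (the script name and job ID below are made up for illustration):

  sbatch my_job.sh        # submit a batch script; Slurm replies with the new job ID
  squeue -u $(whoami)     # list your jobs and their job IDs
  scancel 123456          # cancel the job with ID 123456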
 
  
  
==Advanced Tricks==

Delete all of your jobs (use with caution):

  scancel -u $(whoami)

Delete only your queued (pending) jobs; running jobs are left alone:

  scancel -u $(whoami) -t "PENDING"

View only the idle nodes, i.e. nodes that are currently available (note: some nodes may be reserved for exclusive use by certain lab groups):

  sinfo -l -a | grep idle

Check out a node to run jobs interactively, so that you don't run computations on the login nodes:

  srun -N 1 -n 28 -t 8:00:00 -p long-28core --pty bash
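If you want a per-partition count of idle nodes rather than the raw grep output, a small sketch along these lines works (sinfo's "%P %D %t" format prints the partition name, node count, and node state for each group of nodes):

  # sum the idle node counts for each partition
  sinfo -h -o "%P %D %t" | awk '$3 == "idle" {idle[$1] += $2} END {for (p in idle) print p, idle[p], "idle"}'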
==Sample slurm script for many independent jobs==

  #!/bin/bash
  #SBATCH --time=2-00:00:00     # shorter time limits get higher priority in the queuing system
  #SBATCH --nodes=3             # number of nodes
  #SBATCH --ntasks=<(# cores per node) * (# nodes)>   # this should equal the number of jobs you want to run at the same time
  #SBATCH --job-name=<your_job_name>
  #SBATCH --output=<std_out_filename>
  #SBATCH -p <partition_name>
  
  DOCK_PATH="<path to dock bin/executable>"
  JOB_FILE="<your common input file>"
  CDIR=$(pwd)
  
  # The paths to your experiment directories are listed one per line in a file named paths.txt;
  # the number of paths should equal the number of cores, i.e. the number of tasks requested above.
  while IFS= read -r path; do
      cp "$JOB_FILE" -t "$path"
      cd "$path"
      base=$(basename -s .in "$JOB_FILE")
      # Modify this srun command to fit your requirements: -n1 requests 1 core for the job step,
      # -N1 requests 1 node, --exclusive keeps these cores from being shared with other job steps,
      # --mem=6090 requests about 6 GB of memory (value in MB), and -W 0 tells srun not to impose a wait time limit.
      srun --mem=6090 --exclusive -N1 -n1 -W 0 "$DOCK_PATH"/dock6.rdkit -i "$base.in" -o "$base.out" &
      cd "$CDIR"
  done < paths.txt
  wait   # necessary to prevent the script from exiting and terminating your background job steps
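To run many docking jobs with this script, list the experiment directories in paths.txt (one per line, ideally with as many lines as --ntasks) and submit the script with sbatch. A usage sketch, assuming the script above is saved as submit_many.sh and your experiments live under a placeholder directory <experiments_dir>:

  ls -d <experiments_dir>/*/ > paths.txt    # one experiment directory per line
  wc -l paths.txt                           # sanity check: should match the --ntasks value
  sbatch submit_many.sh                     # submit the batch script above
  squeue -u $(whoami)                       # monitor the job and its srun job steps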
==Basic PBS commands==

For reference, the older PBS (Torque) queuing system uses the following commands:

* qsub <script> - submits your script to the queue
* qstat - displays a list of jobs in the queue
* qdel <jobid> - removes the job with the given ID from the queue

Use the man or info command in Unix to get more details on the usage of these commands.

Example qstat output:

  Job id                    Name             User            Time Use S Queue
  ------------------------- ---------------- --------------- -------- - -----
  514722.nagling            STDIN            liuxt12         1657:13: R batch
  514723.nagling            STDIN            liuxt12         3366:45: R batch
  514724.nagling            jet3dKn0.002PR20 wli             874:42:4 R batch
  514725.nagling            ...dKn0.002PR100 wli             855:59:2 R batch
  514803.nagling            jet3dKn0.002PR10 wli             596:48:0 R batch
  514809.nagling            SAM              mriessen        542:37:3 R batch
  514811.nagling            STDIN            justin          00:00:32 R batch
  514815.nagling            latency_test     penzhang        0        Q batch
  514822.nagling            flex_1.dock.csh  xinyu           34:26:46 R batch
  514839.nagling            flex_2.dock.csh  xinyu           31:22:33 R batch
  514856.nagling            flex_5.dock.csh  xinyu           26:33:35 R batch
  514859.nagling            ....1_2.dock.csh xinyu           06:13:43 R batch
  514862.nagling            p0.10            hjli            326:10:5 R batch
  514863.nagling            p0.4             hjli            325:49:5 R batch
  514864.nagling            p0.7             hjli            332:11:4 R batch
  514884.nagling            ...1_p0.0_11.csh xinyu           17:16:57 R batch
  514920.nagling            STDIN            lli             00:34:43 R batch

On SeaWulf, the 'nodes' command shows how many nodes are available, allocated, or down:

           avail  alloc  down
  nodes:   35     165    25
  wulfie:  35     165    25

You can also use PBS directives inside your script. These are usually optional, but can be useful; for an example script, see [[NAMD on Seawulf]].

  #PBS -l nodes=8:ppn=2
  #PBS -l walltime=08:00:00
  #PBS -N my_job_name
  #PBS -M user@ic.sunysb.edu
  #PBS -o namd.md01.out
  #PBS -e namd.md01.err
  #PBS -V

Using these PBS directives will name the job, the output files, and so on.

  #PBS -j oe
  #PBS -o pbs.out

This will join the output and error streams into the single output file.

$PBS_O_WORKDIR is an environment variable that contains the path the script was submitted from. Usually you want to define a specific working directory and use that instead of relying on this variable.
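Putting these directives together, a minimal PBS submission script might look like the following sketch (the executable name my_program and its input/output files are placeholders, not part of the original page):

  #!/bin/bash
  #PBS -N my_job_name
  #PBS -l nodes=8:ppn=2
  #PBS -l walltime=08:00:00
  #PBS -j oe
  #PBS -o pbs.out
  #PBS -V
  
  # run from the directory the job was submitted from
  cd $PBS_O_WORKDIR
  # placeholder command - replace with your real executable and input files
  ./my_program input.dat > my_program.log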
 
 
 
==Advanced PBS Tricks==

Delete all of your jobs (either form will work, depending on the version of PBS; replace 'sudipto' with your username):

  qstat -u sudipto | awk -F. '/sudipto/{print $1}' | xargs qdel
  qstat | grep sudipto | awk -F. '{print $1}' | xargs qdel

Delete only your queued jobs; all running jobs are left alone:

  qstat -u sudipto | awk -F. '/00 Q /{printf $1" "}' | xargs qdel
  qstat -u sudipto | awk -F. '/ Q   --/{printf $1" " }' | xargs qdel

List all free nodes in the queue:

  pbsnodes | egrep '^node|state =' | grep -B 1 'state = free' | grep ^node

Submit a short job (here, a node-stats script) to every free node (csh syntax):

  set NODELIST=`pbsnodes | egrep '^node|state =' | grep -B 1 'state = free' | grep ^node`
  foreach node ($NODELIST)
    /usr/local/torque-2.1.6/bin/qsub -l nodes=${node} ${HOME}/get.nodes.stats.csh
  end

Run on a particular type of node:

  qsub -l nodes=1:beta:ppn=1 ${script}
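The node-stats loop can also be written in bash if you prefer it over csh; this sketch reuses the same pbsnodes pipeline:

  # submit the node-stats script to every node that pbsnodes reports as free
  for node in $(pbsnodes | egrep '^node|state =' | grep -B 1 'state = free' | grep ^node); do
      qsub -l nodes=${node} ${HOME}/get.nodes.stats.csh
  done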
 
