Useful Slurm Commands

Basic Slurm commands

  • squeue -u $(whoami) - displays your jobs (running and pending) along with their job ids
  • scancel <jobid> - removes the given job from the queue
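
A minimal end-to-end sketch of the submit / check / cancel cycle (my_job.sh is a placeholder for your own batch script; sbatch --parsable simply prints the job id):

jobid=$(sbatch --parsable my_job.sh)   # submit a placeholder batch script and capture its job id
squeue -u $(whoami)                    # the new job should appear in the list
scancel "$jobid"                       # remove it from the queue if needed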


Advanced Tricks

Delete all your jobs (use with caution)

scancel -u $(whoami)

Delete only your queued (pending) jobs; running jobs are left alone.

scancel -u $(whoami) -t "PENDING" 
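
If you would rather see the job ids first, an equivalent pipeline (a sketch assuming GNU xargs) lists your pending jobs with squeue and feeds the ids to scancel:

squeue -u $(whoami) -h -t PENDING -o %i | xargs -r scancel   # -h drops the header, -o %i prints only job ids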

View only idle nodes - lists nodes that are currently free (note: some idle nodes may be reserved for exclusive use by certain lab groups)

sinfo -l -a | grep idle 
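
An alternative sketch that uses sinfo's own state filter and format string instead of grep, and also reports how many idle nodes each partition has:

sinfo -t idle -o "%P %D %N"   # partition, idle node count, node list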

Check out a node to run jobs interactively (so that you don't tie up the login nodes)

srun -N 1 -n 28 -t 8:00:00 -p long-28core --pty bash
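
Once the shell starts you are on a compute node; a quick sanity check before and after your interactive work might look like this (adjust -N, -n, -t, and -p above to what you actually need):

hostname   # should print a compute node name rather than a login node
exit       # leave the shell when finished to release the allocation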

Sample Slurm script for many independent jobs

#!/bin/bash
#SBATCH --time=2-00:00:00  #shorter time limits are generally scheduled sooner by the backfill scheduler
#SBATCH --nodes=3  #number of nodes
#SBATCH --ntasks=<(cores per node) * (nodes)>  #use the actual number; this should equal the number of jobs you want to run at the same time
#SBATCH --job-name=<your_job_name>
#SBATCH --output=<std_out_filename>
#SBATCH -p <partition_name>  
 
DOCK_PATH="<path to dock bin/executable>"
JOB_FILE="<your common input file>"
CDIR=$(pwd)
# Assuming the paths to your experiment directories are listed one per line in a file named paths.txt; the number of paths should equal the number of cores (= tasks) requested above
while IFS= read -r path; do
  cp "$JOB_FILE" -t "$path"
  cd "$path" || continue
  base=$(basename -s .in "$JOB_FILE")
  # Adjust this srun command to your needs: -n1 requests 1 core for the step, -N1 requests 1 node, --exclusive keeps these cores from being shared with other job steps, --mem is in MB, and -W 0 disables the wait timeout so the step is not killed early
  srun --mem=6090 --exclusive -N1 -n1 -W 0 "$DOCK_PATH"/dock6.rdkit -i "$base.in" -o "$base.out" &
  cd "$CDIR"
done < paths.txt
wait  #necessary: keeps the script alive until all background srun steps have finished; otherwise they would be terminated when the script exits
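
A short usage sketch for the script above, assuming it is saved as submit_dock.sh and your experiment directories live under experiments/ (both names are placeholders):

ls -d "$PWD"/experiments/*/ > paths.txt   # one absolute path per line
wc -l paths.txt                           # should match the --ntasks value in the script
sbatch submit_dock.sh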