Difference between revisions of "2012 DOCK tutorial with Streptavidin"
Stonybrook (talk | contribs) (→Running DOCK in Parallel using MPI) |
Stonybrook (talk | contribs) (→Running DOCK in Serial) |
||
Line 399: | Line 399: | ||
Typically you will use one processor on a single node to dock one ligand. If you are docking multiple ligands, you can use more than processor in parallel mode, but you should never use more processors than you have ligands. | Typically you will use one processor on a single node to dock one ligand. If you are docking multiple ligands, you can use more than processor in parallel mode, but you should never use more processors than you have ligands. | ||
− | ===Running DOCK in Serial=== | + | ===Running DOCK in Serial on a Single Processor=== |
The following sample code can be used to run DOCK on one processor on a single node: | The following sample code can be used to run DOCK on one processor on a single node: |
Revision as of 10:19, 27 February 2012
For additional Rizzo Lab tutorials see DOCK Tutorials.
Use this link Wiki Markup as a reference for editing the wiki.
Contents
- 1 I. Introduction
- 2 II. Preparing the Receptor and Ligand
- 3 III. Generating Receptor Surface and Spheres
- 4 IV. Generating Box and Grid
- 5 V. Docking a Single Molecule for Pose Reproduction
- 6 VI. Virtual Screening
- 7 VII. Running DOCK in Serial and in Parallel on Seawulf
- 8 VIII. Frequently Encountered Problems
I. Introduction
DOCK
DOCK is a molecular docking program used in drug discovery. It was developed by Irwin D. Kuntz, Jr. and colleagues at UCSF (see UCSF DOCK). This program, given a protein binding site and a small molecule, tries to predict the correct binding mode of the small molecule in the binding site, and the associated binding energy. Small molecules with highly favorable binding energies could be new drug leads. This makes DOCK a valuable drug discovery tool. DOCK is typically used to screen massive libraries of millions of compounds against a protein to isolate potential drug leads. These leads are then further studied, and could eventually result in a new, marketable drug. DOCK works well as a screening procedure for generating leads, but is not currently as useful for optimization of those leads.
DOCK 6 uses an incremental construction algorithm called anchor and grow. It is described by a three-step process:
- Rigid portion of ligand (anchor) is docked by geometric methods.
- Non-rigid segments added in layers; energy minimized.
- The resulting configurations are 'pruned' and energy re-minimized, yielding the docked configurations.
Streptavidin & Biotin
Streptavidin is a tetrameric prokaryote protein that binds the co-enzyme biotin with an extremely high affinity. The streptavidin monomer is composed of eight antiparallel beta-strands which folds to give a beta barrel tertiary structure. A biotin binding-site is located at one end of each β-barrel, which has a high affinity as well as a high avidity for biotin. Four identical streptavidin monomers associate to give streptavidin’s tetrameric quaternary structure. The biotin binding-site in each barrel consists of residues from the interior of the barrel, together with a conserved Trp120 from neighbouring subunit. In this way, each subunit contributes to the binding site on the neighboring subunit, and so the tetramer can also be considered a dimer of functional dimers.
Biotin is a water soluble B-vitamin complex which is composed of an ureido (tetrahydroimidizalone) ring fused with a tetrahydrothiophene ring. It is a co-enzyme that is required in the metabolism of fatty acids and leucine. It is also involved in gluconeogenisis.
Organizing Directories
While performing docking, it is convenient to adopt a standard directory structure / naming scheme, so that files are easy to find / identify. For this tutorial, we will use something similar to the following:
~username/AMS536/DOCK-Tutorial/00-original-files/ /01-dockprep/ /02-surface-spheres/ /03-box-grid/ /04-dock/ /05-virtual-screen/
The following sections in this tutorial will refer back to files within these directories.
II. Preparing the Receptor and Ligand
Downloading the PDB Structure (1DF8)
The Protein Data Bank contains atomic coordinates for more than 79,000 molecules and is accessible via the internet from PDB. The target (protein or nucleic acid) is accessible by PDB ID, molecular name, or author. After entering the PDB ID, molecule name, or author, click Download Files, then select PDB File (text). A window will open; click on Save File → Downloads.
Preparing the Ligand and Enzyme in Chimera
If you are on "herbie" you can access Chimera directly by typing in Chimera at the prompt. Otherwise, download Chimera to your desktop and obtain your protein/nucleic acid complex of interest Fetch (by ID).
Open Your Protein of Interest and Delete Unwanted Molecules:
Click on Open under the File menu and find where you saved your pdb file.
You only need one of the monomers to perform docking. Remove chain B by carrying out the following:
- Select → Chain → B → All.
- Actions →Atoms → Delete.
To delete water molecules and/or other ligands, go to Tools → Structure Edit → Dock Prep. Check all boxes and click okay. This will walk you through the steps needed to prepare the complex for docking, and will also assign partial charges to the protein and the ligand. Choose am 1-bcc for charges.
Save the file as 1DF8.dockprep.mol2.
This file contains a conformation of the complex with hydrogen atoms. The grid calculation is based on the receptor with its hydrogen atoms. The grid score is an energy calculation that is based on the following equation:
E GRID = E VDW + E ES. The score is an approximation of the molecular mechanics' energy function and it considers only through space interactions.
Create a Receptor File:
Go to Select → Residue → BTN. Then go to Actions → Atoms → Delete.
Save the file as 1DF8.receptor.mol2.
Create a Receptor File with No Hydrogen Atoms:
Go to Select → Chemistry → Element → H → Delete.
Save a PDB file as 1DF8.receptor.noH.pdb.
Create a Ligand File:
Open only the 1DF8.dockprep.mol2.
Go to Select → Chain → Delete. This will allow you to have only the ligand.
Save the file as 1DF8.ligand.mol2.
These are the files that you will need to continue with rigid or flexible docking.
III. Generating Receptor Surface and Spheres
Receptor Surface
To generate an enzyme surface, first open the receptor pdb file with the hydrogen atoms removed (1DF8.receptor.noH.pdb). Next, go to Actions -> Surface -> Show. Note that for DOCK calculation hydrogen atoms are considered, but for generating enzyme surface and spheres, it is necessary to use the protein without hydrogens.
Recent versions of Chimera include a Write DMS tool that facilitates calculation of the molecular surface. Go to Tools -> Structure Editing -> Write DMS. Save the surface as 1DF8.receptor.dms.
The Write DMS tool will "roll" a small probe (default radius = 1.4 Angstroms = Size of a water molecule) over the surface of the enzyme and calculate the surface normal at each point. DMS (distributed molecular surface) file is subsequently used as input file for sphgen.
Spheres
To generate docking spheres, we need to use a command line program called sphgen. To run the sphgen, we need a input file named INSPH.
1DF8.receptor.dms #molecular surface file that we got from previous step R #sphere outside of surface (R) or inside surface (L) X #specifies subset of surface points to be used (X=all points) 0.0 #prevents generation of large spheres with close surface contacts(defalut=0.0) 4.0 #maximum sphere radius in angstroms (default=4.0) 1.4 #minimum sphere radius in angstroms (default=radius of probe) 1DF8.receptor.sph #clustered spheres file that we want to generate
Save the INSPH file. Then use this command to generate spheres file:
sphgen -i INSPH -o OUTSPH
-i means input; -o means output.
You should get two output files: OUTSPH and 1DF8.receptor.sph. The OUTSPH file should similar to this:
density type = X reading 1DF8.receptor.dms type R # of atoms = 881 # of surf pts = 10771 finding spheres for 1DF8.receptor.dms dotlim = 0.000 radmax = 4.000 Minimum radius of acceptable spheres? 1.4000000 output to 1DF8.receptor.sph clustering is complete 28 clusters
You can also open the spheres file that we generated in this step (1DF8_receptor.sph). This file contains detailed information of the spheres, which are divided into 28 clusters. Cluster 0 in the en d of the spheres file is a combination of all the clusters.
In order to visualize the generated spheres, you can use a program called showsphere. Showsphere is an interactive program. In the command line, simply type
showsphere
You will be asked a following questions:
Enter name of sphere cluster file: 1DF8.receptor.sph Enter cluster number to process (<0 = all): -1 Generate surfaces as well as pdb files (<N>/Y)? N Enter name for output file prefix: 1DF8.output. Process cluster 0 (contains ALL spheres) (<N>/Y)? N
In this case, 28 pdb files should be generated.
An alternative way to do this is to create an input file, showsphere, as follows:
1DF8.re.sph -1 N 1DF8.output N
Then type the command showsphere
showshpere < showsphere.in
IV. Generating Box and Grid
Box
In order to speed up docking calculations, DOCK generates a fine grid, and at each point in the grid electrostatic and a VDW probes' energies are precomputed. The energies are computed using a molecular force field. To determine the dimensions of the grid, however, we first generate a box that contains the outer boundaries for grid calculation. The dimensions and location of the box can be determined using a program called showbox.
First create a directory where you will place the grid files.
$mkdir 03-box-grid $cd 03-box-grid
showbox can be used interactively or a file with predetermined answers can be fed into the program.
The program asks the questions depicted in the diagram the right:
To run the program in the interactive mode, run
$showbox
To feed the answers to the questions, run
$showbox < showbox.in
For example, showbox.in can contain:
Y 5 ../02-surface-spheres/selected_spheres.sph 1 1DF8.box.pdb
Y means we use automatic box construction, 5 is the extra margin to be enclosed around our ligand (in Angstroms), selected_spheres.sph is the sphere file we generated, 1 corresponds to the cluster number in the selected_spheres.sph file, and 1DF8.box.pdb is the output file. We can open the output box file in chimera to make sure the box is in the right place.
Grid
Now let's generate a grid within our box. We will use the energy scoring method to generate a grid, resulting in three additional files with extensions *.nrg, *.bmp, and *.out. The *.nrg file contains the energy scoring, *.bmp contains the size, position and grid spacing and determines whether there are any overlap with receptor atoms.
To generate the grid we will use the grid program. This program can either be used interactively, or an input file can be fed in, just like the showbox program.
Usage: grid [-i [input_file]] [-o [output_file]] ... [-standard_i/o] [-terse] [-verbose] -i: read from grid.in or input_file, standard_in otherwise -o: write to grid.out or output_file (-i required), standard_out otherwise -s: read from and write to standard streams (-i and/or -o illegal) -t: terse program output -v: verbose program output
For our grid.in file, we will use the following answers:
compute_grids yes grid_spacing 0.3 output_molecule no contact_score no energy_score yes energy_cutoff_distance 9999 atom_model a attractive_exponent 6 repulsive_exponent 9 distance_dielectric yes dielectric_factor 4 bump_filter yes bump_overlap 0.75 receptor_file ../01-dockprep/1DF8.receptor.mol2 box_file 1DF8.box.pdb vdw_definition_file /opt/software/AMS536software/dock6/parameters/vdw_AMBER_parm99.defn score_grid_prefix grid
Line by line:
- compute scoring grids (yes)
- what is the distance between grid points along each axis (in Angstroms).
- write up coordinates of the receptor into a new file
- compute contact grid? default is no
- compute energy score? yes - we are using this method to compute force fields on probes
- the max distance between atoms for the energy contribution to be computed
- atom_model u means united atom model where atoms are attached to hydrogens, and a stands for all-atom model, where hydrogens on carbons are treated separately
- attractive component stands for exponent of the attractive LJ term in VDW potential
- repulsive component stands for exponent in the repulsive LJ term in VDW potential
- distance dielectric stands for the dielectric constant to be linearly dependent on distance
- distance dielectric factor is the coefficient of the dielectric
- bump filter flag determines if we want to screen orientation for clashes before scoring and minimization
- bump_overlap stands for the fraction of allowed overlap where 1 corresponds to no allowed overlap and 0 corresponds to full overlap being permitted.
- our receptor file
- the box file we generated in the Box section
- VDW parameters file
- Prefix for the grid file name. All the extensions will be generated automatically.
V. Docking a Single Molecule for Pose Reproduction
Docking and Results
Change directory into “04-dock” and create an empty input file called dock.in
$touch dock.in #create a file called “dock.in”
Run dock6.4
$dock6 -i dock.in
Alternatively:
$dock6 -i dock -v # “-v” option allows you to print out the information regarding each growth step on terminal
Or:
$dock6 -i dock -o dock.bf.out -v # “-o” allows you to write the aforementioned information into an output file named “dock.bf.out”
1st run:
Notice that running dock6 with an empty input file will require you to answer a series of questions. For the first run we will deactivate most of the features by selecting “no”. Parameters requiring specification are listed below:
ligand_atom_file [database.mol2] (): ../01-dockprep/1DF8.ligand.mol2 ligand_outfile_prefix [output] (): 1DF8.output
What the program is doing in the 1st run is to take the ligand.mol2 file and directly generate an output file without any additional manipulations. You would expect it to be exactly the same as the original pose. You can open it in Chimera by using the “ViewDock” function under “Tools->Surface/Binding Analysis”. We will not show the result here.
2nd run:
The real experiment begins here. Notice that we selected “no” for most of the functions. This time we will try change some parameters in the dock.in file.
$vim dock.in
Parameters being changed are listed below:
orient_ligand yes calculate_rmsd no
Note that because of these parameters, additional questions are asked by the program. For simplicity we keep most of the answers as default. File paths requiring specifying are listed below:
receptor_site_file [receptor.sph] ():../02-surface-spheres/selected_spheres_06A.sph vdw_defn_file [vdw.defn] ():../03-grid/vdw_AMBER_parm99.defn flex_defn_file [flex.defn] ():../03-grid/flex.defn flex_drive_file [flex_drive.tbl] ():../03-grid/flex_drive.tbl
The result is shown below. As you may notice the result doesn't look very good.
3rd Run:
Here we will further specify parameters to improve the result.
calculate_rmsd yes score_molecules yes num_scored_conformers 10
Again, further questions will be asked when you run the program. Keep most answers as default. Paths requiring specification are listed below:
grid_score_grid_prefix [grid] ():../03-grid/1DF8.grid
Note that the above specification will tell the program to load the 1DF8.grid.nrg file you generated in the previous step (grid) for scoring.
simplex_max_iterations [1000] ():20 write_conformations [yes] (yes no):no cluster_rmsd_threshold [2.0] ():0.2
Note the program will cluster poses that are very close together (rmsd smaller than the threshold specified) into a cluster.
This time the program returned the best 10 poses. The one with the best grid score (-52.8, shown blue) superimposed quite well with the original pose in the crystal structure (RMSD = 0.71). You can view the grid scores in ViewDock by selecting “Column->Show->Grid Score”
4th Run:
This time we will set the ligand as flexible:
flexible_ligand yes
Further questions will be asked during the run. Keep most of the answers as default except for:
simplex_grow_max_iterations [20] ():500
Notice that the run is significantly slower this time. Again the best pose generated (-64.8) is shown in blue. The grid score improved significantly but the RMSD did not change much (0.79)
5th Run:
This time we will turn the bump filter on:
bump_filter yes
Further questions will be asked during the run. Keeps answers as default. Specify the following path:
bump_grid_prefix [grid] ():../03-grid/1DF8.grid
Note that the path tells the program to access the .bmp file generated in the previous grid step.
The best pose (grid score -65.3) is again shown in blue. Notice that turning on the bump filter does not alter either the grid score or the RMSD (0.78).
Remember you can tweak any parameter in the dock.in file instead of keeping them as default.
Now we will write up a script for submitting your dock job to Seawulf. Create a script called “sub.dock.csh”
$vim sub.dock.csh
wherein you write:
#! /bin/tcsh #PBS -l nodes=1:ppn=1 #PBS -l walltime=01:00:00 #PBS -o zzz.qsub.out #PBS -j oe #PBS -V set nprocs = `wc -l $PBS_NODEFILE | awk '{print $1}'`
echo "Running on ${nprocs} processors"
cd ~/AMS536/DOCK_tutorial/05-dock_qsub
dock6 -i dock.in -o dock.out
This will request 1 processor from the cluster for your job. When you are submitting the job:
$qsub sub.dock.csh
Note that you might have to make the script executable before running it:
$chmod +x sub.dock.csh
Results
VI. Virtual Screening
Virtual Screening Preparation
Virtual Screening Protocol
Virtual Screening Results
VII. Running DOCK in Serial and in Parallel on Seawulf
The Seawulf Cluster has 235 dual processor nodes, with 2 processors per node, for a total of 470 individual processors. These are 3.4Ghz Intel Pentium IV Xeon processors from Dell. Each node has 2GB attached RAM and a 40GB hard disk.
Typically you will use one processor on a single node to dock one ligand. If you are docking multiple ligands, you can use more than processor in parallel mode, but you should never use more processors than you have ligands.
Running DOCK in Serial on a Single Processor
The following sample code can be used to run DOCK on one processor on a single node:
#!/bin/tcsh #PBS -l nodes=1:ppn=1 #PBS -l walltime=01:00:00 #PBS -N dock6 #PBS -M user@ic.sunysb.edu #PBS -j oe #PBS -o pbs.out cd /nfs/user03/sudipto/DOCK_Tutorial /nfs/user03/sudipto/dock6/bin/dock6 -i dock.in -o dock.out
Here is an explanation of the code and format:
#!/bin/tcsh #Execute script with tcsh #PBS -l nodes=1:ppn=1 #Use one node, and one processor per node, so one single processor #PBS -l walltime=01:00:00 #Allow 1 hour for your job run #PBS -N dock6 #Name of your job #PBS -M user@ic.sunysb.edu #Get an email notifying you when your job is completed #PBS -j oe #Combine the output and error files into a single output #PBS -o pbs.out #Name of your output file cd /path-to-your-home-directory-on-seawulf/DOCK_Tutorial #Change to your home directory and folder with dock files /nfs/user03/sudipto/dock6/bin/dock6 -i dock.in -o dock.out #Specifies path to dock executable and provide input and output filenames
Running DOCK in Parallel using MPI
The following sample code can be used to run DOCK on 4 nodes, using both processors on each node, for a total of 8 processors.
#! /bin/tcsh #PBS -l nodes=4:ppn=2 #PBS -l walltime=24:00:00 #PBS -o zzz.qsub.out #PBS -j oe #PBS -V set nprocs = `wc -l $PBS_NODEFILE | awk '{print $1}'` echo "Running on ${nprocs} processors" cd /nfs/user03/sudipto/DOCK_Tutorial cp /nfs/user03/sudipto/dock6/parameters/vdw_AMBER_parm99.defn . cp /nfs/user03/sudipto/dock6/parameters/flex* . mpirun -np $nprocs /nfs/user03/sudipto/dock6/bin/dock6.mpi -i dock.in -o dock.out
For more information, see PBS Queue