Test Set Construction SB2024 V1 DOCK6.10 A

Latest revision as of 13:13, 14 July 2025

The purpose of this tutorial is to develop a uniform method for adding systems to the Rizzo Lab test set, and rebuilding the set from initial files. This is the protocol to be used by all lab members. If the protocol is changed, a new tutorial should be made, preserving this tutorial.

I. Introduction
II. Preliminary File Preparation
III. Scripts for System Preparation
III.A. Sourcing Variables
III.B. Ligand Preparation
III.C. Receptor Preparation
III.D. Sphere Generation
III.E. Single Grid Generation
III.F. Multi Grid Generation
III.G. Bookkeeping of Heavy Atoms
III.H. Interpreting Bookkeeping Spreadsheet
IV. Test Cases for Preparation Integrity
V. Organizing and Protecting Files
VI. Notes on Different Folders
VII. Notes on Editing/ Correcting Set
VIII. To Do

I. Introduction

A key development for this tutorial is a set of bookkeeping scripts that track the count of heavy atoms during processing and flag unaccounted changes (III.G). Test cases are also included; these should be visually inspected for the indicated elements to confirm that processing occurred as expected (IV).

    It is critical ALL of these steps are completed.

Careful consideration should be given before adding a new system to the test set. Some key criteria include, but are not limited to:

(1) There are no incomplete residues or loops within 8 Angstroms of ligand.

(2) There are no non-standard residues within 8 Angstroms of the ligand, except for MSE, CME, CSD, CSO, CCS, SME and MHO, which are converted to standard residues by the script fix_nonstandard residues. An example of a PDB which should NOT be included is 1A8I, which has the non-standard residue LLP within 8 Angstroms of the ligand to be docked.
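Criterion (2) can be pre-screened before visual inspection. The following is a minimal sketch, not part of the Testset_Protocols repository: the function name nonstandard_near_ligand is hypothetical, and it assumes fixed-column PDB ATOM/HETATM records and a list of ligand coordinates supplied by the caller. Retained hetero groups such as monoatomic ions and heme would also be reported by this sketch and have to be judged separately.

```python
import math

# Residues the tutorial lists as convertible to standard residues.
CONVERTIBLE = {"MSE", "CME", "CSD", "CSO", "CCS", "SME", "MHO"}
# The 20 standard amino acids.
STANDARD = {
    "ALA", "ARG", "ASN", "ASP", "CYS", "GLN", "GLU", "GLY", "HIS", "ILE",
    "LEU", "LYS", "MET", "PHE", "PRO", "SER", "THR", "TRP", "TYR", "VAL",
}

def nonstandard_near_ligand(pdb_lines, ligand_xyz, cutoff=8.0):
    """Return sorted (resname, chain, resnum) tuples for residues that are
    neither standard nor convertible and have an atom within cutoff
    Angstroms of any ligand atom."""
    flagged = set()
    for line in pdb_lines:
        if not line.startswith(("ATOM", "HETATM")):
            continue
        resname = line[17:20].strip()
        if resname in STANDARD or resname in CONVERTIBLE:
            continue
        xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
        if any(math.dist(xyz, lig) <= cutoff for lig in ligand_xyz):
            flagged.add((resname, line[21], int(line[22:26])))
    return sorted(flagged)
```

For a structure like 1A8I, a non-empty result containing LLP would indicate the system should not be added.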

II. Preliminary File Preparation

The receptor and ligand should be prepared from the initial PDB file as outlined in the student-provided VS tutorials:

    https://ringo.ams.stonybrook.edu/index.php/DOCK_VS_Tutorials

Briefly:

(1) The ligand should have hydrogens added, with careful attention to protonation state, and Gasteiger charges added. The extracted ligand is saved as "${pdb_id}.lig.moe.mol2". If the ligand belongs to a protein family and is part of a congeneric series already existing in that family, the protonation should be consistent with the family unless literature suggests otherwise.

(2) The receptor is saved in isolation. The biologically relevant oligomerization state of the protein should be saved; for example, for HIV Reverse Transcriptase the dimer should be saved. Water molecules are generally deleted, except for specialized experiments. Any monoatomic ions within 8 Angstroms of the ligand are retained as part of the receptor. The protocols have prep and frcmod files integrated for Heme cofactors, so these should also be retained as part of the receptor. Once the receptor, metal ions, and heme cofactor are extracted, they are saved together as "${pdb_id}.rec.foramber.pdb".

(3) Any biologically relevant organic small molecule cofactor (such as NADH / NADPH) other than Heme should be treated as a ligand (1) and saved as "${pdb_id}.cof.moe.mol2".

(4) Files from (1), (2) and (3) should be moved to the folder zzz.master in the upper folder Testset_Protocols from the github repo.

For further clarification, review the student tutorials as indicated.

III. Scripts for System Preparation

Each script should process the system in a fully automated manner. These scripts are located at:

    https://github.com/rizzolab/Testset_Protocols.git

There are key elements to check after each script before proceeding to the next. The output to be checked from each step is written to a folder "zzz.outfiles" with a unique file for each system, at each step. At the top of this file is the netid of the user who ran each process.

These scripts are designed to run in parallel on an HPC. Each step below has an approximate timing indicated. If running 100 structures on a step with a timing of 10 minutes, at least 1000 cpu-minutes should be requested. These are averages, and individual systems may take much less or more time.
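The resource estimate above is simple arithmetic, sketched here for convenience. The function names and the idea of dividing by a core count to estimate wall time are illustrative assumptions, not something taken from the submit scripts.

```python
import math

def cpu_minutes_needed(n_systems, minutes_per_system):
    """Total cpu-minutes for one step: systems times average timing."""
    return n_systems * minutes_per_system

def wall_minutes(n_systems, minutes_per_system, n_cores):
    """Approximate wall time if the systems are spread evenly over n_cores."""
    return math.ceil(cpu_minutes_needed(n_systems, minutes_per_system) / n_cores)

# 100 structures on a step with Timing=10 minutes.
print(cpu_minutes_needed(100, 10))
```

Because the timings are averages, it is prudent to request somewhat more than this minimum.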

III.A. Sourcing Variables

There are a few variables that need to be set before running preparation scripts.

    vi run.000.set_env_vars.sh

The following parameters should be set to the most recent version:

    DOCKHOMEWORK
    AMBERHOMEWORK

Every time a new session is started the following command needs to be run:

    source run.000.set_env_vars.sh

III.B. Ligand Preparation

    Timing=10 minutes

This step assigns am1bcc charges based on the charge in the preliminary file ${pdb_id}.lig.moe.mol2. The result is saved as ${pdb_id}.lig.am1bcc.mol2.

DOCK6 is required for this step so load appropriate compilers:

   module load intel-stack

In the script submit.run.001.sh, if using a list of PDB codes other than "zzz.lists/clean.systems.all", change this in the script first.

    sbatch submit.run.001.sh

After the job has completed check to see if any of the ligands did not process successfully.

    bash run.001b.check.sh

Any ligands or cofactors which did not pass will be listed in:

    zyy.01.lig_missing.dat / zyy.01.cof_missing.dat

The output for each system is written to file with the netid of the user at the top:

    vi zzz.outfiles/${pdb_id}.001.lig.out

These files should be checked for the phrase Fatal. Its presence may mean the ligand failed or that there is another issue. It does not mean with certainty that the molecule is wrong, but the molecule should be inspected.

    grep "Fatal" zzz.outfiles/*001.lig.out

Output with Fatal Error (yellow) and user netid (top)

Some cofactors require unexpected bond orders to pass antechamber. If there is a correct version of the cofactor with the same res id such as "NAP" in zzz.master which passes antechamber, use this as a template.

    python zzz.scripts/bond_order_diff.py ${pdb_id_1}.cof.moe.mol2 ${pdb_id_2}.cof.moe.mol2

This assumes atom names are equivalent between the two mol2 files (hydrogens can be disregarded). Any bonds with different orders are printed, preserving the order of input at the command line (first argument prints on the left, second argument prints on the right). The script does not fix the bond orders for you, but helps to quickly identify what must change to match the template.

Output from bond_order_diff.py shows difference in bond order
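The comparison described above can be sketched as follows. This is an illustrative reimplementation under stated assumptions, not the code of zzz.scripts/bond_order_diff.py: it assumes standard Tripos mol2 ATOM and BOND sections, matches bonds by atom name, and skips bonds to hydrogen. The function names are hypothetical.

```python
def read_mol2_bonds(mol2_text):
    """Map frozenset({name_a, name_b}) -> bond type for heavy-atom bonds."""
    section = None
    names, types, bonds = {}, {}, {}
    for line in mol2_text.splitlines():
        if line.startswith("@<TRIPOS>"):
            section = line.strip()[len("@<TRIPOS>"):]
            continue
        fields = line.split()
        if section == "ATOM" and len(fields) >= 6:
            names[fields[0]] = fields[1]   # atom id -> atom name
            types[fields[0]] = fields[5]   # atom id -> Sybyl atom type
        elif section == "BOND" and len(fields) >= 4:
            a, b, order = fields[1], fields[2], fields[3]
            # Disregard bonds that involve hydrogen.
            if types[a].split(".")[0] == "H" or types[b].split(".")[0] == "H":
                continue
            bonds[frozenset((names[a], names[b]))] = order
    return bonds

def bond_order_diff(mol2_text_1, mol2_text_2):
    """List (atom pair, order in file 1, order in file 2) where orders differ."""
    b1, b2 = read_mol2_bonds(mol2_text_1), read_mol2_bonds(mol2_text_2)
    diffs = []
    for pair in sorted(b1.keys() & b2.keys(), key=sorted):
        if b1[pair] != b2[pair]:
            diffs.append((tuple(sorted(pair)), b1[pair], b2[pair]))
    return diffs
```

Each returned tuple keeps the first file's bond order on the left and the second file's on the right, mirroring the command-line order.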

Other common problems are with carboxylic acid and nitro functional groups. It is best to assign these functional groups "ar" bond types.

(NOTE TO EDITOR- Mention swapaa for alanine script and include on github)

III.C. Receptor Preparation

    Timing=1 minute

    sbatch submit.run.002.sh

After completion check for missing systems:

    bash run.002b.check.sh

Any complexes which did not pass will be listed in:

    zyy.02.rec_missing.dat

Check the output file for any large RMSD from amber minimization. Restraints in this step are set high so any RMSD > 0.5 Angstroms should be considered dubious. Any structure RMSD > 2.0 Angstroms should be rejected from the test set.

RMSD of Protein, Ligand and Complex for single system
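The two RMSD thresholds above can be expressed as a small helper. This sketch assumes the RMSD values have already been read from the output file, whose exact format is not reproduced here; the function name and labels are hypothetical.

```python
def classify_minimization_rmsd(rmsd):
    """Apply the tutorial's thresholds (in Angstroms) to one RMSD value."""
    if rmsd > 2.0:
        return "reject"    # should be rejected from the test set
    if rmsd > 0.5:
        return "dubious"   # restraints are high, so this deserves inspection
    return "ok"
```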

Due to an error in an Amber translation table, P.3 atoms are converted to C.3 atoms in cofactors when written to ${pdb_id}.rec.clean.mol2. This is currently corrected manually, but is to be corrected in Amber23. Affected cofactors include, but are not restricted to, "NAP", "NAD" and "NDP". If not using Amber23, please check residue "COF" in ${pdb_id}.rec.clean.mol2 and fix these cases.

Atom type of P1 is C.3 when should be P.3
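A quick scan for this problem might look like the sketch below. It is not part of the repository; it assumes that phosphorus atoms have names beginning with "P" (as in P1 above), that the cofactor substructure name begins with "COF", and that the file follows the standard Tripos mol2 ATOM record layout. The function name is hypothetical.

```python
def find_mistyped_phosphorus(mol2_text, resname="COF"):
    """Return (atom_id, atom_name) for cofactor atoms named P* but typed C.3."""
    hits = []
    in_atoms = False
    for line in mol2_text.splitlines():
        if line.startswith("@<TRIPOS>"):
            in_atoms = line.strip() == "@<TRIPOS>ATOM"
            continue
        fields = line.split()
        if in_atoms and len(fields) >= 8:
            atom_id, name, atom_type, subst = fields[0], fields[1], fields[5], fields[7]
            if subst.startswith(resname) and name.startswith("P") and atom_type == "C.3":
                hits.append((int(atom_id), name))
    return hits
```

Any atom reported should be inspected and, if it is indeed phosphorus, have its type set back to P.3 by hand.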

III.D. Sphere Generation

    Timing=1 minute

    sbatch submit.run.003.sh

After completion check for missing systems:

    bash run.003b.check.sh

Any systems which did not pass will be listed in:

    zyy.03.sph_missing.dat

III.E. Single Grid Generation

    Timing=15 minutes

    sbatch submit.run.004.sh

After completion check for missing systems:

    bash run.004b.check.sh

Any systems which did not pass will be listed in:

    zyy.04.grd_missing.dat

It should be checked that monoatomic ions were successfully integrated into the grid. Changes in naming could cause issues.

    vi ${pdb_id}/004.grid/grid.out

For example, there are 2 Zinc ions expected in the structure 5UPG:

Zinc atoms are present in grid output

III.F. Multi Grid Generation

    Timing=15 minutes

    sbatch submit.run.005.sh

After completion check for missing systems:

    bash run.005b.check.sh

Any systems which did not pass will be listed in:

    zyy.05.mgr_missing.dat

III.G. Bookkeeping of Heavy Atoms

The bookkeeping scripts compare the count of heavy atoms in the initial files with the count in the final processed files. Some changes to the receptor during processing are accounted for: (I) heavy atoms added to incomplete residues, (II) dual occupancy atoms deleted, (III) defined mutation from a non-standard residue to a standard residue. If all atoms are accounted for, the Overall Balance should be 0. Any system that is unbalanced should be inspected.

A minority of systems are false positives with an Overall Balance not equal to 0: (I) monoatomic ions other than Zn, Mg, Ca come up unbalanced, (II) some PDB files have merged columns (i.e., the residue number and x-coordinate have no space in between), (III) "N" in Heme creates an unaccounted change of -4, (IV) some non-standard residues outside the active site are mutated to alanine, which is counted as an unaccounted change.
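The balance idea can be illustrated with a short calculation. The formula below is inferred from the description in this tutorial and is an assumption; the actual bookkeeping scripts may compute it differently, and the function and parameter names are hypothetical.

```python
def overall_balance(initial_counts, final_counts,
                    added_to_incomplete=0, dual_occupancy_deleted=0,
                    mutation_deleted=0):
    """Net heavy-atom change (final minus initial) with accounted changes
    removed. A result of 0 means every change is accounted for."""
    elements = set(initial_counts) | set(final_counts)
    net_change = sum(final_counts.get(e, 0) - initial_counts.get(e, 0)
                     for e in elements)
    accounted = added_to_incomplete - dual_occupancy_deleted - mutation_deleted
    return net_change - accounted
```

For example, one oxygen added to an incomplete sidechain gives a net change of +1 and an accounted change of +1, so the balance is 0.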

To crosscheck with the previously built spreadsheet (to be released):

    UCSF Chimera must be loadable as a module to run bookkeeping scripts! 
    However any method to convert pdb to mol2 can be used instead.

Run the following 3 commands in order, submitting each only after the previous one has finished.

    sbatch run.006.bookkeeping.sh 
    sbatch run.006b.merge_bookkeeping.sh 
    sbatch run.007.system_stats.sh

Create a version of zzz.lists/clean.systems.all which has the header "sys":

    vi zzz.lists/clean.systems.all.header

    sys
    PDB1
    PDB2
    PDB3

The 3 files zzz.lists/clean.systems.all.header, tmp1.csv and atmcounts_all.csv should all have the same number of lines.
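The line-count check can be done by hand or with a few lines of code. The helper below is a sketch with a hypothetical name; it accepts any iterables of lines, such as open file objects.

```python
def line_counts_match(sources):
    """Given {label: iterable of lines}, return (all_equal, {label: count})."""
    counts = {label: sum(1 for _ in lines) for label, lines in sources.items()}
    return len(set(counts.values())) == 1, counts
```

In practice the three files named above would be opened and passed in, for example {"header": open("zzz.lists/clean.systems.all.header"), ...}; a False result means the merge should not be run yet.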

   bash run.008.merge_spreadsheet.sh

This produces the final bookkeeping spreadsheet.

III.H. Interpreting Bookkeeping Spreadsheet

NOTE TO AUTHOR - Bookkeeping corrected for Na and Cl but still has issue with Heme, and defined mutation of nonstandard residue. Try to fix these.

The first columns that should be checked are "Cofactor Present Prep", "Cofactor Present Rec" and "Cofactor Present Grid". These are associated with the files ${pdb_id}.cof.moe.mol2, ${pdb_id}.rec.clean.mol2 and grid.out, respectively. If the 3 columns do not all agree for a given system, something is wrong.

Spreadsheet sample with Cofactor columns
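The agreement check on the three cofactor columns can be automated. This sketch assumes the spreadsheet is available as CSV text and that the system identifier is in a column named "sys" (taken from the header file above; the real spreadsheet may label it differently). The function name is hypothetical.

```python
import csv
import io

COFACTOR_COLUMNS = (
    "Cofactor Present Prep",
    "Cofactor Present Rec",
    "Cofactor Present Grid",
)

def cofactor_mismatches(csv_text, system_column="sys"):
    """Return the systems whose three cofactor columns do not all agree."""
    mismatched = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        values = {row[column].strip() for column in COFACTOR_COLUMNS}
        if len(values) > 1:
            mismatched.append(row[system_column])
    return mismatched
```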

Next sort from largest to smallest on the "Minimized Lig RMSDh" column. This is associated with the file ${pdb_id}.lig.python.min.mol2. If any system has an RMSD > 2.0 Angstroms, first check that the ligand and receptor zzz.master files are both in the correct frame. If there is no such error in preparation, the system should be rejected from the test set.

Spreadsheet sample with Minimized ligand RMSD

Likewise, now sort from largest to smallest on the "DCEsum" column. If any system has a positive score, treat it as in the previous step (check for errors, otherwise reject).
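Both sorting checks can be combined in one pass. The sketch below assumes rows have been loaded as dictionaries keyed by column name (for example with csv.DictReader) and that the system identifier column is named "sys"; that column name and the function name are assumptions.

```python
def flag_systems(rows, rmsd_cutoff=2.0):
    """Return (system, reasons) for rows with ligand RMSD above the cutoff
    or a positive DCEsum, ordered by RMSD from largest to smallest."""
    flagged = []
    by_rmsd = sorted(rows, key=lambda r: float(r["Minimized Lig RMSDh"]),
                     reverse=True)
    for row in by_rmsd:
        reasons = []
        if float(row["Minimized Lig RMSDh"]) > rmsd_cutoff:
            reasons.append("ligand RMSD")
        if float(row["DCEsum"]) > 0:
            reasons.append("positive DCEsum")
        if reasons:
            flagged.append((row["sys"], reasons))
    return flagged
```

Flagged systems still need the manual check described above before being rejected.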

Below is an example of a balanced system. The Overall Balance column is "0", indicating no unaccounted changes. In the first column, "O 1" indicates there is 1 more Oxygen atom in ${pdb_id}.rec.clean.mol2 than in the initial file ${pdb_id}.rec.foramber.pdb. This atom was added to an incomplete sidechain by Amber, and is an accounted change.

Spreadsheet sample with Oxygen added to incomplete sidechain

In the next example there are 21 Carbon, 7 Nitrogen, 14 Oxygen and 2 Phosphorus atoms missing from ${pdb_id}.rec.clean.mol2 that were originally in ${pdb_id}.rec.foramber.pdb (including ${pdb_id}.cof.moe.mol2). Overall 44 atoms are missing. There are no accounted changes in the next 3 columns ("Heavy atoms added to incomplete residue", "Dual Occupancy atoms" or "Atoms deleted in defined mutations"). Thus the overall balance is "-44". This indicates an issue with preparation. In this case the cofactor was not included in the final mol2.

Spreadsheet sample with Missing NADH cofactor

In this last case there are multiple accounted changes which take place and are balanced overall. However, Na and Cl ions do not balance at the moment (notice Na -1 in the first column). This is why the overall balance is "-1".

Spreadsheet sample with Accounted changes but false positive on Na

IV. Test Cases for Preparation Integrity

These are cases that need to be visually inspected to ensure key operations during preparation were executed. This step is crucial after building the test set in batch mode.

System	Element	Details
1QCF	Phosphotyrosine Residue	Check that the residue is correctly Y2P Phosphotyrosine (See Picture below)
2Y03	Disulfide Bond	Check that Residue 82.SG and Residue 167.SG are actually bonded (See Picture below)
4WMZ	Heme	Check Heme is correctly incorporated, including Iron (Fe)
1P44	NADH Cofactor	Make sure cofactor is present in 1P44.rec.clean.mol2 and grid.out as residue "COF"
5UPG	Zinc Chelated Histidine Protonation	Residue should have HID protonation (See Picture below)
3D94	NonStandard Residue Defined Mutation	Make sure Residue 146 successfully mutates to Residue "MET" from "MHO"
1JWT	Long Bonds	Make sure TER is added before long bond in 1JWT.rec.clean.pdb (Do not check mol2). This system should have 4 TER.

-

2GQG with Phosphotyrosine

2Y03 Disulfide Bond

5UPG Histidine Correct Protonation

3D94 MHO to MET mutation

-

V. Organizing and Protecting Files

After the testset has been verified on all metrics discussed above, the relevant files for docking can be extracted and put in a minimal test set.

    bash run.009.transfer_files.sh

Then move the resulting directory to the Rizzo Lab project space:

    mv zzz.SB20XX_Testset /rizzo/lab/project/space

The original prepared files in zzz.master should be left accessible, and should be available to propagate for future versions of the set.

    cp -r zzz.master /rizzo/lab/project/space/zzz.SB20XX_Testset

The outfiles for each build should also be left accessible.

    cp -r zzz.outfiles /rizzo/lab/project/space/zzz.SB20XX_Testset

All original build files should be retained for future reference. Do not delete these files.

    tar -czvf Name-of-tar-file.tar.gz /path/to/testset_scripts 
    (Directory you have been running all build scripts in)

After this has been completed move it to the minimal test set:

    mv Name-of-tar-file.tar.gz /rizzo/lab/project/space/zzz.SB20XX_Testset

Once a testset has been prepared for use by the lab, set all files to be read only. However directories must remain executable:

    chmod -R ugo-wx+Xr /rizzo/lab/project/space/zzz.SB20XX_Testset

VI. Notes on Different Folders

zzz.distribution is a copy of only the files necessary for docking, to be made available in the Downloads section with each major update of the set: $sys.rec.clean.mol2, box.pdb, $sys.lig.am1bcc.mol2, $sys.rec.clust.close.sph

zzz.family_lists contains all families with at least 4 systems that share the same Uniprot Recommended Name and meet the alignment criteria.

zzz.master contains all initial files, in which the original PDB has been split into its individual components, unretained atoms have been deleted, and the ligand and cofactor have been protonated.

zzz.original_pdb contains the original PDB files fetched from the RCSB page without editing. As of now all were fetched on the date specified in the README; it may be worth adding the fetch date of any later-added system to the README.

zzz.outfiles contains the output that system processing would print to the terminal screen, saved in files for each step.

zzz.swapped_res indicates which residues have been mutated (due to incomplete sidechains). This is not exhaustive, as there is no record from SB2012, and it was only included later in the expansion of SB2025.

zzz.system_movies contains scripts to generate simple Chimera movies to inspect system setup from zzz.master, compared to the original PDB.

zzz.testset_files contains all intermediate files generated during system processing.

VII. Notes on Editing/ Correcting Set

A best practice for correcting any errors in a system is to modify the initial files in zzz.master ($sys.rec.foramber.pdb or $sys.lig.moe.mol2).

Modifying a processed file may lead to errors in processing, due to modifications not expected further upstream.

VIII. To Do

1. Integrate prep and frcmod files for cofactors available at http://amber.manchester.ac.uk/
2. Fix false positives in bookkeeping scripts
3. Add bookkeeping of receptor heavy atoms processed in grid.out 
4. More details on Test set life cycle (Where to store old versions, how to make corrections to current version and document changes ...)

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

Tutorial Written By: Christopher Corbo, Rizzo Lab, Stony Brook University (This tutorial was last updated 02/26/2024)

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

