Difference between revisions of "Virtual Screening Protocol"
(→Clustering using MOE) |
|||
(9 intermediate revisions by 3 users not shown) | |||
Line 1: | Line 1: | ||
− | In this document, the current Rizzo group protocol will be described in detail. This protocol has be through successive iterations and has be used to select compounds for Flu and HIVgp41 ( | + | To Do List for new VS-protocol: |
+ | ==zinc15 library== | ||
+ | Determine how to best make chunks | ||
+ | make new chunks for zinc15 | ||
+ | ==reference minimization== | ||
+ | turn on reference minimization | ||
+ | not minimize the reference nine times (one for each scoring function) | ||
+ | Put the vs-protocol on git | ||
+ | |||
+ | In this document, the current Rizzo group protocol will be described in detail. This protocol has be through successive iterations and has be used to select compounds for Flu and HIVgp41 (number of purchased compounds for HIV = 112). | ||
==ZINC Database== | ==ZINC Database== | ||
[http://zinc.docking.org/ ZINC] is a free, online database of commercially-available compounds. In June 2009, the "big" vendor libraries at a pH of 7 were downloaded and processed into chunks of approximately 100,000 molecules each. The ZINC molecules come with AMSOL charges already determined. | [http://zinc.docking.org/ ZINC] is a free, online database of commercially-available compounds. In June 2009, the "big" vendor libraries at a pH of 7 were downloaded and processed into chunks of approximately 100,000 molecules each. The ZINC molecules come with AMSOL charges already determined. | ||
− | These chunks are then processed through MOE and sorted by rotatable bonds in ascending order. These processed chunks are available on BG, path=~pholden/RCR/projects_ZINC8/ZINC8, or on ringo, path=/media/sdb1/pholden/ZINC8 | + | These chunks are then processed through MOE and sorted by rotatable bonds in ascending order. These processed chunks are available on BG, path=~pholden/RCR/projects_ZINC8/ZINC8, or on ringo, path=/media/sdb1/pholden/ZINC8. The lab is currently using the ChemDiv library for its virtual screen. |
==Downloading and Preparation of Receptor== | ==Downloading and Preparation of Receptor== | ||
Line 99: | Line 108: | ||
1. It first combines separate chunks into one larger chunk and ranks them, e.g. Combining ChemDiv sets 1-5 into one larger set. | 1. It first combines separate chunks into one larger chunk and ranks them, e.g. Combining ChemDiv sets 1-5 into one larger set. | ||
2. It skims off a user-selected number of molecules to be clustered, e.g. 50,000. This ensures that all molecules in the clustering stage are within the top percentage of the virtual screen. | 2. It skims off a user-selected number of molecules to be clustered, e.g. 50,000. This ensures that all molecules in the clustering stage are within the top percentage of the virtual screen. | ||
− | 3. | + | 3. It then runs MOE, which clusters the molecules. |
/space1/goyal/RCR/PROGRAMS/moe/bin/moebatch -exec "\ | /space1/goyal/RCR/PROGRAMS/moe/bin/moebatch -exec "\ | ||
Line 134: | Line 143: | ||
==Compound Selection== | ==Compound Selection== | ||
+ | During compound selection, the molecules that will be purchased for further experimentation will be selected. In the previous step, a series of multimol2 files and corresponding csv files were produced; these different files are rank-ordered by different descriptors. | ||
+ | |||
+ | Steps: | ||
+ | 1. Load the receptor file into chimera. If you wish, determine the receptors surface and also load in the reference molecule. | ||
+ | |||
+ | 2. Load a mol2 into chimera using the viewdock function. | ||
+ | |||
+ | 3. Using the viewdock function, load the different possible descriptors into the window for easy viewing. | ||
+ | |||
+ | Concurrent to these steps, load the csv into excel. | ||
+ | |||
+ | 4. Select molecules. Record wish molecules you wish to examine further. | ||
+ | |||
+ | 5. Take the list of ZINCIDS and load them into the ZINC search function. Ensure that these molecules are still in the ZINC database. Note: as the database we are using gets older, more and more molecules will not be in the database. | ||
+ | |||
+ | 6. The molecules in ZINC may have a link to an entry in PubChem. Examine each of these to determine if there are any known activities or toxicology issues. Note: using the cdiv database dramatically reduce the necessity of this as the cdiv database is possibly prescreened for possible toxicology issues. | ||
+ | |||
+ | 7. Using the list generated in ZINC, use the "Purchase Molecules" button at the top to generate an excel file containing all of the purchasing information for these molecules: namely, the vendors and their respective ids. | ||
+ | |||
+ | 8. Sort by vendor in the excel file, and match the vendor ids to their respective zincids. | ||
+ | |||
+ | 9. For cdiv, they use eMolecules.com as their up-to-date supply information. Load the vendor ids into eMolecules to determine if the molecule is in stock. | ||
+ | |||
+ | 10. Put together a PO. | ||
+ | |||
+ | ==Substructure Searching on Experimentally Determined Hits== | ||
+ | Once an experimental hit is obtained, a substructure similarity search is undertaken. Currently, we are trying a number of approaches. | ||
+ | |||
+ | First, one should obtain the cluster members, which can be easily obtained from the "families" csv file. This list is not exhaustive, but only from the top 10% of the library that you screened. To make it a complete list, the following step is taken. | ||
+ | |||
+ | Load the entire cdiv library into MOE, determine the molecular fingerprints for each database entry, and then use the hits as queries to find similar molecules. These will then be docked into the receptor, and if the molecules have a similar binding mode, purchased for further testing. If you have already done a complete docking of the vendor library, you can obtain the dock results from the original dock result mol2 files using Rashi's "run.retrieve_mol2.py" file, found on herbie:/space1/goyal/RCR/projects_BG/screening/scripts. |
Latest revision as of 11:10, 4 December 2018
To Do List for new VS-protocol:
Contents
- 1 zinc15 library
- 2 reference minimization
- 3 ZINC Database
- 4 Downloading and Preparation of Receptor
- 5 Preparation of Reference Molecule for Footprint Rescoring
- 6 DOCKing
- 7 Minimization of Poses
- 8 Pose Rescoring using Molecular Footprints
- 9 Clustering using MOE
- 10 Compound Selection
- 11 Substructure Searching on Experimentally Determined Hits
zinc15 library
Determine how to best make chunks make new chunks for zinc15
reference minimization
turn on reference minimization not minimize the reference nine times (one for each scoring function) Put the vs-protocol on git
In this document, the current Rizzo group protocol will be described in detail. This protocol has be through successive iterations and has be used to select compounds for Flu and HIVgp41 (number of purchased compounds for HIV = 112).
ZINC Database
ZINC is a free, online database of commercially-available compounds. In June 2009, the "big" vendor libraries at a pH of 7 were downloaded and processed into chunks of approximately 100,000 molecules each. The ZINC molecules come with AMSOL charges already determined.
These chunks are then processed through MOE and sorted by rotatable bonds in ascending order. These processed chunks are available on BG, path=~pholden/RCR/projects_ZINC8/ZINC8, or on ringo, path=/media/sdb1/pholden/ZINC8. The lab is currently using the ChemDiv library for its virtual screen.
Downloading and Preparation of Receptor
Preparation of Reference Molecule for Footprint Rescoring
The reference molecule is the molecule whose footprint guides the selection of molecules during footprint rescoring. The reference molecule will most likely be a native substrate or known inhibitor. The molecule should be prepared with charges unless it already has charges. These charges can be applied using MOE or antechamber (AmberTools). The charge model selected is not of great importance, as per Sudipto's testset paper.
Once the molecule is prepared, it should be minimized in the receptor using a tethered minimization.
An example input file:
ligand_atom_file ligand.mol2 limit_max_ligands no skip_molecule no read_mol_solvation no calculate_rmsd yes use_rmsd_reference_mol no use_database_filter no orient_ligand no use_internal_energy yes internal_energy_rep_exp 12 flexible_ligand no bump_filter no score_molecules yes contact_score_primary no contact_score_secondary no grid_score_primary no grid_score_secondary no dock3.5_score_primary no dock3.5_score_secondary no continuous_score_primary yes continuous_score_secondary no cont_score_rec_filename receptor.mol2 cont_score_att_exp 6 cont_score_rep_exp 12 cont_score_rep_rad_scale 1 cont_score_use_dist_dep_dielectric yes cont_score_dielectric 4.0 cont_score_vdw_scale 1 cont_score_es_scale 1 gbsa_zou_score_secondary no gbsa_hawkins_score_secondary no amber_score_secondary no minimize_ligand yes simplex_max_iterations 1000 simplex_tors_premin_iterations 0 simplex_max_cycles 1 simplex_score_converge 0.1 simplex_cycle_converge 1.0 simplex_trans_step 1.0 simplex_rot_step 0.1 simplex_tors_step 10.0 simplex_random_seed 0 simplex_restraint_min yes simplex_restraint_coefficient 10.0 atom_model all vdw_defn_file /sbhome0/sudipto/RCR/projects_BNL/parameters/vdw_AMBER_parm99.defn flex_defn_file /sbhome0/sudipto/RCR/projects_BNL/parameters/flex.defn flex_drive_file /sbhome0/sudipto/RCR/projects_BNL/parameters/flex_drive.tbl ligand_outfile_prefix xtalmin write_orientations no num_scored_conformers 1 rank_ligands no
DOCKing
The version of DOCK currently being used for virtual screening is dock6_09-09-08.footprint. The most up to date script is Trent's modifications, which include the database filter. The database filter removes molecules with charges greater than +2 and less than -2, and also molecules with greater than 15 rotatable bonds.
The script also takes the processed chunks from ZINC and splits them into two subsections: the first 60,000 and the remainder. Because the sets are sorted by rotatable bonds, the first 60,000 should dock in fairly quickly, and be completed within the allowable wallclock limit. The second set, or molecule 60,001 and beyond, will have higher numbers of rotatable bonds, and take exponentially more time for each molecule. As per Sudipto's testset paper, we also have less confidence in the molecules with more rotatable bonds. If this second job does not finish, DO NOT restart it. The molecules that did not dock in will have higher numbers of rotatable bonds (should this be revised in light of the database filtering?)
Note that the values for the virtual screen are the default values.
Path for script: ~balius/RCR/projects_BG/screening/run.dock2grid.max_lig_tebmod.csh
Minimization of Poses
In this step, the two multimol2s generated from the virtual screen step are recombined and minimized on the continuous receptor. This step removes artifacts from the lower resolution grid, generally resolving clashes with sidechains. The minimization will be tethered with a 10kcal/mol restraint, such that with each step of minimization, the molecule cannot move too much. The original DOCK pose is more likely to be kept with this method.
Path of script: ~pholden/RCR/projects_BG/screening/HIV/run.minoffgrid.csh
Pose Rescoring using Molecular Footprints
In this step, the minimized molecules are rescoring using the molecular footprints. The footprint is a per-residue decomposition of the interaction energies between the ligand and the receptor. These energies will then be compared to a reference molecule's footprint, prepared earlier, and a numerical comparison between the two will be determined. The possible comparison techniques include a Pearson Correlation, Euclidean, Normalized Euclidean, and a threshold based method, where only values exceeding a certain energy will be compared. The energies compared are van der waal's energy (vdw_fp), electrostatic energy (es_fp), and the number of hydrogen bonds (hb_fp). Combinations of the footprints can also be determined.
The script simply reads in a multimol2 and a reference and reports the per-residue decomposed energies and the actual footprint value.
For Pearson, values close to 1 are best. For Euclidean, values close to 0 are best.
Path: ~pholden/RCR/projects_BG/screening/HIV/run.footprint.csh
Clustering using MOE
The function of this step is two-fold. First, the molecules will be clustered together using MACCS fingerprints, and the top-scoring molecule within a given cluster, or clusterhead, will be retained. By clustering the molecules together by MACCS fingerprints, the rank-ordered lists will have increased chemical diversity by removing similar molecules from the set. This, in turn, will allow the user to see and choose more chemically diverse molecules, decreasing risk during the subsequent step of purchasing and testing molecules.
The multimol2 output from the footprint rescoring step should be moved to George, or some similarly powerful machine. Teh clustering will be performed by MOE using the following script: <Trent's path for this script>. Note, there are two different scripts depending on if the Pearson Correlation was used or if Euclidean distance was used during the footprinting.
The scripts performs a couple of functions: 1. It first combines separate chunks into one larger chunk and ranks them, e.g. Combining ChemDiv sets 1-5 into one larger set. 2. It skims off a user-selected number of molecules to be clustered, e.g. 50,000. This ensures that all molecules in the clustering stage are within the top percentage of the virtual screen. 3. It then runs MOE, which clusters the molecules. /space1/goyal/RCR/PROGRAMS/moe/bin/moebatch -exec "\ mdb2 = db_Open ['$vendor.$combinedset.desc.mdb','create'];\ db_CreateField ['$vendor.$combinedset.desc.mdb', 'mol','molecule'];\ db_CreateField ['$vendor.$combinedset.desc.mdb', 'Weight','float'];\ db_CreateField ['$vendor.$combinedset.desc.mdb', 'b_rotN','int'];\ db_CreateField ['$vendor.$combinedset.desc.mdb', 'lip_don','int'];\ db_CreateField ['$vendor.$combinedset.desc.mdb', 'lip_acc','int'];\ db_CreateField ['$vendor.$combinedset.desc.mdb', 'lip_druglike','int'];\ db_CreateField ['$vendor.$combinedset.desc.mdb', 'lip_violation','int'];\ db_CreateField ['$vendor.$combinedset.desc.mdb', 'SlogP','float'];\ db_CreateField ['$vendor.$combinedset.desc.mdb', 'FCharge','int'];\ db_CreateField ['$vendor.$combinedset.desc.mdb', 'logS','float'];\ db_ImportMOL2 ['$vendor.$combinedset.footprint_ranked_${max_num}.mol2','$vendor.$combinedset.desc.mdb','mol'];\ db_ExtractMoleculeName [ mdb2, 'name'];\ db_sm_Extract [mdb2, 'SMILES'];\ codes = [ 'Weight', 'b_rotN', 'lip_don', 'lip_acc', 'lip_druglike', 'lip_violation', 'SlogP', 'FCharge','logS'];\ codes_2 = [ 'name', 'SMILES','CLUSTER_NO','Weight', 'b_rotN', 'lip_don', 'lip_acc', 'lip_druglike', 'lip_violation', 'SlogP', 'FCharge','logS'];\ QuaSAR_DescriptorMDB ['${vendor}.$combinedset.desc.mdb','mol', codes ];\ ph4_FingerprintMDB ['$vendor.$combinedset.desc.mdb', 'mol', 'FP:BIT_MACCS', 0];\ ph4_ClusterMDB ['$vendor.$combinedset.desc.mdb', 'FP:BIT_MACCS' , 'tanimoto' , [ overlap:0.75, sim:0.75, cfield:'CLUSTER_NO', esel:0, mfield:'mol'] ];\ db_Close mdb2;\ mdb3 = db_Open ['$vendor.$combinedset.desc.mdb','read'];\ entries = db_Entries '$vendor.$combinedset.desc.mdb';\ db_ExportASCII ['$vendor.$combinedset.desc.mdb', '$vendor.$combinedset.descriptors.csv', codes_2 , entries [delimiter:',',quotes:0,titles:1]];\ db_Close mdb3" Note that tihs script uses a tanimoto similarity coefficient of 0.75 during the clustering. 4. The script then combines two csvs, one generated from DOCK, which includes the various descriptors from the virtual screen, and one from MOE, which includes the various properties determined by MOE. 5. Finally, this script creates ranked ordered lists for the various ranking methods: total score, vdw_fp, es_fp, hbonds, hb_fp, ligand efficiency, and sum1_fp (vdw+es).
Finally, one must run the run.retrieve_mols.new.csh script <path>, which takes the rank-ordered lists produced in the previous step and creates lists that show just the top 500 clusterheads and the top 500 families (all the members from a given cluster based on the clusterhead position). These will be used in the compound selection stage.
Compound Selection
During compound selection, the molecules that will be purchased for further experimentation will be selected. In the previous step, a series of multimol2 files and corresponding csv files were produced; these different files are rank-ordered by different descriptors.
Steps: 1. Load the receptor file into chimera. If you wish, determine the receptors surface and also load in the reference molecule.
2. Load a mol2 into chimera using the viewdock function.
3. Using the viewdock function, load the different possible descriptors into the window for easy viewing.
Concurrent to these steps, load the csv into excel.
4. Select molecules. Record wish molecules you wish to examine further.
5. Take the list of ZINCIDS and load them into the ZINC search function. Ensure that these molecules are still in the ZINC database. Note: as the database we are using gets older, more and more molecules will not be in the database.
6. The molecules in ZINC may have a link to an entry in PubChem. Examine each of these to determine if there are any known activities or toxicology issues. Note: using the cdiv database dramatically reduce the necessity of this as the cdiv database is possibly prescreened for possible toxicology issues.
7. Using the list generated in ZINC, use the "Purchase Molecules" button at the top to generate an excel file containing all of the purchasing information for these molecules: namely, the vendors and their respective ids.
8. Sort by vendor in the excel file, and match the vendor ids to their respective zincids.
9. For cdiv, they use eMolecules.com as their up-to-date supply information. Load the vendor ids into eMolecules to determine if the molecule is in stock.
10. Put together a PO.
Substructure Searching on Experimentally Determined Hits
Once an experimental hit is obtained, a substructure similarity search is undertaken. Currently, we are trying a number of approaches.
First, one should obtain the cluster members, which can be easily obtained from the "families" csv file. This list is not exhaustive, but only from the top 10% of the library that you screened. To make it a complete list, the following step is taken.
Load the entire cdiv library into MOE, determine the molecular fingerprints for each database entry, and then use the hits as queries to find similar molecules. These will then be docked into the receptor, and if the molecules have a similar binding mode, purchased for further testing. If you have already done a complete docking of the vendor library, you can obtain the dock results from the original dock result mol2 files using Rashi's "run.retrieve_mol2.py" file, found on herbie:/space1/goyal/RCR/projects_BG/screening/scripts.