Difference between revisions of "Database Enrichment SB2024 V1 DOCK6.10 A"


Latest revision as of 15:36, 13 March 2024

The purpose of this tutorial is to develop a uniform method for testing ligand enrichment across the Rizzo lab with the DOCK software. Note that any data shown in this tutorial are for example purposes only.

I.Introduction

Ligand enrichment is an experiment used to evaluate how well a docking program can rank experimentally known binders (termed actives) over decoy molecules for a given target. The active and decoy ligands are ideally property matched, meaning each active has decoys with similar physicochemical properties. If the docking program accurately models the interactions between binding site and ligand, the actives should bind more favorably (have a lower energy score) than the decoys.

The three major outcomes for this experiment are early enrichment, random enrichment, and late enrichment. Early enrichment indicates that the active ligands rank ahead of the decoys (the goal for all docking programs). Random enrichment indicates that the docking program cannot differentiate between actives and decoys. Late enrichment indicates that the docking software gives the lowest energy scores to the decoys, which is the worst outcome.
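
As a rough illustration of the three outcomes, the following Python sketch computes the area under a ROC curve from a list of labels ordered from best to worst score. The function names, the 0-to-1 AUC scale, and the 0.05 tolerance around 0.5 are assumptions made for this example; they are not taken from the lab scripts.

```python
def roc_auc(ranked_labels):
    """Return the ROC AUC (0 to 1) for labels ordered best score first.

    Every decoy moves the curve one step to the right at the current height,
    so it contributes the number of actives already ranked above it.
    """
    n_actives = sum(1 for label in ranked_labels if label == "Active")
    n_decoys = len(ranked_labels) - n_actives
    if n_actives == 0 or n_decoys == 0:
        raise ValueError("need at least one active and one decoy")
    actives_seen = 0
    area = 0
    for label in ranked_labels:
        if label == "Active":
            actives_seen += 1
        else:
            area += actives_seen
    return area / (n_actives * n_decoys)


def describe(auc, tolerance=0.05):
    """Label an AUC as early, random, or late enrichment.

    The tolerance is an arbitrary choice for this illustration.
    """
    if auc > 0.5 + tolerance:
        return "early"
    if auc < 0.5 - tolerance:
        return "late"
    return "random"


# Toy rankings (invented for illustration only).
early = ["Active", "Active", "Decoy", "Decoy"]
mixed = ["Active", "Decoy", "Decoy", "Active"]
late = ["Decoy", "Decoy", "Active", "Active"]

for name, ranking in (("early", early), ("mixed", mixed), ("late", late)):
    auc = roc_auc(ranking)
    print(name, auc, describe(auc))
```

An AUC well above 0.5 corresponds to early enrichment, a value near 0.5 to random enrichment, and a value well below 0.5 to late enrichment.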

II.Prepping systems

For Rizzo Lab members, use the most recent version of the lab test set and proceed to step III.

Otherwise, first prepare the ligand, receptor, sphere, and grid files for each DUD-E system using:

    https://ringo.ams.stonybrook.edu/index.php/Test_Set_Tutorial_V1

After this is complete, enter the top-level directory of the test set files:

    cd /path/to/testset

The first step is to create the directories:

    mkdir zzz.DUDE_Files

Create a subdirectory for each system you will run:

    mkdir 1Q4X

Then obtain the active and decoy ligands, which can be found on the Shoichet DUD-E test set website http://dude.docking.org/targets. Once the files are downloaded, move them into the appropriate subdirectory and decompress them with gzip:

    cd 1Q4X
    gzip -d actives_final.mol2.gz 
    gzip -d decoys_final.mol2.gz 
   

Prepare the target receptor either by using the official SB2023 test set files (to be published) or by preparing the receptor associated with the PDB entry using run000 to run004 in https://github.com/rizzolab/Testset_Protocols. Then move the relevant files into the directory ~/testset/1Q4X.

Following all these steps you should have a separate subdirectory for each system with the following files:

    actives_final.mol2
    decoys_final.mol2
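
Before docking, it can be useful to confirm that every system subdirectory contains the ligand files listed above. The Python sketch below is one way to do that; the function name `missing_files` and the example path are hypothetical and not part of the lab scripts.

```python
from pathlib import Path

# Ligand files expected in every system subdirectory (from the list above).
REQUIRED_FILES = ("actives_final.mol2", "decoys_final.mol2")


def missing_files(testset, systems, required=REQUIRED_FILES):
    """Map each system name to the required files it lacks.

    Systems with all required files present are left out of the result.
    """
    problems = {}
    for system in systems:
        system_dir = Path(testset) / system
        missing = [name for name in required if not (system_dir / name).is_file()]
        if missing:
            problems[system] = missing
    return problems


if __name__ == "__main__":
    # Replace the path and system list with your own values.
    for system, names in missing_files("/path/to/testset", ["1Q4X"]).items():
        print(f"{system}: missing {', '.join(names)}")
```

An empty result means every listed system has both ligand files in place.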

III.Docking molecules

Now that the files are ready for docking, a virtual screen will be conducted for the active and decoy ligands separately.

Download the Database Enrichment scripts from:

    https://github.com/rizzolab/Benchmarking_and_Validation

Enter Database Enrichment folder:

    cd Benchmarking_and_Validation/DatabaseEnrichment/

001.submit.sh has an #SBATCH header for submitting to an HPC cluster, such as Seawulf. If you are not using an HPC cluster, delete the #SBATCH lines.

Enter the required parameters in the script:

    testset=" Path to folder with all system subdirectories"
    system_file=" List of systems to run"
       e.g.: 1Q4X
             1BCD
             1SJ0
             ...
    dock=" Path to dock uppermost folder"
    mpi="Yes / No" - do you want to run in parallel
    processes=" Number of processes" - only set if mpi = Yes

Then submit the script with sbatch on an HPC cluster:

    sbatch 001.submit.sh

or run it directly with bash:

    bash 001.submit.sh

After docking has completed, the folder testset/1Q4X will now have the following files, as well as input and output docking files:

    1Q4X_actives.FLX_scored.mol2  
    1Q4X_decoys.FLX_scored.mol2   
    Active_score.txt  
    Decoy_score.txt            
    All_score.txt  
    All_score_sort.txt

All_score_sort.txt will have the list of actives and decoys and their associated ranked scores:

    -105.160493 Decoy
    -105.037376 Active
    -104.870392 Decoy
    -103.900323 Decoy
    -103.186615 Active
    -103.178314 Decoy
    ...
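
The ranked list can be inspected with a few lines of Python. The sketch below assumes each line of All_score_sort.txt holds a score and a label separated by whitespace, as in the excerpt above; the function names are made up for this example.

```python
def parse_scores(lines):
    """Turn 'score label' lines into (float, str) tuples.

    Lines that do not have exactly two fields (blank lines, '...') are skipped.
    """
    ranked = []
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue
        ranked.append((float(parts[0]), parts[1]))
    return ranked


def count_top(ranked, fraction):
    """Count actives and decoys in the best-scoring fraction of the list."""
    n_top = int(len(ranked) * fraction)
    top = ranked[:n_top]
    actives = sum(1 for _, label in top if label == "Active")
    return actives, len(top) - actives


# The six example lines shown above.
sample = """\
-105.160493 Decoy
-105.037376 Active
-104.870392 Decoy
-103.900323 Decoy
-103.186615 Active
-103.178314 Decoy
...
"""

ranked = parse_scores(sample.splitlines())
scores = [score for score, _ in ranked]
print("sorted best first:", scores == sorted(scores))
print("actives, decoys in top half:", count_top(ranked, 0.5))
```

To use it on a real run, pass the lines of the file instead of `sample`, for example `parse_scores(open("All_score_sort.txt"))`.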

IV.Ligand Enrichment Analysis

002.analysis.sh assumes anaconda/3 is installed as a module. If it is not, the bash script can be edited so that the Python scripts are run externally with python3.

Before running 002.analysis.sh, fill in the parameters "testset" and "system_file" again, using the same values as before.

    bash 002.analysis.sh

Some "philosophical" decisions are built into these scripts and are important to be aware of:

    1. Actives and decoys which do not successfully dock are added to the end of the ranked list at a random enrichment rate (actives and decoys equally interspersed)
    2. Active and decoy mol2 files may have multiple protomers of the same ligand. These scripts retain all protomers for rescoring, although it may be desirable to retain only the best scoring protomer of each molecule.
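
The two decisions above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the code used by the lab scripts: it assumes protomers of one ligand share the same molecule name, and it spreads undocked actives evenly among undocked decoys as one possible reading of "equally interspersed". All function and molecule names are hypothetical.

```python
def best_protomers(scored):
    """Keep only the lowest (most favorable) score for each molecule name.

    scored: iterable of (molecule_name, score) pairs; protomers are assumed
    to share the same molecule name.
    """
    best = {}
    for name, score in scored:
        if name not in best or score < best[name]:
            best[name] = score
    return best


def intersperse_undocked(n_actives, n_decoys):
    """Return labels for undocked molecules with actives spread evenly.

    An active is placed whenever the running share of actives crosses the
    next whole number, which keeps the active/decoy ratio roughly constant
    along the tail of the ranked list.
    """
    total = n_actives + n_decoys
    labels = []
    for i in range(total):
        before = (i * n_actives) // total
        after = ((i + 1) * n_actives) // total
        labels.append("Active" if after > before else "Decoy")
    return labels


# Invented example data.
scores = [("mol_a", -50.0), ("mol_a", -60.0), ("mol_b", -40.0)]
print(best_protomers(scores))
print(intersperse_undocked(2, 4))
```

The labels returned by `intersperse_undocked` would be appended after the last successfully docked molecule in the ranked list.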

Running 002.analysis.sh will generate a ROC curve for each system and save it as:

    ~/testset/plots/1Q4X_Enrichment.png


[Figure: 1Q4X ligand enrichment DOCK6.9.png]

There will also be a file, Statistics.txt, quantifying the outcome:

    1Q4X
    1%
    AUC is 5.149840284033384
    Actives Count is 7
    Decoys Count is 24
    10%
    AUC is 304.8390844661322
    Actives Count is 40
    Decoys Count is 270
    100%
    AUC is 8236.886617507042

The value under the 1% header is the AUC at 1% of the database screened. This is a measure of early enrichment; at 1%, maximum enrichment corresponds to an AUC of 100.0 and random enrichment to 0.5.

If the top-scoring 1% of the entire database (3,100 actives + decoys) had been purchased for experimental validation (31 molecules), 7 would have been actives and 24 decoys.
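
The arithmetic behind these example numbers can be checked with a short Python sketch. The normalization is an assumption: it supposes both ROC axes are expressed in percent, so that the maximum AUC at x% is x * 100 (consistent with the maximum of 100.0 stated above for 1%). The variable and function names are chosen for this example only.

```python
database_size = 3100  # actives + decoys in the example above

# Values copied from the example Statistics.txt.
stats = {
    1: {"auc": 5.149840284033384, "actives": 7, "decoys": 24},
    10: {"auc": 304.8390844661322, "actives": 40, "decoys": 270},
    100: {"auc": 8236.886617507042},
}


def top_n(percent):
    """Number of molecules in the top-scoring percent of the database."""
    return database_size * percent // 100


def normalized_auc(percent):
    """AUC divided by the assumed maximum of percent * 100 (see lead-in)."""
    return stats[percent]["auc"] / (percent * 100)


for percent in (1, 10):
    entry = stats[percent]
    picked = entry["actives"] + entry["decoys"]
    share = entry["actives"] / picked
    print(f"top {percent}%: {top_n(percent)} molecules, "
          f"{entry['actives']} actives ({share:.1%} of those picked)")

for percent in stats:
    print(f"{percent}%: normalized AUC = {normalized_auc(percent):.4f}")
```

Under this assumption, a normalized AUC of 1.0 would be perfect enrichment at every cutoff.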


-SEE README FILE IN GIT REPO FOR ADDITIONAL DETAILS THAT MAY NOT BE COVERED HERE

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

Tutorial Written By: Christopher Corbo and Scott Laverty, Rizzo Lab, Stony Brook University (This tutorial was last updated 02/19/2024)

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>