Joe and Brian de novo stuff

From Rizzo_Lab

This is Joe and Brian's "secret" wiki page for coordinating notes on the de novo project.

Version of the code on cluster that we should be using:

Lauren:

/gpfs/home/leprentis/dock6.8_merge_06.14.2017
This version includes the orienting-fragments fix and the rotatable-bond fix.

Joe:

/sbhome1/wjallen/work/build/dock.6.7_2015-02-17.denovo_paper/src/dock/dock6

Brian:

/sbhome1/fochtman/RCR/projects_BNL/build/dock.6.7_2015-02-17.denovo_paper.2016.05.04/src/dock/dock6

Made changes to the graph method. Will update exhaustive and random if we decide to keep the changes.

We will probably do a merge with dock 6.8 when that is released. Brian: If you have any edits to the code, do a diff and e-mail me the results / path to your modified code. I will merge it into my branch and confirm with you that I made that change.

DOCK6.8 merge: de novo and dock6.8 were merged (/gpfs/home/leprentis/dock6.8_merge_06.14.2017)

Path to Library

/sbhome1/wjallen/work/de-novo-test/gen-frags-12
 Generic fragment library is unchanged (/gpfs/projects/rizzo/leprentis/gen-frags-12)

Path to Frequency Anchors:

/gpfs/projects/rizzo/leprentis/zinc1_ancs_freq


Current Coding Issues:

Working on these currently (Lauren):

  1. Lauren: Adjacency matrix vs tors env
  2. Lauren: Torsion table may be used as a triple-fragment checkpoint
  3. Dwight: Min and Max for charge to replace absolute value of charge.
  4. Lauren: Check 663 (5through15_ch2) systems with merged dock6.8 and check analysis against de novo paper analysis.
  5. Lauren: Fix Frag_String output into chimera for refinement situations
  6. Lauren: Implement Roulette fragment picking into graph and random as an option
  7. Lauren: Change scaling factor to a function of decay (currently a straight line to lowest score cutoff)
  8. Lauren: Capping groups for post growth process (halogens and methyls)
  9. Lauren: Test simple build function with merged de novo
  10. Lauren: testing with sb2012 default values for molecules passed on to the next layer and root (looking for timing and efficiency).
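
Item 6 above mentions roulette fragment picking. As a rough sketch of what roulette-wheel (weight-proportional) selection usually means, the snippet below picks an index with probability proportional to its weight. The function name `roulette_pick`, the idea of using fragment frequencies as the weights, and the example numbers are all assumptions for illustration; none of this is taken from the DOCK code.

```python
import bisect
import itertools
import random


def roulette_pick(weights, u=None):
    """Return an index chosen with probability weights[i] / sum(weights).

    u is a "spin" in [0, 1); if it is omitted, a random one is drawn.
    """
    if not weights or any(w < 0 for w in weights) or sum(weights) <= 0:
        raise ValueError("weights must be non-negative with a positive sum")
    if u is None:
        u = random.random()
    cumulative = list(itertools.accumulate(weights))
    # First slot whose cumulative weight exceeds the spin position.
    index = bisect.bisect_right(cumulative, u * cumulative[-1])
    # Guard against floating-point round-off at the very top of the wheel.
    return min(index, len(weights) - 1)


# Hypothetical fragment frequencies used as weights (assumed values).
frequencies = [1, 3, 6]
picks = [roulette_pick(frequencies, u) for u in (0.05, 0.30, 0.95)]
print(picks)  # [0, 1, 2]
```

Passing `u` explicitly makes the pick reproducible for testing; leaving it out gives the usual random behavior.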


Working on these now (Joe and Brian):

  1. Joe: Why is the hungarian function causing descriptor score to be so slow? Note from Brian -> maybe we can replace some of the pow() function calls.
  2. Joe: Should hungarian horizontal prune be < or <=? Should it be && or ||? What should happen if the user sets num_unmatched = 0?
  3. Brian: valgrind, memory
  4. Calculate fragment freq in generic ensembles; compare to freq of same fragments in zinc drug like molecules
    1. Would scoring by fragment efficiency
  5. Joe: Can we store ZINC ID with torsion environment?
  6. We are not accounting for stereochemistry / chirality in our fragments - is this a problem?
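
Item 2 above asks whether the hungarian horizontal prune should use < or <=, and && or ||. The sketch below tabulates how the four variants behave for a single pair of molecules. It assumes a pair is flagged as redundant by comparing the number of unmatched atoms and the RMSD of the matched atoms against the dn_heur_unmatched_num and dn_heur_matched_rmsd cutoffs. The function name `is_redundant` and that exact form of the test are assumptions, not copied from the DOCK source.

```python
import operator


def is_redundant(unmatched, rmsd, max_unmatched, max_rmsd, strict, require_both):
    """Assumed form of the prune test: compare both values to their cutoffs."""
    compare = operator.lt if strict else operator.le
    unmatched_ok = compare(unmatched, max_unmatched)
    rmsd_ok = compare(rmsd, max_rmsd)
    if require_both:
        return unmatched_ok and rmsd_ok  # the "&&" variant
    return unmatched_ok or rmsd_ok       # the "||" variant


# Edge case from the question: two identical molecules (0 unmatched atoms,
# RMSD 0.0) while the user has set the unmatched cutoff to 0.
results = {}
for strict in (True, False):
    for require_both in (True, False):
        label = ("<" if strict else "<=") + " " + ("&&" if require_both else "||")
        results[label] = is_redundant(0, 0.0, 0, 2.0, strict, require_both)
        print(label, results[label])
```

Under these assumptions, a cutoff of 0 combined with `<` and `&&` can never flag a pair, because no atom count is below zero; the other three variants flag the identical pair.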


Not working on these right now:

  1. Why do some systems take so long? 1FPU? Is the timing related to or independent of hungarian / tanimoto in descriptor score?
    1. 1FPU only takes a long time for some coordinate sets when using hungarian in descriptor score.
  2. Did we ever try standard docking with just hungarian score, and how does it affect timings? How does standard docking w/ minimized ligand affect timings?
    1. Yes, 95% success. On average it takes 3x as long as using grid, but individual systems can take anywhere from 1x (the same) to 14x (if they finished at all) the time of grid energy. For a smaller set, I ran pose reproduction with hungarian score alone using two sets of initial coordinates (lig.am1bcc and lig.multigridmin). These data are better behaved than expected, but some systems still took 40% longer for one set of coordinates than for the other.
  3. Do we want to change any of the defaults for the proposed input file?
  4. Can we settle on a reasonable size for a generic library?
  5. Capping groups on or off for this release?


Completed:

  1. Are we satisfied with just using hungarian in horizontal pruning, rather than hungarian and tanimoto?
    1. Yep, but maybe we are still working on finding the right combination of parameters
  2. How long should we keep doing tests with grid score, and when should we start using multigrid score?
    1. we are using it now
  3. Joe: Does Yuchen's proposed residue number bug fix affect the de novo code?
    1. It seems so to me. I confirmed that it did not break anything in the de novo code.
    2. Scott and Dwight also confirmed that it worked, Scott referenced an earlier bug report that mentioned this same issue.
    3. I think Scott did / is going to commit this change.
  4. Joe: How many torsions should we be keeping for each fragment added? What should be the torsion rmsd cutoff?
    1. After some consideration, I think we should use the exact same heuristic as grow_periphery in conf_gen_ag: if (rank / rmsd > cutoff) then (prune) else (keep)
    2. I implemented this. By default we use the same pruning_clustering_cutoff = 100.0 (called dn_pruning_clustering_cutoff)
  5. Is visited flag working correctly together with max_current_aps to control when scaffolds vs. linkers vs. sidechains can be added?
  6. Brian: Is scaffold_this_layer working correctly for all sampling functions?
    1. <strike>new bool construction needs to be implemented in exhaustive</strike>
    2. <strike>in the process of modifying code to allow user input of this param</strike>
    3. The flag was set as soon as you tried a scaffold, not after the scaffold became part of the ensemble.
    4. For rand and exhaustive, we now only mark it if the fragment made it through the torenv check. There are still cases where it will not make it through sample_minimized_torsions, so it is still flagged too often.
    5. For graph, we picked a point and assumed that scaffold_this_layer was true. This has been changed so that you at least consider a scaffold. As with the other sampling functions, it still needs to be fixed to make sure sample_minimized_torsions actually returned a molecule.
  7. picked flag
  8. Brian: In standard pose reproduction, how many copies of each anchor do we keep? Is this comparable to how many we are keeping in de novo?
    1. on average we keep 66.6 orientations per anchor / 166.5 orientations per molecule for pose reproduction
    2. started run with 150 root and layer size
  9. Joe: Do we need to restructure the orient code so it is more like anchor and grow?
    1. This is now done. Here is what happens now:
    2. If orient_ligand is yes, then either choose up to dn_unique_anchors fragments from the anchor file, or if the anchor file is not provided, choose them from the other fraglibs. Then, one at a time, generate up to max_orientations orients for an anchor, put them all in root, and go through de novo growth. Repeat the process for all other anchors. Each starting orient should have its own set of checkpoint files and orient files, but they can share the final output file.
    3. If orient_ligand is no, then do the exact same thing above except do not enter the orient_fragments function. Thus, each new instance of de novo growth will start with exactly one pose of one anchor in root.
  10. Joe: How should we reconcile sample_minimized_torsions - maybe make it an input parameter? How many torsions to keep?
    1. I ended up modifying it to be more like anchor and grow, and use the rank / rmsd pruning heuristic.
    2. To support that change, we added the dn_pruning_conformer_score_cutoff / dn_pruning_clustering_cutoff parameters.
  11. Brian: Are we still seeing differences between minimized and unminimized fragment libraries? What is the cause of this? (e.g. 1N46, 1O3D, 1O2I, 3K5V)
    1. Hungarian score is the issue in some circumstances, but there are divergences in the ensembles and timing even in cases with grid score alone.
  12. Joe: Why are there so many systems with 0 molecules built? Too aggressive pruning? Not enough starting anchors?
    1. dn_constraint_charge_groups
    2. I think Brian figured this one out - some of the parameters were a bit too restrictive, but the major thing was that wildcards in the torenvs screwed up the alphabetization, and torenv pairs that should have passed the check did not. Brian fixed it by checking the torenv pair in the forward order, then in the reverse order, in two consecutive loops.
  13. Brian: Score ranges.
    1. Separated internal energy cutoff and pruning conformer score cutoff.
  14. Brian: .out ensemble size filter
  15. Brian: formal charges
    1. 1DQX, 1EIX, 1LOR differences between gasteiger and am1bcc formal charges, need to update gasteiger function
  16. Brian: Is the improvement flag ever triggered? Find an example where we generate torsions from a breadth search.
  17. Brian: Should the fraggraph_vec structure be changed so that we don't sample different fragments that are identical in tanimoto space (atom typing issues?)
  18. Brian: Prune fragments if they exceed cutoffs immediately after the combine_fragments function. Put in sample minimized torsions, not removed from end of layer.
  19. Brian: Use the current_score of the best new torsion to decide if we should keep this torsion in sample_fraglib_graph.
  20. Brian: Change starting score in graph function to be current_score of mol from last layer instead of 0.
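
Item 4 in the Completed list settles on the rank / rmsd heuristic: if (rank / rmsd > cutoff) then prune, else keep. A minimal sketch of that rule follows. It assumes that conformers are ranked by score (lower is better), that the rank is the 0-based position in that sorted list, that the RMSD is measured against a better-scoring conformer that has been kept, and that the RMSD is clamped to avoid dividing by zero. The names `prune_by_rank_over_rmsd` and `min_rmsd`, and the one-coordinate toy "conformers", are made up for the example and are not the DOCK implementation.

```python
import math


def rmsd(a, b):
    """Root-mean-square deviation between two equal-length coordinate lists."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


def prune_by_rank_over_rmsd(conformers, cutoff=100.0, min_rmsd=1e-3):
    """conformers: list of (score, coords). Returns the kept ones, best first."""
    ranked = sorted(conformers, key=lambda c: c[0])
    pruned = [False] * len(ranked)
    for i in range(len(ranked)):
        if pruned[i]:
            continue  # only kept conformers act as references
        for j in range(i + 1, len(ranked)):
            if pruned[j]:
                continue
            distance = max(rmsd(ranked[i][1], ranked[j][1]), min_rmsd)
            # Worse-ranked conformers must differ more to survive.
            if j / distance > cutoff:
                pruned[j] = True
    return [c for c, was_pruned in zip(ranked, pruned) if not was_pruned]


# Toy data: two tight pairs of conformers; a small cutoff is used so the
# effect is visible with only four entries.
toy = [(-10.0, [0.0]), (-9.0, [0.1]), (-8.0, [5.0]), (-7.0, [5.2])]
kept = prune_by_rank_over_rmsd(toy, cutoff=2.0)
print([score for score, _ in kept])  # [-10.0, -8.0]
```

Because the rank sits in the numerator, a poorly ranked conformer needs a larger RMSD from the kept ones to survive than a well-ranked conformer does.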


List of features that we definitely want for the release:

Task | Owner | Complete?
--- | --- | ---
Overhaul the simple build function | BCF |
When minimizing with descriptor score, make sure fingerprint is turned off | xxx |
Speed up fingerprint calculations by saving reference ligand as a permanent object | WJA | yep
Add pre-min conformations to growth trees | WJA | yep
Add verbose flag options | WJA | yep
Put molecular properties (RB, MW, etc) in mol2 header | WJA | yep
Put ensemble properties (RB, MW, etc) output stream at the end of each layer | WJA | yep
Check formal charge prune | BCF | yep
Combination of horizontal pruning metrics (let's consider dropping tanimoto prune and just using hungarian prune) | WJA | yep
Finish implementing growth trees | WJA | yep
Revisit orienting to make sure it is working as intended | WJA | yep
Fixed a bug where we were marking scaffold_this_layer as true for any fragment | WJA | yep
Update random sampling function to use last layer changes in graph function | WJA | yep
Do that same thing for the exhaustive function | WJA | yep
I don't think we ever clear the scaf_link_sid vector, we definitely should do that somewhere | WJA | yep
Update exhaustive to combine all frags into one library, just like graph / random. | WJA | yep


List of features that would be nice to have, but not necessary for release:

  • Stereo centers / volume overlap pruning
  • Capping group functions (H, CH3, Halogen)
  • Incorporate GA at the end of each layer
  • Overhaul the simple-build function
  • Monte carlo algorithm that checks bond frequency
  • Scaling max root / layer size with layer
  • Select torenv before selecting fragment. Will need to overhaul fraggraph, will keep us from needing to assemble mols that will not pass torenv.
  • Add fragname string to restart and dump files, already done for final and fraglib files.
  • Add ZINC name to torenv table
  • Unusual behavior during library generation when frequency cutoff == 0
  • Print out how many molecules cannot be capped. (Difference between ensemble size and dump.)
  • building from anchor 0 -> building from scf.98
  • Possible torenv check for dump molecules after capping before printing.


List of SB2012 systems that we will use for tests:

For now, let's use 5-15 rotatable bonds inclusive; total = 709 systems. [Joe]: I like this set because it represents "drug-like" size molecules, and they should not be too easy / too hard.

{5RB = 107; 6RB = 96; 7RB = 103; 8RB = 75; 9RB = 66; 10RB = 75; 11RB = 57; 12RB = 41; 13RB = 38; 14RB = 26; 15RB = 25}
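
As a quick check that the per-bin counts above add up to the stated total of 709 systems, here is a small sketch; the dictionary name `systems_per_rb` is just for illustration.

```python
# Number of SB2012 systems per rotatable-bond count, copied from the list above.
systems_per_rb = {
    5: 107, 6: 96, 7: 103, 8: 75, 9: 66, 10: 75,
    11: 57, 12: 41, 13: 38, 14: 26, 15: 25,
}

total = sum(systems_per_rb.values())
print(total)  # 709
```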


Parameters to screen

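 for DN_HEUR_UNMATCHED_NUM in 2 5 8 ; do
 for DN_HEUR_MATCHED_RMSD in 1.0 2.0 3.0 ; do
 for DN_UNIQUE_ANCHORS in 3 6 9 ; do
 for DN_MAX_GROW_LAYERS in 7 9 ; do
 for DN_MAX_CURRENT_APS in 5 7 ; do
 for DN_MAX_ROOT_SIZE in 50 100 ; do
 for DN_MAX_LAYER_SIZE in 50 100 ; do
   # Placeholder body: write the input file and launch the run for this combination here.
   echo "$DN_HEUR_UNMATCHED_NUM $DN_HEUR_MATCHED_RMSD $DN_UNIQUE_ANCHORS $DN_MAX_GROW_LAYERS $DN_MAX_CURRENT_APS $DN_MAX_ROOT_SIZE $DN_MAX_LAYER_SIZE"
 done ; done ; done ; done ; done ; done ; done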
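
To get a feel for the size of this screen, the sketch below enumerates the same grid and counts the combinations. The lowercase keys mirror the input-file parameter names; the variable names `screen` and `combinations` are just for illustration.

```python
import itertools

# Values to screen, copied from the loops above.
screen = {
    "dn_heur_unmatched_num": [2, 5, 8],
    "dn_heur_matched_rmsd": [1.0, 2.0, 3.0],
    "dn_unique_anchors": [3, 6, 9],
    "dn_max_grow_layers": [7, 9],
    "dn_max_current_aps": [5, 7],
    "dn_max_root_size": [50, 100],
    "dn_max_layer_size": [50, 100],
}

# One dict per parameter combination, in the same nesting order as the loops.
combinations = [
    dict(zip(screen, values))
    for values in itertools.product(*screen.values())
]

print(len(combinations))  # 432
print(combinations[0])
```

That is 3 x 3 x 3 x 2 x 2 x 2 x 2 = 432 parameter sets per system, before multiplying by the number of test systems.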


The input file we should be using:

Note: these are currently all the defaults (except for scoring function). Once we figure out the best set of parameters for both (1) focused and (2) generic libraries, then we need to update the defaults accordingly.

conformer_search_type                                        denovo
dn_fraglib_scaffold_file                                     fraglib_scaffold.mol2
dn_fraglib_linker_file                                       fraglib_linker.mol2
dn_fraglib_sidechain_file                                    fraglib_sidechain.mol2
dn_user_specified_anchor                                     yes
dn_fraglib_anchor_file                                       fraglib_anchor.mol2
dn_use_torenv_table                                          yes
dn_torenv_table                                              torenv_table.dat
dn_sampling_method                                           graph
dn_graph_starting_points                                     10
dn_graph_breadth                                             5
dn_graph_depth                                               2
dn_graph_temperature                                         100.0
dn_pruning_conformer_score_cutoff                            100.0
dn_pruning_conformer_score_scaling_factor                    1
dn_pruning_clustering_cutoff                                 100.0
dn_constraint_mol_wt                                         1000.0
dn_constraint_rot_bon                                        15
dn_constraint_formal_charge                                  2.0
dn_heur_unmatched_num                                        1
dn_heur_matched_rmsd                                         2.0
dn_unique_anchors                                            3
dn_max_grow_layers                                           9
dn_max_root_size                                             100
dn_max_layer_size                                            100
dn_max_current_aps                                           5
dn_max_scaffolds_per_layer                                   1
dn_write_checkpoints                                         yes
dn_write_prune_dump                                          no
dn_write_orients                                             no
dn_write_growth_trees                                        no
dn_output_prefix                                             output
use_internal_energy                                          yes
internal_energy_rep_exp                                      12
internal_energy_cutoff                                       100.0
use_database_filter                                          no
orient_ligand                                                yes
automated_matching                                           yes
receptor_site_file                                           receptor.sph
max_orientations                                             1000
critical_points                                              no
chemical_matching                                            no
use_ligand_spheres                                           no
bump_filter                                                  no
score_molecules                                              yes
contact_score_primary                                        no
contact_score_secondary                                      no
grid_score_primary                                           no
grid_score_secondary                                         no
multigrid_score_primary                                      no
multigrid_score_secondary                                    no
dock3.5_score_primary                                        no
dock3.5_score_secondary                                      no
continuous_score_primary                                     no
continuous_score_secondary                                   no
footprint_similarity_score_primary                           no
footprint_similarity_score_secondary                         no
ph4_score_primary                                            no
ph4_score_secondary                                          no
descriptor_score_primary                                     yes
descriptor_score_secondary                                   no
descriptor_use_grid_score                                    no
descriptor_use_multigrid_score                               yes
descriptor_use_pharmacophore_score                           no
descriptor_use_tanimoto                                      yes
descriptor_use_hungarian                                     yes
descriptor_multigrid_score_rep_rad_scale                     1
descriptor_multigrid_score_vdw_scale                         1
descriptor_multigrid_score_es_scale                          1
...
descriptor_multigrid_score_number_of_grids                   N
descriptor_multigrid_score_grid_prefix0                      grid0
descriptor_multigrid_score_grid_prefixN-2                    gridN
descriptor_multigrid_score_grid_prefixN-1                    grid_remaining
...
descriptor_multigrid_score_fp_ref_mol                        yes
descriptor_multigrid_score_footprint_ref                     ../001.files/8ABP.lig.multigridmin.mol2
descriptor_multigrid_score_use_euc                           yes
descriptor_multigrid_score_use_norm_euc                      no
descriptor_multigrid_score_use_cor                           no
descriptor_multigrid_vdw_euc_scale                           1
descriptor_multigrid_es_euc_scale                            1
descriptor_fingerprint_ref_filename                          ../001.files/8ABP.lig.multigridmin.mol2
descriptor_hungarian_ref_filename                            ../001.files/8ABP.lig.am1bcc.mol2
descriptor_hungarian_matching_coeff                          -5
descriptor_hungarian_rmsd_coeff                              1
descriptor_weight_multigrid_score                            1
descriptor_weight_fingerprint_tanimoto                       0
descriptor_weight_hungarian                                  0
gbsa_zou_score_secondary                                     no
gbsa_hawkins_score_secondary                                 no
SASA_descriptor_score_secondary                              no
amber_score_secondary                                        no
minimize_ligand                                              yes
minimize_anchor                                              yes
minimize_flexible_growth                                     yes
use_advanced_simplex_parameters                              no
simplex_max_cycles                                           1
simplex_score_converge                                       0.1
simplex_cycle_converge                                       1.0
simplex_trans_step                                           1.0
simplex_rot_step                                             0.1
simplex_tors_step                                            10.0
simplex_anchor_max_iterations                                500
simplex_grow_max_iterations                                  500
simplex_grow_tors_premin_iterations                          0
simplex_random_seed                                          0
simplex_restraint_min                                        no
atom_model                                                   all
vdw_defn_file                                                vdw.defn
flex_defn_file                                               flex.defn
flex_drive_file                                              flex_drive.tbl
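
When comparing input files across runs, it can help to load them into a dictionary. The sketch below assumes each line is a parameter name followed by whitespace and a value, as in the listing above, and skips blank lines and the "..." placeholders. The function name `parse_dock_input` and the shortened sample text are for illustration only; this is not a parser that ships with DOCK.

```python
def parse_dock_input(text):
    """Parse 'name   value' lines into a dict of strings."""
    params = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue  # blank line or a "..." placeholder
        key, value = parts
        params[key] = value.strip()
    return params


# A few lines taken from the input file above.
sample = """\
conformer_search_type                      denovo
dn_sampling_method                         graph
...
descriptor_weight_multigrid_score          1
descriptor_weight_fingerprint_tanimoto     0
descriptor_weight_hungarian                0
"""

params = parse_dock_input(sample)

# Which descriptor terms carry a non-zero weight?
active = [
    name for name, value in params.items()
    if name.startswith("descriptor_weight_") and float(value) != 0
]
print(active)  # ['descriptor_weight_multigrid_score']
```

If the weights simply multiply each term, then with this input only the multigrid term contributes to the descriptor score, even though tanimoto and hungarian are switched on.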