3D Analog Library Generation Using Pubchem and Zinc

From Rizzo_Lab

Hello! This short write-up is designed to help the group, and any other users who stumble across it, curate a library of compounds analogous to compounds identified experimentally as active, for use in a secondary or follow-up virtual screen. First, we want a list of the active compounds' ZINC IDs and SMILES strings. The example I'll use here is ZINC000019831888, whose SMILES string is: OC(COC=1C=CC(=CC1)C(=O)C=2C=CC=CC2)CN3CCN(CC3)C=4C=CC=CC4Cl.


After collecting the pertinent information for the compounds we're interested in, we can head to https://pubchem.ncbi.nlm.nih.gov. This will bring up a screen that looks like this:

Screen Shot 2018-05-07 at 12.00.26 PM.png



We want to select the Structure Search bar on the right-hand side of the screen:

Structure search selection pubchem.JPG



Then we are taken to this page: https://pubchem.ncbi.nlm.nih.gov/search/search.cgi

Screen Shot 2018-05-07 at 12.17.39 PM.png



We want to select the Identity/Similarity tab:

Select similarity Search.JPG



That will bring up this screen:

Similarity search window.JPG



From here we want to select the "CID, SMILES or InChI" tab, paste in our SMILES string, and then set some parameters (Tanimoto greater than 0.80 and compounds only from ZINC):

Similarity search w parameters.png

Pubchem paramters for sim search.png



After clicking search, we will be brought to a screen like this, where we can select the "send to" tab on the right-hand side:

Pubchem search output.png


Send to tab pubchem.png



After clicking "create file", a file will be downloaded to our downloads directory. It will look something like this:

Compound summary pubchem.png



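From here we can grep out all of the ZINC IDs obtained from the search using something along the lines of the command below, where the pattern matches the ZINC prefix together with the digits that follow it:

grep -oE 'ZINC[0-9]+' path_to_summary_file.out > all_zinc_ids.txt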

Here is an example of what all_zinc_ids.txt might look like:

Zinc id list for database 3d search.png
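
As a small illustration of the extraction step, the sketch below builds a made-up summary file and pulls every ZINC ID out of it, keeping each ID once. The file names summary_demo.out and all_zinc_ids_demo.txt, the record layout, and the second ID are assumptions for the demo only; the real PubChem summary file will look different.

```shell
# Hypothetical stand-in for the downloaded PubChem summary file
# (layout and placeholder IDs are invented for this demo).
cat > summary_demo.out <<'EOF'
1. Example compound A
Synonyms: ZINC000019831888; other-name-1
2. Example compound B
Synonyms: ZINC000000000001; ZINC000019831888
EOF

# -o prints only the matched text; -E enables the "+" quantifier.
# sort -u keeps each ID once, even if it appears under several records.
grep -oE 'ZINC[0-9]+' summary_demo.out | sort -u > all_zinc_ids_demo.txt

cat all_zinc_ids_demo.txt
```

Piping through sort -u is optional, but it avoids submitting the same ID to ZINC more than once.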



Caveat emptor! The list of ZINC IDs may contain trailing punctuation (in my experience, semicolons), which should be removed before querying ZINC. This can be done pretty simply using awk or sed. It is best to look over each list of ZINC IDs individually, provided the lists aren't prohibitively large.
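
One possible way to do this cleanup with sed is sketched below. The input list is invented for the demo (raw_ids_demo.txt and clean_ids_demo.txt are hypothetical file names), and the final grep is just a quick check that every remaining line is "ZINC" followed only by digits.

```shell
# Hypothetical list with the kind of trailing punctuation described above.
printf '%s\n' 'ZINC000019831888;' 'ZINC000000000001' 'ZINC000000000002,' > raw_ids_demo.txt

# Strip any run of punctuation at the end of each line, then drop duplicates.
sed 's/[[:punct:]]*$//' raw_ids_demo.txt | sort -u > clean_ids_demo.txt

# Print any line that is still not a bare ZINC ID; if none, say so.
grep -vE '^ZINC[0-9]+$' clean_ids_demo.txt || echo "all IDs look clean"
```

This does not replace reading through the list by eye, but it catches the common cases quickly.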

Depending on the size of the analog library collected (that is, the list of ZINC IDs), we can break it up into chunks of 1000 IDs each. This can be done using the split command from the terminal.
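
A minimal sketch of the chunking step follows. It generates 2500 placeholder IDs to stand in for all_zinc_ids.txt; the file name ids_demo.txt, the prefix ids_chunk_, and the IDs themselves are assumptions for illustration.

```shell
# Build a hypothetical list of 2500 placeholder IDs.
seq 1 2500 | awk '{ printf "ZINC%012d\n", $1 }' > ids_demo.txt

# -l 1000 puts at most 1000 lines in each chunk. Output files are named
# with the given prefix plus a suffix: ids_chunk_aa, ids_chunk_ab, ...
split -l 1000 ids_demo.txt ids_chunk_

# Show the line count of each chunk.
wc -l ids_chunk_*
```

With 2500 IDs this gives three files: two with 1000 lines and one with the remaining 500. Each chunk can then be uploaded to ZINC separately.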

From here our library curation turns to the ZINC database to download the 3-dimensional structures for docking.

If we are on the ZINC15 substances page (http://zinc15.docking.org/substances/home/), we can click the "choose file" button under the search-many option and upload our all_zinc_ids.txt file to start a query. The only parameter to change is the output format: we want mol2 rather than summary (see picture):

Screen Shot 2018-05-22 at 10.15.22 AM.png


After selecting the appropriate list of ZINC IDs and the mol2 output format, you can click the "Search many" box. This will eventually begin downloading a file to your downloads directory, usually called resolved-3.mol2 or something along these lines. It is important to move the resulting mol2 file to an appropriate directory and rename it accordingly!
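
The move-and-rename step might look like the sketch below, which also counts the molecule records in the file so the total can be compared against the number of IDs submitted. The directory names, the new file name, and the two stub records are assumptions for the demo; a real download carries full atom and bond sections after each @<TRIPOS>MOLECULE tag.

```shell
# Hypothetical stand-ins for the downloads directory and the project directory.
mkdir -p downloads_demo analog_library_demo

# Stub mol2 file: two records, each opened by the @<TRIPOS>MOLECULE tag.
printf '%s\n' '@<TRIPOS>MOLECULE' 'ZINC000019831888' \
              '@<TRIPOS>MOLECULE' 'ZINC000000000001' > downloads_demo/resolved-3.mol2

# Move the file out of the downloads directory under a descriptive name.
mv downloads_demo/resolved-3.mol2 analog_library_demo/analogs_chunk_aa.mol2

# Count molecule records in the renamed file.
grep -c '@<TRIPOS>MOLECULE' analog_library_demo/analogs_chunk_aa.mol2
```

If the record count does not match the number of IDs in the uploaded list, it is worth checking which IDs differ before docking.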