Analog Library - PubChem & ZINC
Hello! This short write-up is designed to make it easier for the group, and other users who may stumble across it, to curate a library of compounds analogous to compounds identified experimentally as active, for the purpose of a secondary or follow-up virtual screen. First, we want a list of each active compound's ZINC ID and SMILES string. The example I'll use here is ZINC000019831888, whose SMILES string is: OC(COC=1C=CC(=CC1)C(=O)C=2C=CC=CC2)CN3CCN(CC3)C=4C=CC=CC4Cl.
After collecting the pertinent information for the compounds we're interested in, we can head to https://pubchem.ncbi.nlm.nih.gov. This will bring up a screen that looks like this:
We want to select the Structure Search bar on the right-hand side of the screen:
Then we are taken to this page:
https://pubchem.ncbi.nlm.nih.gov/search/search.cgi
We want to select the Identity/Similarity tab:
That will bring up this screen:
From here we want to select the "CID, SMILES or InChI" tab and paste in our SMILES string. Then we can set some parameters (Tanimoto similarity greater than 0.80, and compounds only from ZINC):
After clicking search, we will be brought to a screen like this, where we can select the "send to" tab on the right-hand side:
After clicking create file, a file will be downloaded to our downloads directory and look something like this:
From here we can grep out all of the ZINC IDs obtained from the search using something along the lines of:
grep -oE 'ZINC[0-9]+' path_to_summary_file.out > all_zinc_ids.txt
Here is an example of what all_zinc_ids.txt might look like:
Caveat emptor: depending on the pattern used to extract them, the ZINC IDs may carry trailing punctuation. In my experience it has been semicolons, which should be removed before querying ZINC. This can be done pretty simply using awk or sed. It is best to look over each list of ZINC IDs individually, provided it isn't prohibitively large.
Depending on the size of the analog library collected (the list of ZINC IDs), we can break it up into chunks of 1000 IDs each. This can be done using the split command from the terminal.
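As a rough illustration of the cleanup, here is a minimal sed sketch. The file names sample_raw_ids.txt and sample_clean_ids.txt and the ID ZINC000000000001 are made-up placeholders, and the duplicate removal with sort -u is my own addition rather than part of the workflow above, so adjust to taste.

```shell
# Made-up sample standing in for a freshly extracted ID list
# (ZINC000000000001 is a placeholder, not a real hit).
printf '%s\n' 'ZINC000019831888;' 'ZINC000000000001' 'ZINC000019831888' > sample_raw_ids.txt

# Strip any run of non-digit characters at the end of each line,
# then sort and drop duplicate IDs.
sed -E 's/[^0-9]+$//' sample_raw_ids.txt | sort -u > sample_clean_ids.txt

# Print anything that still does not look like "ZINC" followed by digits.
grep -vE '^ZINC[0-9]+$' sample_clean_ids.txt || echo "all IDs look well formed"
```

If the last command prints any lines, those entries are worth inspecting by hand before uploading the list.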
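A minimal sketch of the chunking step with split might look like the following. The input file sample_zinc_ids.txt and the sample_chunk_ prefix are placeholders, and the 2500 IDs are generated filler rather than real search results.

```shell
# Generate 2500 made-up IDs, one per line, as a stand-in for a real list.
seq 1 2500 | awk '{ printf "ZINC%012d\n", $1 }' > sample_zinc_ids.txt

# -l 1000 puts at most 1000 lines in each piece; the pieces are named
# sample_chunk_aa, sample_chunk_ab, sample_chunk_ac, ...
split -l 1000 sample_zinc_ids.txt sample_chunk_

# Check how many IDs ended up in each piece.
wc -l sample_chunk_*
```

Each resulting piece can then be handled as its own list, with the last one holding whatever remains.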
From here our library curation turns to the ZINC database to download the 3-dimensional structures for docking.
If we are on the ZINC15 substances page ( http://zinc15.docking.org/substances/home/ ), we can click the "choose file" button under "search using many" and upload our all_zinc_ids.txt file to start a query. The only parameter to change is the output format: we want mol2 rather than summary (see picture):
After selecting the appropriate list of ZINC IDs and the mol2 output, you can click the "Search many" box. This will eventually begin downloading a file to your downloads directory, usually called resolved-3.mol2 or something along these lines. It is important to move the resulting mol2 file to an appropriate directory and rename it accordingly!
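The move-and-rename step could be sketched as below, with a quick count of the molecules in the file as a sanity check. The directory name sample_analog_library, the file names, and the stub file contents are all placeholders. The stub contains only record headers, not valid structures, and ZINC000000000001 is a made-up ID.

```shell
# Stub standing in for the downloaded mol2 file: two record headers only.
printf '%s\n' '@<TRIPOS>MOLECULE' 'ZINC000019831888' \
              '@<TRIPOS>MOLECULE' 'ZINC000000000001' > sample_resolved.mol2

# Move the download into a project directory under a descriptive name.
mkdir -p sample_analog_library
mv sample_resolved.mol2 sample_analog_library/ZINC000019831888_analogs.mol2

# Each molecule record in a mol2 file begins with an @<TRIPOS>MOLECULE line,
# so counting those lines gives the number of structures downloaded.
grep -c '@<TRIPOS>MOLECULE' sample_analog_library/ZINC000019831888_analogs.mol2
```

Comparing that count against the number of IDs uploaded is a simple way to spot IDs that ZINC could not resolve.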