Automated Families generation from the Protein Databank
Latest revision as of 12:35, 12 August 2011
This page is under construction. Things to come:
- a description of the actual step-by-step protocol (as a numbered list), with screenshots and specific examples from a kinase family.
The last steps should include:
- how to download all the files
- alignment of the files using Chimera
- a picture of the relevant Chimera output showing how many residues were aligned and what the RMSD is
- a picture of the final aligned structures.
The RCSB Protein Data Bank (PDB) is the largest database of its kind, containing detailed information on all kinds of proteins. Making full use of its search tools makes it possible to find similar proteins with ligands, which can then be used for docking and similar processes. Streamlining this task reduces the time needed to find them and leaves more time for the intricate details of the kinases themselves.
The first step is to search for the individual protein you are working with. The PDB will pull up the entry for it (which can be downloaded if desired), containing specific details on the protein. The EC number, a description of what the protein is responsible for, its various names and associated categories, and its crystal structure are just some of the data displayed.
The next step is to look at the structure of the protein's chains. Examining this confirms that it is the type of protein (chemically) that you want to work with. Details such as the number of residues and the amino acid sequence are shown.
The tab next to sequence is sequence similarity. It displays the cluster sequence cut-off percentages, along with the rank, the number of chains in the cluster, and the cluster number. Typical similarity cut-offs include 100%, 95%, 90%, 70%, 50%, and 40%. Other approximate percentages (such as 60% and 85%) appear depending on the individual protein, as they are based on the number of similar proteins that have been discovered. Clicking on the desired similarity percentage gives a list of proteins with similar sequences.
The final step is to review the results personally, making sure each hit is the same type of protein and checking whether or not it has ligands. From there, it is possible to create a list to work with when eventually moving on to docking.
The PDB search protocol is a useful tool for finding proteins that are similar to others. The original approach was to use a program written for UNIX. However, it proved unsuccessful and unreliable because the program could not perform comparisons when the molecules were quite large.
Script to download and match protein families
For this script, you need a list of PDB codes that are likely to be in the same family. A search on the PDB, or a search by EC number, is a good way of obtaining this list. Store the codes in a file, e.g. pdbcode_list. Select a PDB file as the reference. Ideally, this would be the wild-type protein with a high-quality X-ray structure and no missing residues. In the example below, it is 1CKP.pdb. Include the reference code in pdbcode_list as a control.
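Preparing the code list can be sketched as follows. This is a minimal illustration in Python rather than the csh used by the script itself; the function name `build_code_list` and the example codes other than 1CKP are assumptions made for the example, not part of the original protocol. It upper-cases the codes, drops duplicates, checks the classic four-character PDB ID pattern, and puts the reference first so that it is always present as a control.

```python
import re

# Classic PDB IDs: one digit (1-9) followed by three alphanumeric characters.
PDB_ID = re.compile(r"^[1-9][A-Za-z0-9]{3}$")


def build_code_list(raw_codes, reference):
    """Return upper-cased, de-duplicated PDB codes with the reference first."""
    seen = []
    for code in [reference, *raw_codes]:
        code = code.strip().upper()
        if not code:
            continue  # skip blank lines from the search export
        if not PDB_ID.match(code):
            raise ValueError(f"not a 4-character PDB code: {code!r}")
        if code not in seen:
            seen.append(code)
    return seen


# Example codes are placeholders for the result of a PDB or EC-number search.
codes = build_code_list(["1hck", "1CKP", " 2a4l ", "1HCK"], reference="1ckp")
print("\n".join(codes))  # one code per line, ready to save as pdbcode_list
```

Writing the printed lines to a file gives a pdbcode_list in the one-code-per-line form that the `cat $1` loop in the script expects.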
# csh script: takes the file of PDB codes (e.g. pdbcode_list) as its first argument.
foreach pdbcode (`cat $1`)
# Start from a clean working directory for this entry.
rm -fr /home/bshea/kinase_download/$pdbcode
mkdir /home/bshea/kinase_download/$pdbcode
cd /home/bshea/kinase_download/$pdbcode
# Download and unpack the PDB file.
wget -q http://www.pdb.org/pdb/files/${pdbcode}.pdb.gz
gunzip ${pdbcode}.pdb.gz
echo -n "Matching $pdbcode : "
##############################################
cat << EOF > chimera.com
open /home/bshea/kinase_download/1CKP.pdb
open /home/bshea/kinase_download/$pdbcode/$pdbcode.pdb
mmaker #0 #1 pair ss ss false iter 2.0
write format pdb 1 ${pdbcode}.matched.pdb
EOF
#############################################
# Run Chimera without the GUI and report the RMSD line for this entry.
chimera --nogui chimera.com > chimera.out
grep "RMSD between" chimera.out
end
Export the output to Excel (or similar) and sort by the number of CA atoms matched and by the RMSD. Any PDB code with a poor match should be eliminated. The rest are now ready to be added manually to the test set.
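As an alternative to a spreadsheet, the sorting step can be sketched in a few lines of Python (used here instead of csh because it makes the parsing easier). The sketch assumes each output line has the form "Matching CODE : RMSD between N atom pairs is X angstroms"; the exact wording printed by Chimera may differ between versions, so the pattern should be checked against real output. The function name `rank_matches` and all numbers in the sample are made up for illustration, and the codes beginning with 0 are deliberately not real PDB entries.

```python
import re

# Assumed line format, built from the script's echo plus Chimera's RMSD line.
LINE = re.compile(
    r"Matching\s+(\S+)\s*:\s*RMSD between (\d+) (?:pruned )?atom pairs is ([\d.]+)"
)


def rank_matches(text):
    """Parse the script output and sort by atoms matched, then by RMSD."""
    rows = []
    for line in text.splitlines():
        m = LINE.search(line)
        if m:
            rows.append((m.group(1), int(m.group(2)), float(m.group(3))))
    # Most matched atom pairs first; ties broken by the lowest RMSD.
    return sorted(rows, key=lambda r: (-r[1], r[2]))


# Invented sample output: the reference matched against itself plus three fakes.
sample = """\
Matching 1CKP : RMSD between 278 atom pairs is 0.000 angstroms
Matching 0AAA : RMSD between 265 atom pairs is 0.912 angstroms
Matching 0BBB : RMSD between 265 atom pairs is 0.745 angstroms
Matching 0CCC : RMSD between 120 atom pairs is 1.850 angstroms
"""

ranked = rank_matches(sample)
for code, pairs, rmsd in ranked:
    print(f"{code}\t{pairs}\t{rmsd:.3f}")
```

The reference entry should come out at the top with an RMSD of zero, which is the control mentioned above; entries that sink to the bottom with few matched atoms or a high RMSD are the candidates for elimination.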