Automated Families generation from the Protein Databank

Latest revision as of 12:35, 12 August 2011

This is under construction: Things to come:

description of the actual step-by-step protocol (a numbered list) and
inclusion of screenshots along with specific examples of a kinase family.

The last steps should include

- how to download all the files

- alignment of the files using chimera

- a picture of the relevant chimera output which says how many residues were aligned and what the rmsd is

- a picture of the final aligned structures.

The RCSB Protein Data Bank (PDB) is the largest database of its kind and contains detailed structural information on all kinds of proteins. Making full use of its search tools makes it possible to find similar proteins with bound ligands, which can then be used for docking and related calculations. Streamlining this task reduces the time needed to find such structures and leaves more time for the intricate details of the kinases themselves.

The first step is to search for the individual protein you are working with. The PDB will pull up the entry for it (the file is available for download if desired), which contains specific details on the protein. The EC number, a description of what the protein is responsible for, its various names and associated categories, and the crystal structure are just some of the data displayed.

The next step is to look at the structure of the chains of the protein. Examining them will confirm that the protein is the type (chemically) that you want to work with. The number of residues and the amino acid sequence are shown.

The tab next to sequence is sequence similarity. It displays the cluster sequence cutoff percentages, along with the rank, the number of chains in the cluster, and the cluster number. Typical cutoffs include 100%, 95%, 90%, 70%, 50%, and 40%. Other percentages (such as 60% and 85%) may appear depending on the individual protein, as they are based on the number of similar proteins that have been discovered. Clicking on the desired percentage similarity gives a list of proteins with similar amino acid sequences.

The final step is to review the results personally to make sure each hit is the same type of protein and to check whether or not it has ligands. From there, it is possible to create a list to work with when eventually moving on to docking.
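The ligand check in this review step can be partly automated once a PDB file has been downloaded. The following is only a minimal sketch, not part of the original protocol: it assumes standard fixed-column PDB records, treats a small hypothetical set of residue names as solvent, and uses made-up example records with a placeholder ligand name (LIG).

```python
from collections import Counter

# Residue names treated as solvent rather than ligand.
# This set is an assumption for illustration; extend it as needed (ions, buffers, ...).
SOLVENT = {"HOH", "WAT", "DOD"}

def ligand_counts(pdb_lines):
    """Count HETATM atoms per residue name, skipping solvent."""
    counts = Counter()
    for line in pdb_lines:
        if line.startswith("HETATM"):
            resname = line[17:20].strip()  # PDB columns 18-20 hold the residue name
            if resname and resname not in SOLVENT:
                counts[resname] += 1
    return counts

# Made-up records for illustration only (not taken from a real entry).
example = [
    "ATOM      1  N   MET A   1      27.340  24.430   2.614  1.00  9.67           N",
    "HETATM 2400  N1  LIG A 299      10.000  11.000  12.000  1.00 20.00           N",
    "HETATM 2401  C2  LIG A 299      10.500  11.500  12.500  1.00 20.00           C",
    "HETATM 2450  O   HOH A 301       5.000   6.000   7.000  1.00 30.00           O",
]
print(dict(ligand_counts(example)))  # {'LIG': 2}
```

An empty result suggests the structure has no bound ligand, but a non-empty one still needs a manual look, since ions and crystallization additives also appear as HETATM records.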

The PDB search protocol is a useful tool for finding proteins that are similar to one another. The original approach was to use a program written for UNIX. However, it proved unreliable because the program could not perform comparisons when the molecules were quite large.

Script to download and match protein families

For this script, you need a list of PDB codes that are likely to be in the same family. A search on the PDB, or a search by EC number, is a good way of obtaining this list. Store the codes in a file, e.g. pdbcode_list. Select a PDB file as the reference. Ideally, this would be the wild-type protein with a high-quality X-ray structure without missing residues. In the example below, it is 1CKP.pdb. Include the reference code in pdbcode_list as a control; it should match itself with an RMSD of zero.
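Search results copied from the PDB often contain duplicates, mixed case, or stray text. As a rough sketch of how the list could be tidied before it is saved to pdbcode_list, the hypothetical helper below keeps only four-character codes that start with a digit, removes duplicates, and puts the reference first so that it serves as the control. The function name and the sample codes are illustrative assumptions, not part of the original protocol.

```python
import re

def clean_pdb_codes(raw_codes, reference="1CKP"):
    """Return unique, upper-cased 4-character PDB codes with the reference first."""
    # Classic PDB ID: one digit followed by three letters or digits.
    pattern = re.compile(r"^[0-9][A-Za-z0-9]{3}$")
    seen = []
    for code in [reference] + list(raw_codes):
        code = code.strip().upper()
        if pattern.match(code) and code not in seen:
            seen.append(code)
    return seen

# Placeholder input; in practice this would come from a PDB or EC number search.
codes = clean_pdb_codes(["1hck", "1CKP", " 2a4l ", "not-a-code", "1HCK"])
print("\n".join(codes))
```

Writing the printed lines to a file gives one code per line, which is the form the `cat $1` loop in the script expects.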

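#!/bin/csh -f
# Usage: pass the file of PDB codes (e.g. pdbcode_list) as the first argument.
# The reference file 1CKP.pdb must already exist in /home/bshea/kinase_download.
foreach pdbcode (`cat $1`)
 # Start from a clean directory for this entry
 rm -fr /home/bshea/kinase_download/$pdbcode
 mkdir /home/bshea/kinase_download/$pdbcode
 cd /home/bshea/kinase_download/$pdbcode
 # Download and unpack the PDB file
 # (www.pdb.org is the address used when this was written; adjust if the download host has changed)
 wget -q http://www.pdb.org/pdb/files/${pdbcode}.pdb.gz
 gunzip ${pdbcode}.pdb.gz
 echo -n "Matching $pdbcode : "
##############################################
# Write the Chimera command file; the here-document must start at column 0
cat << EOF > chimera.com
open /home/bshea/kinase_download/1CKP.pdb
open /home/bshea/kinase_download/$pdbcode/$pdbcode.pdb
mmaker #0 #1 pair ss ss false iter 2.0
write format pdb 1 ${pdbcode}.matched.pdb
EOF
#############################################
 # Run Chimera without the GUI and print the RMSD line after the code
 chimera --nogui chimera.com > chimera.out
 grep "RMSD between" chimera.out
end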

Export the output to Excel (or similar) and sort by the number of CA atoms matched and by the RMSD. Any PDB code with a poor match should be eliminated. The rest are now ready to be manually added to the test set.
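As an alternative to a spreadsheet, the sorting and filtering could be sketched in Python. This is only an illustration under stated assumptions: the line format in the regular expression is assumed from the echo and grep commands in the script and may differ between Chimera versions, the function name is hypothetical, the cutoffs (200 atom pairs, 1.5 angstroms) are arbitrary examples rather than recommended values, and the log lines are made up.

```python
import re

# Assumed line format (one line per entry from the script's echo + grep), e.g.
# "Matching 1CKP : RMSD between 280 atom pairs is 0.000 angstroms"
LINE_RE = re.compile(
    r"Matching\s+(\S+)\s*:\s*RMSD between (\d+) (?:pruned )?atom pairs is ([\d.]+) angstroms"
)

def rank_matches(lines, min_pairs=0, max_rmsd=float("inf")):
    """Parse match lines, drop poor matches, and sort the rest."""
    rows = []
    for line in lines:
        m = LINE_RE.search(line)
        if m is None:
            continue  # no RMSD reported, e.g. the download or the match failed
        code, pairs, rmsd = m.group(1), int(m.group(2)), float(m.group(3))
        if pairs >= min_pairs and rmsd <= max_rmsd:
            rows.append((code, pairs, rmsd))
    # Most matched atoms first; ties broken by lowest RMSD
    return sorted(rows, key=lambda r: (-r[1], r[2]))

# Made-up output for illustration; XXXX and YYYY are placeholder codes.
log = [
    "Matching 1CKP : RMSD between 280 atom pairs is 0.000 angstroms",
    "Matching XXXX : RMSD between 120 atom pairs is 1.950 angstroms",
    "Matching YYYY : RMSD between 265 atom pairs is 0.850 angstroms",
]
for code, pairs, rmsd in rank_matches(log, min_pairs=200, max_rmsd=1.5):
    print(f"{code}\t{pairs}\t{rmsd:.3f}")
```

With the example cutoffs, the poorly matched XXXX entry is dropped and the reference appears first, which is the expected behavior for the control.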