MD Simulation: Protein in Water (Pt. 2)
Contents
MD Simulation: Protein in Water (Pt. 1)
Equilibration MD
Equilibration will be carried out in two steps. First, an NVT (constant Number of atoms, Volume, and Temperature) simulation will be performed in order to bring the system to the target temperature. Second, an NPT (constant Number of atoms, Pressure, and Temperature) simulation will be performed to allow the system to find the correct density.
NVT Simulation
A parameter file for an NVT simulation can be found here: ubq_nvt.mdp. This file calls for an MD run of 25,000 steps with a 2 fs timestep (a total of 50 ps). The 'define = -DPOSRES' line initiates the backbone restraints on the protein, which should remain on until production MD. Temperature coupling is performed using the Berendsen method. In order to be sure velocities (and therefore temperature) are evenly distributed across all of the molecule types, it is a good idea to couple protein atoms and non-protein atoms to separate baths. Without separate coupling, you run the risk weird phenomena such as a system where the water molecules are at 350 K and the protein is at 200 K. Finally, all bonds are constrained using the linear constraint solver (LINCS) algorithm. Review the GROMACS manual if you are unfamiliar with the implementation or purpose of any of these parameters.
Execute the command:
grompp -f ubq_nvt.mdp -c ubiquitin_min.gro -p topol.top -o input_nvt.tpr
Read the output carefully. If it is okay to proceed, then:
mdrun -s input_nvt.tpr -deffnm ubiquitin_nvt -v
This step will take longer than the energy minimization. However, GROMACS automatically threads an 'mdrun' job across all of the processors available on the host machine. If you executed this command on George, it will try to use all 8 processors and the job should only take about 5-10 minutes. If you would like to run 'mdrun' in parallel on Cluster, the command line should look something like:
mpirun -n 16 mdrun -s input_nvt.tpr -deffnm ubiquitin_nvt -np 16
A good rule of thumb is that for a given system you will achieve the most efficient scaling if you use N processors, where N = <# of atoms> / 1000. At the completion of the job, you will find the 4 types of output files that you saw following energy minimization, in addition to 2 new output files:
1.) ubiquitin_nvt.gro # system coordinates at the final step 2.) ubiquitin_nvt.trr # full-precision trajectory file 3.) ubiquitin_nvt.log # log file 4.) ubiquitin_nvt.edr # energy file 5.) ubiquitin_nvt.xtc # compressed trajectory file 6.) ubiquitin_nvt.cpt # checkpoint file
The .xtc file is another trajectory file. For the same number of frames, the .xtc file is much smaller than the .trr file because it contains only atomic coordinates, whereas the .trr file also contains full-precision velocities and forces, which are only useful for continuing runs. In practice, you may find it most convenient to only write to the .trr file once at the end of a simulation, and to write to the .xtc file much more frequently (e.g., every 1 ps) for analysis. To view the trajectory, load the original coordinate file (ubiquitin_min.gro) into VMD, followed by the .xtc file. The final file, 'ubiquitin_nvt.cpt', is a new-ish implementation in GROMACS that renders the .trr file obsolete. The checkpoint file is by default written to disc every 15 minutes (and always at the very end of a simulation), and it contains the latest full-precision velocities, forces, and energies that are required to continue a simulation, or restart a crashed-simulation. This will be demonstrated in the next section, but first, it is important to verify that the ubiquitin system equilibrated around the target temperature. Use 'g_energy' again, and this time choose '14' for Temperature.
g_energy -f ubiquitin_nvt.edr -o energy.xvg
Visualize the resulting plot in Grace:
System temperature as a function of time during NVT equilibration.
From this plot it is apparent that the system is well equilibrated around the target temperature (310 K), and it is okay to proceed to the NPT equilibration step.
NPT Simulation
Following equilibration to a target temperature, you must next relax your system into a constant pressure ensemble. A parameter file for NPT simulation can be found here: ubq_npt.mdp. As you can see, this parameter file calls for a run of 250,000 steps with a 2 fs timestep (a total of 500 ps). Position restraints are still being applied to the backbone. The Berendsen thermostat has been replaced with a Nose-Hoover thermostat, which will produce a more correct ensemble of kinetic energies. The Nose-Hoover thermostat, however, must be implemented only after the system is already near its target temperature (which was achieved by the Berendsen thermostat), or the kinetic energy of the system will fluctuate wildly. Because this run will be a continuation of the previous run, 'gen_vel' has been set to 'no'. Finally, the Parrinello-Rahman method is used to couple pressure isotropically (the same in all directions) to a value of 1.0 bar. As soon as you have a thorough understanding of the logic behind each setting in the parameter file, pre-process your run input file with grompp:
grompp -f ubq_npt.mdp -c ubiquitin_nvt.gro -t ubiquitin_nvt.cpt -p topol.top -o input_npt.tpr
The important difference with this invocation of 'grompp' is that you specified '-t ubiquitin_nvt.cpt'. This flag indicates that the forthcoming run will be a continuation of the previous run, taking into account the full-precision velocities and forces on all atoms from the checkpoint file. If there are no Errors, then:
mdrun -s input_npt.tpr -deffnm ubiquitin_npt -v
This run requires many more steps than the NVT equilibration did, and it may take an hour or so if you run it on George. When complete. you will have the same 6 types of output files as those that resulted from the NVT simulation. Use g_energy again to tell whether the goal of the equilibration was acheived:
g_energy -f ubiquitin_npt.edr -o energy.xvg
Choose '22' for density and visualize the resulting plot in Grace:
System density as a function of time during NPT equilibration.
The system appears to have quickly equilibrated very near to a physiological value. By adding the following flags to g_energy:
g_energy -f ubiquitin_npt.edr -b 250
and again choosing '22', the program reports a density of 1002.1 kg/m^3 as averaged over the last 250 ps of simulation. At this point, it is okay to move on to production MD.
Production MD
The parameter file for production MD looks almost exactly like that of the NPT equilibration step with one important exception: position restraints on the backbone have been lifted. The parameter file can be found here: ubq_md.mdp. This file calls for 2,500,000 MD steps with a 2 fs timestep, totaling 5,000 ps (or 5 ns) of sampling. Other than the longer sampling time and the release of restraints, there are no other differences between this simulation and the NPT equilibration. To pre-process the run input file, execute:
grompp -f ubq_md.mdp -c ubiquitin_npt.gro -t ubiquitin_npt.cpt -p topol.top -o input_md.tpr
Remember, it is important to provide the checkpoint file on the command line as this is a continuation of the previous run. If there are no Errors, then begin simulation with:
mdrun -s input_md.tpr -deffnm ubiquitin_md
Running this simulation on a small linux server may take 10-12 hours. It will take significantly less time if you run it in parallel on a cluster. Examine the last 200 or so lines of the .log file to find some statistics on the speed of the run, as well as a very detailed decomposition of where all of the computation time was spent:
R E A L C Y C L E A N D T I M E A C C O U N T I N G Computing: Nodes Number G-Cycles Seconds % ----------------------------------------------------------------------- Domain decomp. 8 2500000 91520.964 32768.2 11.3 DD comm. load 8 2499011 699.920 250.6 0.1 DD comm. bounds 8 2499001 1137.751 407.4 0.1 Comm. coord. 8 2500001 3508.266 1256.1 0.4 Neighbor search 8 2500001 232396.884 83207.4 28.8 Force 8 2500001 204567.718 73243.5 25.3 Wait + Comm. F 8 2500001 4545.445 1627.5 0.6 PME mesh 8 2500001 213796.271 76547.7 26.5 Write traj. 8 5041 74.665 26.7 0.0 Update 8 2500001 14321.922 5127.8 1.8 Constraints 8 2500001 27057.311 9687.6 3.3 Comm. energies 8 2500002 3582.277 1282.6 0.4 Rest 8 11075.937 3965.6 1.4 ----------------------------------------------------------------------- Total 8 808285.331 289398.7 100.0 ----------------------------------------------------------------------- ----------------------------------------------------------------------- PME redist. X/F 8 5000002 45742.470 16377.6 5.7 PME spread/gather 8 5000002 94963.334 34000.7 11.7 PME 3D-FFT 8 5000002 52335.780 18738.3 6.5 PME solve 8 2500001 20644.201 7391.5 2.6 ----------------------------------------------------------------------- Parallel run - timing based on wallclock. NODE (s) Real (s) (%) Time: 36174.836 36174.836 100.0 10h02:54 (Mnbf/s) (GFlops) (ns/day) (hour/ns) Performance: 133.592 11.329 11.942 2.010 Finished mdrun on node 0 Fri Jul 22 11:24:04 2011
Extending a run requires two simple steps. First, the run input file is modified to specify a longer simulation time. Second, 'mdrun' is called again with some additional flags. For example, if you would like to extend this run by 5,000 ps, you would execute the following:
tpbconv -s input_md.tpr -o input_md_extended.tpr -extend 5000 mdrun -s input_md_extended.tpr -deffnm ubiquitin_md -cpi ubiquitin_md.cpt -append
The first command uses the tool 'tpbconv' to add 5,000 ps of simulation time to the .tpr file. Of course the output file name (specified with the '-o' flag) is arbitrary. In fact, the output file name can be the same as the input file name. GROMACS would automatically backup the original file and you would not need to worry about keeping track of multiple .tpr files. The '-append' flag on the second line tells 'mdrun' to append all new calculations to the end of the original output files from the first 5 ns simulation. In fact, if your run were to die because the server crashes, for example, you would just need to execute the 'mdrun' command above with your original .tpr file in order to restart it.
Analysis
GROMACS comes equipped with many analysis tools, a complete list of which can be found in the manual. Here you will be exposed to a few useful analysis tools: 'g_rms', 'g_rmsf', and 'do_dssp'. But first, it is useful to learn how to process your trajectory file to only keep the components you are interested in. For example, start with the two files:
1.) input_md.tpr # run input file for production MD 2.) ubiquitin_md.xtc # full system 5 ns trajectory
The next three analyses that you are going to perform only require protein coordinates, not water / ion coordinates. Use 'trjconv' to remove water molecules and ions from the .xtc file, then use 'tpbconv' to remove water / ion information from the .tpr file:
trjconv -s input_md.tpr -f ubiquitin_md.xtc -o protein.xtc tpbconv -s input_md.tpr -o protein.tpr
In each case, you will be prompted to choose a group. Choose group '1 (Protein)' both times. Now, use 'trjconv' again to dump out the first frame of the trajectory:
trjconv -s protein.tpr -f protein.xtc -o protein.gro -dump 0
Choose either '0 (System)' or '1 (Protein)' to dump all protein coordinates (with these modified input files, the protein is the whole system). The '-dump 0' command will write out the coordinates closest to 0 ps. At this point, it is a good idea to open protein.gro in VMD, followed by protein.xtc in order to watch the trajectory and make sure nothing funny happened during the simulation. In the production MD parameter file, you turned on center-of-mass motion removal. That is why the protein tends to stay in the center of the box during simulation. However, you will notice that it still rotates around all three axes freely. To remove that, you simply need to perform a best-fit translation and rotation calculation on each frame of the trajectory using the starting coordinates as the reference structure. To do so, execute:
trjconv -s protein.tpr -f protein.xtc -o protein_fit.xtc -fit rot+trans
It will prompt you for a group to use when calculating the least-squares fit. Traditionally, the backbone is used for such a calculation, so choose '4 (Backbone)'. Then it will ask which group you would like to write to disk. Choose '0 (System)' or '1 (Protein)'. Load this modified trajectory into VMD to see the effect.
RMSD
Root mean square deviation (RMSD) is a measure of how much the protein structure changes over the course of the simulation. Changes on the order of 1-3 Angstroms are perfectly acceptable (and expected) for small, globular proteins. Changes much larger than that, however, indicate that your protein may be misbehaving during the simulation. The other thing to look for is convergence. If the RMSD of your protein is still changing at the end of your simulation (trending upwards, usually), then your simulation is probably not long enough. To measure RMSD, use the GROMACS tool 'g_rms':
g_rms -s protein.tpr -f protein_fit.xtc -o rmsd.xvg
Choose '4 (Backbone)' for the least squares fit, and '4 (Backbone)' again for the RMSD measurement. The reference structure is taken from 'protein.tpr'. Each frame in protein_fit.xtc is iteratively fit to the reference structure (because you already did this in the previous step, you can turn the fitting off by specifying '-fit none' on the command line). The output file will be called 'rmsd.xvg', which can be visualized in Grace.
Ubiquitin backbone RMSD during a 5 ns NPT production simulation.
The black line represents the RMSD of the protein backbone during the simulation, which remains within an acceptable range. A look at the running average of the RMSD (red line) indicates that, despite this being only a 5 ns simulation, the RMSD is reasonably stable around 0.11 nm.
RMSF
RMSD is a measure of global backbone deviation. If you are interested in more local changes, you can measure the root mean square fluctuation (RMSF). The GROMACS tool 'g_rmsf' was written for that purpose. Execute the following command:
g_rmsf -s protein.tpr -f protein_fit.xtc -o rmsf.xvg -oq rmsf.pdb -res
Now you must make a decision. Are you more interested in backbone fluctuation, sidechain fluctuation, entire residue fluctuation, or something completely different? What you pick here will probably change from project to project depending on the scientific question you are trying to ask. In the case of ubiquitin, you may ask yourself, why were the occupancy values of the last four residues less than 1? (If you remember back to the crystal structure and 'pdb2gmx', you were presented with a Warning about this exactly). The occupancy values in the crystal structure applied to entire residues - not just the backbone or just the sidechain. Therefore, choose '1 (Protein)', in order to measure fluctuations of whole residues (the '-res' flag will group the output by whole residues rather than by individual atoms). There are two important output files:
1.) rmsf.xvg # a data file of RMSF values plotted by residue number that can be viewed in Grace 2.) rmsf.pdb # a structure file that contains beta-factors equal to RMSF values
The first file should be viewed in Grace:
Ubiquitin backbone RMSF averaged over 5 ns simulation.
On this plot, peaks indicate areas of the protein that fluctuated the most during the simulation. The C-terminal tail fluctuated much more than any other part of the protein, helping to explain why the crystallographers assigned such a low resolution to that area.
Open the second file, 'rmsf.pdb', in VMD and color residues by Beta. The result should look like the following:
Ubiquitin structure colored by RMSF. Red values indicate areas of more fluctuation, blue values indicate areas of less fluctuation.
Now it is easy to tell that the C-terminus of the protein fluctuates the most, and it appears that the alpha-helix is the most stable part of the protein.
As a final note on RMSF analysis, you must be aware that the RMSF plot will change depending on what part of the simulation you choose to analyze. For example, the plot seen above is the average fluctuation of each residue over the entire simulation. In practice, that is not the best way to measure RMSF. At the beginning of the simulation, certain residues may be fluctuating more as they transition from the crystal form of the protein to a fully solvated and dynamic form. It is better to simulate for several nanoseconds, and when the RMSD converges, use that as a starting point for RMSF measurements. For example if you want to measure RMSF beginning from 4 ns, the command would be:
g_rmsf -s protein.tpr -f protein_fit.xtc -o rmsf.xvg -oq rmsf.pdb -res -b 4000
DSSP
DSSP stands for Dictionary of Secondary Structure of Proteins (or something similar). The purpose of this analysis is to measure the secondary structure content of a protein as a function of time. The analysis breaks down secondary structure assignments (i.e. helix, sheet, turn, etc.) by individual residues for each time step, allowing you to visualize or quantify the data in meaningful ways. The GROMACS tool 'do_dssp' uses the DSSP library and executable from this website to measure secondary structure in your trajectory file. George already has the appropriate DSSP library and executable installed. All you have to do to use dssp is issue 'do_dssp', and GROMACS will automatically find the library and executable. Type the following:
do_dssp -f protein.xtc -s protein.tpr -sc scount.xvg -o ss.xpm -dt 10
When prompted, you must choose '5 (Mainchain)'. The trajectory saved snapshots at every 1 ps, which is a little too frequent for this type of analysis, so use the option '-dt 10' to only look at every 10th frame. The first output file, scount.xvg, can be visualized in Grace by typing:
xmgrace -nxy scount.xvg
Ubiquitin secondary structure content
As an added bonus, the end of the file 'scount.xvg' contains the following two lines:
# Totals 26169 9107 11705 1009 1506 7064 6391 1294 # SS % 0.69 0.24 0.31 0.03 0.04 0.19 0.17 0.03
where:
@ s0 legend "Structure" @ s1 legend "Coil" @ s2 legend "B-Sheet" @ s3 legend "B-Bridge" @ s4 legend "Bend" @ s5 legend "Turn" @ s6 legend "A-Helix" @ s7 legend "3-Helix"
This is a very convenient way to analyze average secondary structure content of a protein. This could be especially useful when trying to predict the secondary structure of unstructured proteins as well.
The second file, ss.xpm, contains a breakdown of secondary structure for each residue as a function of time step. Although some programs can open this type of file, it is generally easier to process it with the GROMACS tool 'xpm2ps' first. Make a copy of this file: ps.m2p, then execute the following:
xpm2ps -f ss.xpm -di ps.m2p -o ubq_ss.eps
The tool 'xpm2ps' has quite a lot of options. You can type 'xpm2ps -h' or check the GROMACS website to see how else you can render postscript images from xpm data. The version of ps.m2p that you downloaded will render the following image:
Ubiquitin secondary structure content by residue
From this plot it is easy to see the stability (or de-stability) of secondary structure elements as a function of time. For example, it appears a stretch of residues from about 55-60 mostly are in a 'turn' conformation, but sporadically adopt an 'alpha-helix' or '3-helix' conformation as well.