Difference between revisions of "MD Simulation: Protein in Water (Pt. 2)"

From Rizzo_Lab
Jump to: navigation, search
Line 1: Line 1:
==[[MD Simulation: Protein in Water]] (Pt. 1)==
==[[MD Simulation: Protein in Water]] (Pt. 1)==
==Equilibration MD==
==Equilibration MD==

Revision as of 14:53, 6 December 2011

MD Simulation: Protein in Water (Pt. 1)

Equilibration MD

Equilibration will be carried out in two steps. First, an NVT (constant Number of atoms, Volume, and Temperature) simulation will be performed in order to bring the system to the target temperature. Second, an NPT (constant Number of atoms, Pressure, and Temperature) simulation will be performed to allow the system to find the correct density.

NVT Simulation

A parameter file for an NVT simulation can be found here: ubq_nvt.mdp. This file calls for an MD run of 25,000 steps with a 2 fs timestep (a total of 50 ps). The 'define = -DPOSRES' line initiates the backbone restraints on the protein, which should remain on until production MD. Temperature coupling is performed using the Berendsen method. In order to be sure velocities (and therefore temperature) are evenly distributed across all of the molecule types, it is a good idea to couple protein atoms and non-protein atoms to separate baths. Without separate coupling, you run the risk weird phenomena such as a system where the water molecules are at 350 K and the protein is at 200 K. Finally, all bonds are constrained using the linear constraint solver (LINCS) algorithm. Review the GROMACS manual if you are unfamiliar with the implementation or purpose of any of these parameters.

Execute the command:

grompp -f ubq_nvt.mdp -c ubiquitin_min.gro -p topol.top -o input_nvt.tpr

Read the output carefully. If it is okay to proceed, then:

mdrun -s input_nvt.tpr -deffnm ubiquitin_nvt -v

This step will take longer than the energy minimization. However, GROMACS automatically threads an 'mdrun' job across all of the processors available on the host machine. If you executed this command on George, it will try to use all 8 processors and the job should only take about 5-10 minutes. If you would like to run 'mdrun' in parallel on Cluster, the command line should look something like:

mpirun -n 16 mdrun -s input_nvt.tpr -deffnm ubiquitin_nvt -np 16

A good rule of thumb is that for a given system you will achieve the most efficient scaling if you use N processors, where N = <# of atoms> / 1000. At the completion of the job, you will find the 4 types of output files that you saw following energy minimization, in addition to 2 new output files:

1.) ubiquitin_nvt.gro    # system coordinates at the final step
2.) ubiquitin_nvt.trr    # full-precision trajectory file
3.) ubiquitin_nvt.log    # log file
4.) ubiquitin_nvt.edr    # energy file
5.) ubiquitin_nvt.xtc    # compressed trajectory file
6.) ubiquitin_nvt.cpt    # checkpoint file

The .xtc file is another trajectory file. For the same number of frames, the .xtc file is much smaller than the .trr file because it contains only atomic coordinates, whereas the .trr file also contains full-precision velocities and forces, which are only useful for continuing runs. In practice, you may find it most convenient to only write to the .trr file once at the end of a simulation, and to write to the .xtc file much more frequently (e.g., every 1 ps) for analysis. To view the trajectory, load the original coordinate file (ubiquitin_min.gro) into VMD, followed by the .xtc file. The final file, 'ubiquitin_nvt.cpt', is a new-ish implementation in GROMACS that renders the .trr file obsolete. The checkpoint file is by default written to disc every 15 minutes (and always at the very end of a simulation), and it contains the latest full-precision velocities, forces, and energies that are required to continue a simulation, or restart a crashed-simulation. This will be demonstrated in the next section, but first, it is important to verify that the ubiquitin system equilibrated around the target temperature. Use 'g_energy' again, and this time choose '14' for Temperature.

g_energy -f ubiquitin_nvt.edr -o energy.xvg

Visualize the resulting plot in Grace:

Ubq nvt energy.jpg System temperature as a function of time during NVT equilibration.

From this plot it is apparent that the system is well equilibrated around the target temperature (310 K), and it is okay to proceed to the NPT equilibration step.

NPT Simulation

Following equilibration to a target temperature, you must next relax your system into a constant pressure ensemble. A parameter file for NPT simulation can be found here: ubq_npt.mdp. As you can see, this parameter file calls for a run of 250,000 steps with a 2 fs timestep (a total of 500 ps). Position restraints are still being applied to the backbone. The Berendsen thermostat has been replaced with a Nose-Hoover thermostat, which will produce a more correct ensemble of kinetic energies. The Nose-Hoover thermostat, however, must be implemented only after the system is already near its target temperature (which was achieved by the Berendsen thermostat), or the kinetic energy of the system will fluctuate wildly. Because this run will be a continuation of the previous run, 'gen_vel' has been set to 'no'. Finally, the Parrinello-Rahman method is used to couple pressure isotropically (the same in all directions) to a value of 1.0 bar. As soon as you have a thorough understanding of the logic behind each setting in the parameter file, pre-process your run input file with grompp:

grompp -f ubq_npt.mdp -c ubiquitin_nvt.gro -t ubiquitin_nvt.cpt -p topol.top -o input_npt.tpr

The important difference with this invocation of 'grompp' is that you specified '-t ubiquitin_nvt.cpt'. This flag indicates that the forthcoming run will be a continuation of the previous run, taking into account the full-precision velocities and forces on all atoms from the checkpoint file. If there are no Errors, then:

mdrun -s input_npt.tpr -deffnm ubiquitin_npt -v

This run requires many more steps than the NVT equilibration did, and it may take an hour or so if you run it on George. When complete. you will have the same 6 types of output files as those that resulted from the NVT simulation. Use g_energy again to tell whether the goal of the equilibration was acheived:

g_energy -f ubiquitin_npt.edr -o energy.xvg

Choose '22' for density and visualize the resulting plot in Grace:

Ubq npt energy.jpg System density as a function of time during NPT equilibration.

The system appears to have quickly equilibrated very near to a physiological value. By adding the following flags to g_energy:

g_energy -f ubiquitin_npt.edr -b 250

and again choosing '22', the program reports a density of 1002.1 kg/m^3 as averaged over the last 250 ps of simulation. At this point, it is okay to move on to production MD.

Production MD

The parameter file for production MD looks almost exactly like that of the NPT equilibration step with one important exception: position restraints on the backbone have been lifted. The parameter file can be found here: ubq_md.mdp. This file calls for 2,500,000 MD steps with a 2 fs timestep, totaling 5,000 ps (or 5 ns) of sampling. Other than the longer sampling time and the release of restraints, there are no other differences between this simulation and the NPT equilibration. To pre-process the run input file, execute:

grompp -f ubq_md.mdp -c ubiquitin_npt.gro -t ubiquitin_npt.cpt -p topol.top -o input_md.tpr

Remember, it is important to provide the checkpoint file on the command line as this is a continuation of the previous run. If there are no Errors, then begin simulation with:

mdrun -s input_md.tpr -deffnm ubiquitin_md

Running this simulation on George may take 10-12 hours. It will take significantly less time if you run it in parallel on cluster. Examine the last 200 or so lines of the .log file to find some statistics on the speed of the run, as well as a very detailed decomposition of where all of the computation time was spent:

     R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G

 Computing:         Nodes     Number     G-Cycles    Seconds     %
 Domain decomp.         8    2500000    91520.964    32768.2    11.3
 DD comm. load          8    2499011      699.920      250.6     0.1
 DD comm. bounds        8    2499001     1137.751      407.4     0.1
 Comm. coord.           8    2500001     3508.266     1256.1     0.4
 Neighbor search        8    2500001   232396.884    83207.4    28.8
 Force                  8    2500001   204567.718    73243.5    25.3
 Wait + Comm. F         8    2500001     4545.445     1627.5     0.6
 PME mesh               8    2500001   213796.271    76547.7    26.5
 Write traj.            8       5041       74.665       26.7     0.0
 Update                 8    2500001    14321.922     5127.8     1.8
 Constraints            8    2500001    27057.311     9687.6     3.3
 Comm. energies         8    2500002     3582.277     1282.6     0.4
 Rest                   8               11075.937     3965.6     1.4
 Total                  8              808285.331   289398.7   100.0
 PME redist. X/F        8    5000002    45742.470    16377.6     5.7
 PME spread/gather      8    5000002    94963.334    34000.7    11.7
 PME 3D-FFT             8    5000002    52335.780    18738.3     6.5
 PME solve              8    2500001    20644.201     7391.5     2.6

        Parallel run - timing based on wallclock.

               NODE (s)   Real (s)      (%)
       Time:  36174.836  36174.836    100.0
               (Mnbf/s)   (GFlops)   (ns/day)  (hour/ns)
Performance:    133.592     11.329     11.942      2.010
Finished mdrun on node 0 Fri Jul 22 11:24:04 2011

Extending a run requires two simple steps. First, the run input file is modified to specify a longer simulation time. Second, 'mdrun' is called again with some additional flags. For example, if you would like to extend this run by 5,000 ps, you would execute the following:

tpbconv -s input_md.tpr -o input_md_extended.tpr -extend 5000
mdrun -s input_md_extended.tpr -deffnm ubiquitin_md -cpi ubiquitin_md.cpt -append

The first command uses the tool 'tpbconv' to add 5,000 ps of simulation time to the .tpr file. Of course the output file name (specified with the '-o' flag) is arbitrary. In fact, the output file name can be the same as the input file name. GROMACS would automatically backup the original file and you would not need to worry about keeping track of multiple .tpr files. The '-append' flag on the second line tells 'mdrun' to append all new calculations to the end of the original output files from the first 5 ns simulation. In fact, if your run were to die because the server crashes, for example, you would just need to execute the 'mdrun' command above with your original .tpr file in order to restart it.


GROMACS comes equipped with many analysis tools, a complete list of which can be found in the manual. Here you will be exposed to a few useful analysis tools: 'g_rms', 'g_rmsf', and 'do_dssp'. But first, it is useful to learn how to process your trajectory file to only keep the components you are interested in. For example, start with the two files:

1.) input_md.tpr        # run input file for production MD
2.) ubiquitin_md.xtc    # full system 5 ns trajectory

The next three analyses that you are going to perform only require protein coordinates, not water / ion coordinates. Use 'trjconv' to remove water molecules and ions from the .xtc file, then use 'tpbconv' to remove water / ion information from the .tpr file:

trjconv -s input_md.tpr -f ubiquitin_md.xtc -o protein.xtc
tpbconv -s input_md.tpr -o protein.tpr

In each case, you will be prompted to choose a group. Choose group '1 (Protein)' both times. Now, use 'trjconv' again to dump out the first frame of the trajectory:

trjconv -s protein.tpr -f protein.xtc -o protein.gro -dump 0

Choose either '0 (System)' or '1 (Protein)' to dump all protein coordinates (with these modified input files, the protein is the whole system). The '-dump 0' command will write out the coordinates closest to 0 ps. At this point, it is a good idea to open protein.gro in VMD, followed by protein.xtc in order to watch the trajectory and make sure nothing funny happened during the simulation. In the production MD parameter file, you turned on center-of-mass motion removal. That is why the protein tends to stay in the center of the box during simulation. However, you will notice that it still rotates around all three axes freely. To remove that, you simply need to perform a best-fit translation and rotation calculation on each frame of the trajectory using the starting coordinates as the reference structure. To do so, execute:

trjconv -s protein.tpr -f protein.xtc -o protein_fit.xtc -fit rot+trans

It will prompt you for a group to use when calculating the least-squares fit. Traditionally, the backbone is used for such a calculation, so choose '4 (Backbone)'. Then it will ask which group you would like to write to disk. Choose '0 (System)' or '1 (Protein)'. Load this modified trajectory into VMD to see the effect.


Root mean square deviation (RMSD) is a measure of how much the protein structure changes over the course of the simulation. Changes on the order of 1-3 Angstroms are perfectly acceptable (and expected) for small, globular proteins. Changes much larger than that, however, indicate that your protein may be misbehaving during the simulation. The other thing to look for is convergence. If the RMSD of your protein is still changing at the end of your simulation (trending upwards, usually), then your simulation is probably not long enough. To measure RMSD, use the GROMACS tool 'g_rms':

g_rms -s protein.tpr -f protein_fit.xtc -o rmsd.xvg

Choose '4 (Backbone)' for the least squares fit, and '4 (Backbone)' again for the RMSD measurement. The reference structure is taken from 'protein.tpr'. Each frame in protein_fit.xtc is iteratively fit to the reference structure (because you already did this in the previous step, you can turn the fitting off by specifying '-fit none' on the command line). The output file will be called 'rmsd.xvg', which can be visualized in Grace.

Ubq rmsd.jpg Ubiquitin backbone RMSD during a 5 ns NPT production simulation.

The black line represent the RMSD of the protein backbone during the simulation, which remains within an acceptable range. A look at the running average of the RMSD (red line) indicates that, despite this being only a 5 ns simulation, the RMSD is reasonably stable around 0.11 nm.


RMSD is a measure of global backbone deviation. If you are interested in more local changes, you can measure the root mean square fluctuation (RMSF). The GROMACS tool 'g_rmsf' was written for that purpose. Execute the following command:

g_rmsf -s protein.tpr -f protein_fit.xtc -o rmsf.xvg -oq rmsf.pdb -res

Now you must make a decision. Are you more interested in backbone fluctuation, sidechain fluctuation, entire residue fluctuation, or something completely different? What you pick here will probably change from project to project depending on the scientific question you are trying to ask. In the case of ubiquitin, you may ask yourself, why were the occupancy values of the last four residues less than 1? (If you remember back to the crystal structure and 'pdb2gmx', you were presented with a Warning about this exactly). The occupancy values in the crystal structure applied to entire residues - not just the backbone or just the sidechain. Therefore, choose '1 (Protein)', in order to measure fluctuations of whole residues (the '-res' flag will group the output by whole residues rather than by individual atoms). There are two important output files:

1.) rmsf.xvg    # a data file of RMSF values plotted by residue number that can be viewed in Grace
2.) rmsf.pdb    # a structure file that contains beta-factors equal to RMSF values

The first file should be viewed in Grace:

Ubq rmsf plot.jpg Ubiquitin backbone RMSF averaged over 5 ns simulation.

On this plot, peaks indicate areas of the protein that fluctuated the most during the simulation. The C-terminal tail fluctuated much more than any other part of the protein, helping to explain why the crystallographers assigned such a low resolution to that area.

Open the second file, 'rmsf.pdb', in VMD and color residues by Beta. The result should look like the following:

Ubq rmsf structure.jpg Ubiquitin structure colored by RMSF. Red values indicate areas of more fluctuation, blue values indicate areas of less fluctuation.

Now it is easy to tell that the C-terminus of the protein fluctuates the most, and it appears that the alpha-helix is the most stable part of the protein.

As a final note on RMSF analysis, you must be aware that the RMSF plot will change depending on what part of the simulation you choose to analyze. For example, the plot seen above is the average fluctuation of each residue over the entire simulation. In practice, that is not the best way to measure RMSF. At the beginning of the simulation, certain residues may be fluctuating more as they transition from the crystal form of the protein to a fully solvated and dynamic form. It is better to simulate for several nanoseconds, and when the RMSD converges, use that as a starting point for RMSF measurements. For example if you want to measure RMSF beginning from 4 ns, the command would be:

g_rmsf -s protein.tpr -f protein_fit.xtc -o rmsf.xvg -oq rmsf.pdb -res -b 4000


DSSP stands for Dictionary of Secondary Structure of Proteins (or something similar). The purpose of this analysis is to measure the secondary structure content of a protein as a function of time. The analysis breaks down secondary structure assignments (i.e. helix, sheet, turn, etc.) by individual residues for each time step, allowing you to visualize or quantify the data in meaningful ways. The GROMACS tool 'do_dssp' uses the DSSP library and executable from this website to measure secondary structure in your trajectory file. George already has the appropriate DSSP library and executable installed. All you have to do to use dssp is issue 'do_dssp', and GROMACS will automatically find the library and executable. Type the following:

do_dssp -f protein.xtc -s protein.tpr -sc scount.xvg -o ss.xpm -dt 10

When prompted, you must choose '5 (Mainchain)'. The trajectory saved snapshots at every 1 ps, which is a little too frequent for this type of analysis, so use the option '-dt 10' to only look at every 10th frame. The first output file, scount.xvg, can be visualized in Grace by typing:

xmgrace -nxy scount.xvg

Ubq scount.jpg Ubiquitin secondary structure content

As an added bonus, the end of the file 'scount.xvg' contains the following two lines:

# Totals 26169  9107 11705  1009  1506  7064  6391  1294
# SS %    0.69  0.24  0.31  0.03  0.04  0.19  0.17  0.03


@ s0 legend "Structure"
@ s1 legend "Coil"
@ s2 legend "B-Sheet"
@ s3 legend "B-Bridge"
@ s4 legend "Bend"
@ s5 legend "Turn"
@ s6 legend "A-Helix"
@ s7 legend "3-Helix"

This is a very convenient way to analyze average secondary structure content of a protein. This could be especially useful when trying to predict the secondary structure of unstructured proteins as well.

The second file, ss.xpm, contains a breakdown of secondary structure for each residue as a function of time step. Although some programs can open this type of file, it is generally easier to process it with the GROMACS tool 'xpm2ps' first. Make a copy of this file: ps.m2p, then execute the following:

xpm2ps -f ss.xpm -di ps.m2p -o ubq_ss.eps

The tool 'xpm2ps' has quite a lot of options. You can type 'xpm2ps -h' or check the GROMACS website to see how else you can render postscript images from xpm data. The version of ps.m2p that you downloaded will render the following image:

Ubq ss.jpg Ubiquitin secondary structure content by residue

From this plot it is easy to see the stability (or de-stability) of secondary structure elements as a function of time. For example, it appears a stretch of residues from about 55-60 mostly are in a 'turn' conformation, but sporadically adopt an 'alpha-helix' or '3-helix' conformation as well.