2008 | Berk Hess, Carsten Kutzner, David van der Spoel, and Erik Lindahl
GROMACS 4 is a new implementation of the molecular simulation toolkit that achieves high performance on single processors and scales well on parallel machines. It includes a minimal-communication domain decomposition algorithm, full dynamic load balancing, a state-of-the-art parallel constraint solver, and efficient virtual site algorithms that allow removal of hydrogen atom degrees of freedom to enable integration time steps up to 5 fs for atomistic simulations. To improve the scaling properties of the common particle mesh Ewald electrostatics algorithms, a Multiple-Program, Multiple-Data approach is used, with separate node domains responsible for direct and reciprocal space interactions. These four key advances are described in the next three sections, followed by a description of other new features and a set of benchmarks to illustrate both absolute performance and relative scaling.
The domain decomposition method used in GROMACS 4 is an eighth-shell method that reduces the amount of data communicated. This method is combined with a full dynamic load-balancing algorithm that itself requires a minimum amount of communication. The constraint algorithm LINCS is parallelized so that holonomic constraints can be handled without iterative communication. The PME calculation is split off to dedicated PME processors. These four key advances enable extremely long simulations of large systems and provide high simulation performance on quite modest numbers of standard cluster nodes.
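To illustrate the Multiple-Program, Multiple-Data idea behind the PME split, the following C sketch divides MPI ranks into a direct-space group and a dedicated PME group. The 3:1 ratio, the rank assignment, and all names are our own assumptions for illustration, not the GROMACS implementation.

    /* Sketch: splitting ranks into direct-space and PME groups in an
     * MPMD fashion, with a hypothetical 3:1 ratio. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size, is_pme;
        MPI_Comm work_comm;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Assign, e.g., every fourth rank to reciprocal-space (PME)
         * work and the rest to direct-space interactions. */
        is_pme = (rank % 4 == 3);

        /* Each group gets its own communicator for its internal work;
         * coordinates and forces are exchanged between the groups. */
        MPI_Comm_split(MPI_COMM_WORLD, is_pme, rank, &work_comm);

        printf("rank %d of %d: %s node\n", rank, size,
               is_pme ? "PME (reciprocal space)" : "direct space");

        MPI_Comm_free(&work_comm);
        MPI_Finalize();
        return 0;
    }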
The domain decomposition method is illustrated in Figure 1. The basic eighth-shell method was already described in 1991 by Liem et al., who implemented communication with nearest neighbors only. In GROMACS 4, this method is extended to communication with multiple cells and staggered grids for dynamic load balancing. The Shaw group has since chosen the midpoint method for their Desmond code, since it can take advantage of hardware where each processor has two network connections that simultaneously send and receive. After quite stimulating discussions with the Shaw group, we chose not to switch to the midpoint method, not only because we avoid the calculation of the midpoint, which has to be determined binary identically on multiple processors, but also because not all hardware that GROMACS will run on has two network connections. With only one network connection, a single pair of send and receive calls clearly causes less latency than two such pairs of calls.
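To make the eighth-shell communication pattern concrete, the following sketch (our illustration, not GROMACS source) enumerates the seven zones a home cell receives when a single pulse per dimension suffices: every neighboring cell whose offset lies in one octant.

    /* Sketch: enumerate the seven zones a home cell receives in a basic
     * single-pulse eighth-shell domain decomposition.  Offsets are
     * relative cell coordinates; the home cell itself is (0,0,0). */
    #include <stdio.h>

    int main(void)
    {
        int dx, dy, dz, zone = 0;

        /* With one pulse per dimension, data flows in from one octant
         * only, so the communicated zones are all non-zero offsets in
         * {0,1}^3: 7 zones plus the home zone, instead of the 13
         * neighbor zones a half-shell method would need. */
        for (dx = 0; dx <= 1; dx++)
            for (dy = 0; dy <= 1; dy++)
                for (dz = 0; dz <= 1; dz++)
                {
                    if (dx == 0 && dy == 0 && dz == 0)
                        continue; /* home zone, no communication needed */
                    printf("zone %d: received from cell offset (+%d,+%d,+%d)\n",
                           ++zone, dx, dy, dz);
                }
        return 0;
    }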
The communication of the coordinates and charge group indices can be performed efficiently by 'pulsing' the information in one direction simultaneously for all cells one or more times. This needs to be repeated for each dimension. The number of pulses in a dimension is given by the cutoff length in that direction divided by the minimum cell size, rounded up; in most cases, this will be one or two. Consider a 3D domain decomposition where we decompose in the order x, y, z, meaning that the x boundaries are aligned, the y boundaries are staggered along the x direction, and the z boundaries are staggered along the x and y directions. Each processor first sends the zone that its neighboring cell in -z needs to this cell; this is done once for each pulse in z. Now each processor can send the zone its neighboring cell in -y needs, including the parts of zones it received from -z, and finally the same procedure is applied for the pulses in -x.
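The pulsing procedure maps naturally onto paired send/receive calls. Below is a minimal MPI sketch under assumed conditions: a periodic Cartesian communicator, flat coordinate buffers, and hypothetical cut-off and cell-size values; the zone packing and unpacking of a real implementation is only indicated by comments, and none of this is the GROMACS code itself.

    /* Sketch of pulsed coordinate communication, dimension by dimension. */
    #include <math.h>
    #include <mpi.h>

    #define MAXBUF 4096

    /* Pulses needed in one dimension: the cut-off divided by the
     * smallest cell size in that dimension, rounded up. */
    static int num_pulses(double cutoff, double min_cell_size)
    {
        return (int)ceil(cutoff / min_cell_size);
    }

    /* Communicate in the order z, then y, then x.  In each pulse every
     * rank sends what its -dim neighbor needs and receives the matching
     * zone from its +dim neighbor, so data received in earlier
     * dimensions is forwarded along automatically. */
    static void pulse_coordinates(MPI_Comm cart, double cutoff,
                                  const double min_cell[3])
    {
        double sendbuf[MAXBUF], recvbuf[MAXBUF];
        int dim, pulse, rank_recv_from, rank_send_to;
        int nsend = 0; /* a real code fills this with the zone size */

        for (dim = 2; dim >= 0; dim--) /* z, y, x */
        {
            int np = num_pulses(cutoff, min_cell[dim]);

            /* disp = -1: destination is the -dim neighbor, source the
             * +dim neighbor on the Cartesian grid. */
            MPI_Cart_shift(cart, dim, -1, &rank_recv_from, &rank_send_to);

            for (pulse = 0; pulse < np; pulse++)
            {
                /* ... pack the zone the -dim neighbor needs into sendbuf,
                 *     including parts received in previous dimensions ... */
                MPI_Sendrecv(sendbuf, nsend, MPI_DOUBLE, rank_send_to, 0,
                             recvbuf, MAXBUF, MPI_DOUBLE, rank_recv_from, 0,
                             cart, MPI_STATUS_IGNORE);
                /* ... unpack recvbuf into the local zone storage ... */
            }
        }
    }

    int main(int argc, char **argv)
    {
        MPI_Comm cart;
        int dims[3] = {0, 0, 0}, periods[3] = {1, 1, 1}, size;
        const double min_cell[3] = {1.5, 1.5, 1.5}; /* nm, hypothetical */

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Dims_create(size, 3, dims);
        MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, 0, &cart);

        pulse_coordinates(cart, 1.2 /* cut-off, nm, hypothetical */, min_cell);

        MPI_Comm_free(&cart);
        MPI_Finalize();
        return 0;
    }

With a 1.2 nm cut-off and 1.5 nm minimum cells, num_pulses returns one, so each dimension needs a single send/receive pair per step, which is the common case noted above.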