March 26, 2012 | Andreas W. Götz, Mark J. Williamson, Dong Xu, Duncan Poole, Scott Le Grand, Ross C. Walker
This paper presents an implementation of generalized Born (GB) implicit solvent all-atom classical molecular dynamics (MD) within the AMBER program package that runs entirely on CUDA-enabled NVIDIA graphics processing units (GPUs). The implementation supports three different precision models: single-precision floating-point arithmetic for force contributions with double-precision accumulation (SPDP), everything in single precision (SPSP), or everything in double precision (DPDP). The SPDP model is recommended as it provides results comparable to the full double-precision DPDP model and the reference double-precision CPU implementation, but at significantly reduced computational cost. The implementation achieves performance for GB simulations on a single desktop that is on par with, and in some cases exceeds, that of traditional supercomputers.
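The difference between the three precision models can be illustrated with a small numerical sketch. This is a toy summation of many small force-like contributions, not AMBER code; the function and values below are invented for illustration only:

```python
import numpy as np

def accumulate(contribs, compute_dtype, accum_dtype):
    """Sum per-pair contributions: each term is rounded to compute_dtype,
    while the running total is kept in accum_dtype. This mimics the
    SPSP / SPDP / DPDP precision models in spirit."""
    total = accum_dtype(0.0)
    for c in contribs.astype(compute_dtype):
        total = accum_dtype(total + c)
    return float(total)

rng = np.random.default_rng(0)
# Many small positive contributions, as when summing forces over atom pairs.
contribs = rng.uniform(0.5e-3, 1.5e-3, size=50_000)

dpdp = accumulate(contribs, np.float64, np.float64)  # full double precision
spdp = accumulate(contribs, np.float32, np.float64)  # single forces, double sum
spsp = accumulate(contribs, np.float32, np.float32)  # everything single
```

Because the dominant error of a long single-precision sum comes from the accumulation itself, `spdp` stays far closer to the `dpdp` reference than `spsp` does, which is the rationale for recommending the SPDP model.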
The paper discusses the challenges of GPU programming, including vectorization, the memory model, GPU-to-CPU and GPU-to-GPU communication, mathematical precision, and the programming model. It also describes the AMBER implicit solvent GPU implementation, which includes features such as support for all GB models currently implemented in AMBER, the analytical linearized Poisson–Boltzmann (ALPB) model, thermostats, constraints, and harmonic restraints. The implementation is designed to be transparent to the user, with performance improvements achieved through efficient use of GPU hardware.
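The GB models referenced above share the pairwise functional form of Still et al., in which the solvation energy is a double sum over atom pairs screened by an effective interaction distance. A minimal illustrative sketch (not the paper's CUDA kernels) is shown below; the effective Born radii are assumed to be precomputed, since obtaining them is the other expensive pairwise step:

```python
import math

def gb_energy(coords, charges, born_radii, eps_in=1.0, eps_out=78.5):
    """Generalized Born polarization energy in the Still et al. functional
    form shared by AMBER's GB variants (illustrative, unit-free sketch)."""
    pref = -0.5 * (1.0 / eps_in - 1.0 / eps_out)
    n = len(charges)
    energy = 0.0
    for i in range(n):          # full double sum: i == j gives the Born
        for j in range(n):      # self-energy, i != j the cross terms
            r2 = sum((a - b) ** 2 for a, b in zip(coords[i], coords[j]))
            rirj = born_radii[i] * born_radii[j]
            # Effective distance interpolating between Coulomb and Born limits.
            f_gb = math.sqrt(r2 + rirj * math.exp(-r2 / (4.0 * rirj)))
            energy += pref * charges[i] * charges[j] / f_gb
    return energy

# Single ion of charge +1 and Born radius 2.0: reduces to the Born formula.
e_ion = gb_energy([(0.0, 0.0, 0.0)], [1.0], [2.0])
```

The O(N²) double loop here is exactly the structure that maps well onto GPU hardware: each pair term is independent, so the work can be tiled across thousands of threads.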
The paper also discusses the technical details of the implementation, including the calculation of nonbonded interactions, the handling of bonded and 1–4 interactions, the use of harmonic restraints, the SHAKE algorithm, the coordinate update, thermostats, and additional optimizations. The implementation is designed to be deterministic, ensuring reproducibility of results, and to be compatible with existing input files and regression tests. The paper concludes with a discussion of the parallel GPU implementation, which is currently written exclusively using MPI, and the challenges of achieving good parallel scaling on CPU clusters. The GPU implementation is fully deterministic for a given number of nodes and GPUs; load balancing occurs between the streaming multiprocessors (SMs) within each GPU rather than between GPUs.
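Determinism in parallel force accumulation is nontrivial because floating-point addition is not associative, so a thread-dependent summation order changes the result bit-for-bit. One general way to restore order-independence, shown here as an illustrative sketch rather than the paper's actual mechanism, is fixed-point integer accumulation:

```python
import math
import random

SCALE = 2 ** 40  # illustrative fixed-point scale factor, not a value from the paper

def fixed_point_sum(contribs):
    """Accumulate force-like contributions as scaled integers. Integer
    addition is associative, so the result is bit-identical regardless
    of the order in which concurrent threads would add the terms."""
    total = 0  # Python ints are arbitrary precision, like a wide accumulator
    for c in contribs:
        total += round(c * SCALE)
    return total / SCALE

random.seed(42)
vals = [random.uniform(-1.0, 1.0) for _ in range(1000)]
a = fixed_point_sum(vals)
random.shuffle(vals)           # a different summation order...
b = fixed_point_sum(vals)      # ...yields exactly the same result
```

Each term is rounded once when converted to fixed point, so the answer still agrees with an exact floating-point sum to within `n * 0.5 / SCALE`; the payoff is exact run-to-run reproducibility, which is what makes regression testing of a massively parallel code practical.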