Horovod: fast and easy distributed deep learning in TensorFlow

21 Feb 2018 | Alexander Sergeev, Mike Del Balso
Horovod is an open-source library that enables fast and easy distributed deep learning in TensorFlow. It addresses the two main obstacles to scaling deep learning models: inefficient inter-GPU communication and the extensive code modifications that distributed training has traditionally required. Horovod uses ring-allreduce for efficient inter-GPU communication and needs only a few lines of changes to an existing TensorFlow model, making distributed training both faster and easier.

At Uber, the need for distributed training became apparent as models grew larger and consumed more data. Traditional distributed TensorFlow introduced complexity and communication overhead that limited scalability. Horovod overcomes these issues with the ring-allreduce algorithm, in which workers average gradients by exchanging chunks of them around a logical ring, with no parameter server involved. This approach is both more efficient and easier to implement than traditional parameter-server methods; a conceptual sketch follows.
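To make the communication pattern concrete, below is a minimal single-process simulation of ring-allreduce in NumPy. It illustrates the algorithm only; it is not Horovod's implementation, which exchanges chunks over the network (via MPI or NCCL) and overlaps communication with backpropagation. All names in the sketch are our own.

```python
import numpy as np

def ring_allreduce(worker_grads):
    """Average one gradient vector per worker via ring-allreduce.

    Single-process simulation for illustration only: real
    implementations send these chunks between machines and overlap
    the transfers with computation.
    """
    n = len(worker_grads)
    # Each worker splits its buffer into n chunks.
    chunks = [np.array_split(g.astype(float), n) for g in worker_grads]

    # Scatter-reduce: after n-1 steps, worker w holds the complete
    # sum for chunk (w + 1) % n.
    for step in range(n - 1):
        # Snapshot outgoing chunks so every "send" in this step uses
        # pre-step values, as it would on real, simultaneous hardware.
        sends = [chunks[w][(w - step) % n].copy() for w in range(n)]
        for w in range(n):
            chunks[(w + 1) % n][(w - step) % n] += sends[w]

    # Allgather: circulate each completed chunk around the ring.
    for step in range(n - 1):
        sends = [chunks[w][(w + 1 - step) % n].copy() for w in range(n)]
        for w in range(n):
            chunks[(w + 1) % n][(w + 1 - step) % n] = sends[w]

    # Every worker now holds the full sum; divide for the average.
    return [np.concatenate(c) / n for c in chunks]

# Three simulated workers; all end up with the same averaged gradient.
grads = [np.full(6, 1.0), np.full(6, 2.0), np.full(6, 3.0)]
print(ring_allreduce(grads)[0])  # -> [2. 2. 2. 2. 2. 2.]
```

A useful property of this pattern is that each worker sends and receives roughly 2(n-1)/n times the size of its gradient buffer in total, independent of the number of workers, which is why the ring scales well without a central bottleneck.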
Horovod is implemented as a standalone Python package, so users can take advantage of ring-allreduce without upgrading their version of TensorFlow. It supports models that fit on a single server with multiple GPUs and improves performance for models with many layers. Because its API requires only minimal changes to an existing TensorFlow program, converting a single-GPU training script into a distributed one takes just a few lines, as the sketch below shows.
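The sketch below is modeled on the canonical usage that Horovod's documentation showed for the TensorFlow 1.x API current when this post was written; the toy regression model and the step limit are stand-ins of our own, not part of Horovod.

```python
import tensorflow as tf
import horovod.tensorflow as hvd

# Initialize Horovod's communication layer (one process per GPU).
hvd.init()

# Pin this process to its own GPU, indexed by local rank.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# A toy regression model stands in for a real network here.
features = tf.random_normal([32, 10])
targets = tf.random_normal([32, 1])
weights = tf.get_variable("weights", [10, 1])
loss = tf.reduce_mean(tf.square(tf.matmul(features, weights) - targets))

# Scale the learning rate by the number of workers, since the
# effective batch size grows with hvd.size().
opt = tf.train.GradientDescentOptimizer(0.01 * hvd.size())

# Wrap the optimizer: gradients are now averaged across workers
# with ring-allreduce before each update is applied.
opt = hvd.DistributedOptimizer(opt)

global_step = tf.train.get_or_create_global_step()
train_op = opt.minimize(loss, global_step=global_step)

hooks = [
    # Start every worker from rank 0's initial weights.
    hvd.BroadcastGlobalVariablesHook(0),
    # Step limit is our own addition to keep the example finite.
    tf.train.StopAtStepHook(last_step=100),
]

with tf.train.MonitoredTrainingSession(config=config, hooks=hooks) as sess:
    while not sess.should_stop():
        sess.run(train_op)
```

A script like this runs with one process per GPU and is typically launched through MPI, for example `mpirun -np 4 python train.py` for four GPUs on one machine.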
Horovod also includes Tensor Fusion, which fuses small tensors into a single buffer before performing ring-allreduce, improving performance for models with many layers (a conceptual sketch appears at the end of this post). It likewise includes Horovod Timeline, a profiling tool that records the state of every worker over the course of training, helping users spot bugs and performance issues.

Horovod has been tested on models including Inception V3, ResNet-101, and VGG-16, and shows significant improvements in scaling efficiency. It performs well on both plain TCP and RDMA-capable networks, with RDMA providing additional benefit for models with a large number of parameters.

Horovod is available under the Apache 2.0 license and is hosted on GitHub. The authors hope that it will help others adopt distributed training and better leverage their compute resources for deep learning.
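Finally, the Tensor Fusion idea mentioned above can be sketched conceptually. Horovod's actual implementation fuses tensors inside a fixed-size buffer in its C++ core; the NumPy sketch below, with names of our own invention, shows only the pack, reduce, and unpack steps.

```python
import numpy as np

def fused_allreduce(tensors, allreduce):
    """Reduce many small tensors with a single allreduce call.

    Packing amortizes the per-message latency of the network and of
    launching reduction kernels, which dominates when a model has
    many small layers. `allreduce` is any function that reduces a
    flat buffer across workers (e.g. a ring-allreduce).
    """
    # Pack: copy every tensor into one contiguous fusion buffer.
    flat = np.concatenate([t.ravel() for t in tensors])
    # One collective operation instead of len(tensors) of them.
    reduced = allreduce(flat)
    # Unpack: slice the reduced buffer back into the original shapes.
    out, offset = [], 0
    for t in tensors:
        out.append(reduced[offset:offset + t.size].reshape(t.shape))
        offset += t.size
    return out

# Example: stand in for the cross-worker reduction with a doubling,
# as if two workers contributed identical gradients.
grads = [np.ones((3, 3)), np.ones(5), np.ones((2, 4))]
fused = fused_allreduce(grads, lambda buf: buf * 2.0)
print([g.shape for g in fused])  # shapes preserved: [(3, 3), (5,), (2, 4)]
```

Batching this way replaces many latency-bound collective operations with a single bandwidth-bound one, which is why it helps most on models with many small layers.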