Horovod: fast and easy distributed deep learning in TensorFlow

21 Feb 2018 | Alexander Sergeev, Mike Del Balso
Horovod is an open-source library that enables fast and easy distributed deep learning in TensorFlow. It addresses the two main obstacles to scaling deep learning models: inefficient inter-GPU communication and the extensive code modifications that distributed training has traditionally required. Horovod uses ring-allreduce for efficient inter-GPU communication and needs only a few lines of changes to an existing TensorFlow model, making distributed training both faster and easier.

At Uber, the need for distributed training became apparent as models grew larger and consumed more data. Traditional distributed TensorFlow introduced complexity and communication overhead that limited scalability. Horovod overcomes these issues with the ring-allreduce algorithm, in which workers average gradients by exchanging chunks of them around a logical ring, with no parameter server involved. This approach is both more efficient and easier to implement than traditional parameter-server methods; a conceptual sketch follows.
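To make the communication pattern concrete, below is a minimal single-process simulation of ring-allreduce in NumPy. It illustrates the algorithm only; it is not Horovod's implementation, which exchanges chunks over the network (via MPI or NCCL) and overlaps communication with backpropagation. All names in the sketch are our own.

```python
import numpy as np

def ring_allreduce(worker_grads):
    """Average one gradient vector per worker via ring-allreduce.

    Single-process simulation for illustration only: real
    implementations send these chunks between machines and overlap
    the transfers with computation.
    """
    n = len(worker_grads)
    # Each worker splits its buffer into n chunks.
    chunks = [np.array_split(g.astype(float), n) for g in worker_grads]

    # Scatter-reduce: after n-1 steps, worker w holds the complete
    # sum for chunk (w + 1) % n.
    for step in range(n - 1):
        # Snapshot outgoing chunks so every "send" in this step uses
        # pre-step values, as it would on real, simultaneous hardware.
        sends = [chunks[w][(w - step) % n].copy() for w in range(n)]
        for w in range(n):
            chunks[(w + 1) % n][(w - step) % n] += sends[w]

    # Allgather: circulate each completed chunk around the ring.
    for step in range(n - 1):
        sends = [chunks[w][(w + 1 - step) % n].copy() for w in range(n)]
        for w in range(n):
            chunks[(w + 1) % n][(w + 1 - step) % n] = sends[w]

    # Every worker now holds the full sum; divide for the average.
    return [np.concatenate(c) / n for c in chunks]

# Three simulated workers; all end up with the same averaged gradient.
grads = [np.full(6, 1.0), np.full(6, 2.0), np.full(6, 3.0)]
print(ring_allreduce(grads)[0])  # -> [2. 2. 2. 2. 2. 2.]
```

A useful property of this pattern is that each worker sends and receives roughly 2(n-1)/n times the size of its gradient buffer in total, independent of the number of workers, which is why the ring scales well without a central bottleneck.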
Horovod is implemented as a standalone Python package, so users can take advantage of ring-allreduce without upgrading their version of TensorFlow. It supports models that fit on a single server with multiple GPUs and improves performance for models with many layers. Because its API requires only minimal changes to an existing TensorFlow program, converting a single-GPU training script into a distributed one takes just a few lines, as the sketch below shows.
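The sketch below is modeled on the canonical usage that Horovod's documentation showed for the TensorFlow 1.x API current when this post was written; the toy regression model and the step limit are stand-ins of our own, not part of Horovod.

```python
import tensorflow as tf
import horovod.tensorflow as hvd

# Initialize Horovod's communication layer (one process per GPU).
hvd.init()

# Pin this process to its own GPU, indexed by local rank.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# A toy regression model stands in for a real network here.
features = tf.random_normal([32, 10])
targets = tf.random_normal([32, 1])
weights = tf.get_variable("weights", [10, 1])
loss = tf.reduce_mean(tf.square(tf.matmul(features, weights) - targets))

# Scale the learning rate by the number of workers, since the
# effective batch size grows with hvd.size().
opt = tf.train.GradientDescentOptimizer(0.01 * hvd.size())

# Wrap the optimizer: gradients are now averaged across workers
# with ring-allreduce before each update is applied.
opt = hvd.DistributedOptimizer(opt)

global_step = tf.train.get_or_create_global_step()
train_op = opt.minimize(loss, global_step=global_step)

hooks = [
    # Start every worker from rank 0's initial weights.
    hvd.BroadcastGlobalVariablesHook(0),
    # Step limit is our own addition to keep the example finite.
    tf.train.StopAtStepHook(last_step=100),
]

with tf.train.MonitoredTrainingSession(config=config, hooks=hooks) as sess:
    while not sess.should_stop():
        sess.run(train_op)
```

A script like this runs with one process per GPU and is typically launched through MPI, for example `mpirun -np 4 python train.py` for four GPUs on one machine.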
Horovod also includes Tensor Fusion, which fuses small tensors into a single buffer before performing ring-allreduce, improving performance for models with many layers (a conceptual sketch appears at the end of this post). It likewise includes Horovod Timeline, a profiling tool that records the state of every worker over the course of training, helping users spot bugs and performance issues.

Horovod has been tested on models including Inception V3, ResNet-101, and VGG-16, and shows significant improvements in scaling efficiency. It performs well on both plain TCP and RDMA-capable networks, with RDMA providing additional benefit for models with a large number of parameters.

Horovod is available under the Apache 2.0 license and is hosted on GitHub. The authors hope that it will help others adopt distributed training and better leverage their compute resources for deep learning.
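Finally, the Tensor Fusion idea mentioned above can be sketched conceptually. Horovod's actual implementation fuses tensors inside a fixed-size buffer in its C++ core; the NumPy sketch below, with names of our own invention, shows only the pack, reduce, and unpack steps.

```python
import numpy as np

def fused_allreduce(tensors, allreduce):
    """Reduce many small tensors with a single allreduce call.

    Packing amortizes the per-message latency of the network and of
    launching reduction kernels, which dominates when a model has
    many small layers. `allreduce` is any function that reduces a
    flat buffer across workers (e.g. a ring-allreduce).
    """
    # Pack: copy every tensor into one contiguous fusion buffer.
    flat = np.concatenate([t.ravel() for t in tensors])
    # One collective operation instead of len(tensors) of them.
    reduced = allreduce(flat)
    # Unpack: slice the reduced buffer back into the original shapes.
    out, offset = [], 0
    for t in tensors:
        out.append(reduced[offset:offset + t.size].reshape(t.shape))
        offset += t.size
    return out

# Example: stand in for the cross-worker reduction with a doubling,
# as if two workers contributed identical gradients.
grads = [np.ones((3, 3)), np.ones(5), np.ones((2, 4))]
fused = fused_allreduce(grads, lambda buf: buf * 2.0)
print([g.shape for g in fused])  # shapes preserved: [(3, 3), (5,), (2, 4)]
```

Batching this way replaces many latency-bound collective operations with a single bandwidth-bound one, which is why it helps most on models with many small layers.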