Asynchronous Local-SGD Training for Language Modeling


17 Jan 2024 | Bo Liu, Rachita Chhaparia, Arthur Douillard, Satyen Kale, Andrei A. Rusu, Jiajun Shen, Arthur Szlam, Marc'Aurelio Ranzato
This paper presents an empirical study of asynchronous Local-SGD for training language models. The authors investigate how worker hardware heterogeneity, model size, number of workers, and optimizer choices affect learning performance. They find that naive implementations of asynchronous Local-SGD converge more slowly than their synchronous counterparts, despite more frequent parameter updates. A key challenge is the reduced effectiveness of momentum acceleration when worker gradients are stale. To address this, they propose a method that uses a delayed Nesterov momentum update and adjusts each worker's number of local training steps according to its computation speed. Evaluated on models of up to 150M parameters on the C4 dataset, this approach matches synchronous Local-SGD in perplexity per update step and significantly outperforms it in wall-clock time.

The study examines the viability of training large language models asynchronously with Local-SGD, expanding on previous works that alternate steps on subsets of workers or randomly drop certain subsets during synchronous Local-SGD. The main content is structured in three parts: (1) Framework, (2) Optimization Challenge, and (3) Proposed Solutions.

In the Framework section, the authors describe the asynchronous Local-SGD pipeline design, including data shard sampling, learning rate scheduling, and a grace period for model synchronization.

In the Optimization Challenge section, they conduct an empirical study of optimization strategies suitable for asynchronous Local-SGD, covering both worker-side (inner) and server-side (outer) optimization. They uncover a key difficulty in using momentum effectively: although adaptive momentum methods generally accelerate convergence, their benefit shrinks in asynchronous Local-SGD when both the inner and outer optimizers employ momentum.
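To make the inner/outer structure concrete, below is a minimal, illustrative Python sketch of the asynchronous Local-SGD loop the paper studies: each worker copies the current server parameters, runs some local optimizer steps on its data shard, and returns a pseudo-gradient (starting parameters minus final parameters) that the server applies with an outer optimizer as soon as that worker finishes. The toy quadratic loss, the plain SGD inner and outer updates, and all names are assumptions for illustration; the paper's setup follows DiLoCo, which uses AdamW as the inner optimizer and Nesterov momentum as the outer optimizer.

```python
# Illustrative sketch of one asynchronous Local-SGD round trip (not the paper's code).
import numpy as np

def local_sgd_worker(server_params, data_shard, local_steps, inner_lr=0.1):
    """Run `local_steps` of plain SGD on a toy quadratic loss and return the
    pseudo-gradient (starting parameters minus final parameters)."""
    params = server_params.copy()
    for _ in range(local_steps):
        x = data_shard[np.random.randint(len(data_shard))]
        grad = params - x                  # gradient of 0.5 * ||params - x||^2
        params = params - inner_lr * grad
    return server_params - params          # pseudo-gradient sent to the server

def outer_update(server_params, pseudo_grad, outer_lr=0.7):
    """Server-side (outer) step; in asynchronous Local-SGD this is applied as
    soon as any single worker finishes, so pseudo-gradients can be stale."""
    return server_params - outer_lr * pseudo_grad

# Toy run with two workers holding different data shards; in the asynchronous
# setting, each finished worker triggers a server update on its own.
rng = np.random.default_rng(0)
server_params = rng.normal(size=4)
shards = [rng.normal(loc=1.0, size=(32, 4)), rng.normal(loc=-1.0, size=(32, 4))]
for step in range(20):
    worker_id = step % 2
    pg = local_sgd_worker(server_params, shards[worker_id], local_steps=8)
    server_params = outer_update(server_params, pg)
```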
In the Proposed Solutions section, the authors introduce two simple and effective techniques: the Delayed Nesterov momentum update (DN) and Dynamic Local Updates (DyLU). Combined, these techniques largely close the performance gap between synchronous and asynchronous training in language modeling, and the resulting method significantly surpasses DiLoCo in perplexity versus wall-clock time.

The authors also evaluate several existing asynchronous Local-SGD approaches, finding that the Async. Buffer method substantially narrows the gap between synchronous and asynchronous training, though it introduces instability early in training. They conclude that while momentum is beneficial in asynchronous Local-SGD for language modeling, its effect is more pronounced in synchronous settings. The proposed DN+DyLU method is consistently effective across model sizes and outperforms synchronous DiLoCo in wall-clock time.
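The summary does not spell out the DN update rule, so the sketch below shows one plausible reading: the server applies each incoming pseudo-gradient immediately, but refreshes its Nesterov momentum only from the averaged recent pseudo-gradients every few server updates, so that stale contributions do not feed the momentum buffer on every step. The class name, hyperparameters, and exact update form are assumptions, not the paper's algorithm.

```python
# Hedged sketch of a delayed Nesterov-style outer optimizer.
import numpy as np

class DelayedNesterov:
    """Applies pseudo-gradients immediately but refreshes the Nesterov
    momentum only every `delay` server updates (illustrative reading of DN)."""

    def __init__(self, lr=0.7, beta=0.9, delay=4):
        self.lr, self.beta, self.delay = lr, beta, delay
        self.momentum = None   # refreshed only every `delay` server updates
        self.buffer = None     # accumulates pseudo-gradients between refreshes
        self.step_count = 0

    def update(self, params, pseudo_grad):
        if self.momentum is None:
            self.momentum = np.zeros_like(params)
            self.buffer = np.zeros_like(params)
        self.buffer = self.buffer + pseudo_grad
        self.step_count += 1
        if self.step_count % self.delay == 0:
            # Fold the averaged buffered pseudo-gradients into momentum, then
            # take a Nesterov-style step with the refreshed momentum.
            avg = self.buffer / self.delay
            self.momentum = self.beta * self.momentum + avg
            self.buffer = np.zeros_like(params)
            return params - self.lr * (self.beta * self.momentum + avg)
        # Between refreshes: apply the incoming pseudo-gradient without
        # touching the (potentially stale) momentum buffer.
        return params - self.lr * pseudo_grad
```

With delay=1, this sketch reduces to a standard Nesterov outer update of the kind DiLoCo applies synchronously.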
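DyLU is described only as adjusting each worker's number of local steps according to its computation speed. Under that reading, a minimal sketch assigns local step budgets proportional to measured per-worker throughput, so fast and slow workers finish their local phases in roughly the same wall-clock time. The function name and the proportional rule are illustrative assumptions, not the paper's exact recipe.

```python
# Hedged sketch of Dynamic Local Updates (DyLU): scale each worker's local
# step budget by its measured steps-per-second.
def dynamic_local_updates(steps_per_second, base_steps=64):
    """Return a per-worker list of local step counts.

    The fastest worker gets `base_steps`; slower workers get proportionally
    fewer steps (at least one) so their local phases finish at similar times.
    """
    fastest = max(steps_per_second)
    return [max(1, round(base_steps * s / fastest)) for s in steps_per_second]

# Example: the second worker runs at half the speed of the first, so it is
# assigned roughly half as many local steps.
print(dynamic_local_updates([100.0, 50.0, 80.0]))  # -> [64, 32, 51]
```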