Asynchronous Local-SGD Training for Language Modeling


17 Jan 2024 | Bo Liu, Rachita Chhaparia, Arthur Douillard, Satyen Kale, Andrei A. Rusu, Jiajun Shen, Arthur Szlam, Marc'Aurelio Ranzato
This paper presents an empirical study of asynchronous Local-SGD for training language models. The authors investigate how worker hardware heterogeneity, model size, number of workers, and optimizer choices affect learning performance. They find that naive implementations of asynchronous Local-SGD converge more slowly than their synchronous counterparts, despite more frequent parameter updates. A key challenge is the reduced effectiveness of momentum acceleration when worker gradients are stale. To address this, they propose a method that uses a delayed Nesterov momentum update and adjusts each worker's number of local training steps according to its computation speed. Evaluated on models of up to 150M parameters on the C4 dataset, this approach matches synchronous Local-SGD in perplexity per update step and significantly outperforms it in wall-clock time.

The study examines the viability of training large language models asynchronously with Local-SGD, expanding on previous works that alternate steps on subsets of workers or randomly drop certain subsets during synchronous Local-SGD. The main content is structured in three parts: (1) Framework, (2) Optimization Challenge, and (3) Proposed Solutions.

In the Framework section, the authors describe the asynchronous Local-SGD pipeline design, including data shard sampling, learning rate scheduling, and a grace period for model synchronization.

In the Optimization Challenge section, they conduct an empirical study of optimization strategies suitable for asynchronous Local-SGD, covering both worker-side (inner) and server-side (outer) optimization. They uncover a key difficulty in using momentum effectively: although adaptive momentum methods generally accelerate convergence, their benefit shrinks in asynchronous Local-SGD when both the inner and outer optimizers employ momentum.
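To make the inner/outer structure concrete, below is a minimal, illustrative Python sketch of the asynchronous Local-SGD loop the paper studies: each worker copies the current server parameters, runs some local optimizer steps on its data shard, and returns a pseudo-gradient (starting parameters minus final parameters) that the server applies with an outer optimizer as soon as that worker finishes. The toy quadratic loss, the plain SGD inner and outer updates, and all names are assumptions for illustration; the paper's setup follows DiLoCo, which uses AdamW as the inner optimizer and Nesterov momentum as the outer optimizer.

```python
# Illustrative sketch of one asynchronous Local-SGD round trip (not the paper's code).
import numpy as np

def local_sgd_worker(server_params, data_shard, local_steps, inner_lr=0.1):
    """Run `local_steps` of plain SGD on a toy quadratic loss and return the
    pseudo-gradient (starting parameters minus final parameters)."""
    params = server_params.copy()
    for _ in range(local_steps):
        x = data_shard[np.random.randint(len(data_shard))]
        grad = params - x                  # gradient of 0.5 * ||params - x||^2
        params = params - inner_lr * grad
    return server_params - params          # pseudo-gradient sent to the server

def outer_update(server_params, pseudo_grad, outer_lr=0.7):
    """Server-side (outer) step; in asynchronous Local-SGD this is applied as
    soon as any single worker finishes, so pseudo-gradients can be stale."""
    return server_params - outer_lr * pseudo_grad

# Toy run with two workers holding different data shards; in the asynchronous
# setting, each finished worker triggers a server update on its own.
rng = np.random.default_rng(0)
server_params = rng.normal(size=4)
shards = [rng.normal(loc=1.0, size=(32, 4)), rng.normal(loc=-1.0, size=(32, 4))]
for step in range(20):
    worker_id = step % 2
    pg = local_sgd_worker(server_params, shards[worker_id], local_steps=8)
    server_params = outer_update(server_params, pg)
```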
In the Proposed Solutions section, the authors introduce two simple and effective techniques: the Delayed Nesterov momentum update (DN) and Dynamic Local Updates (DyLU). Combined, these techniques largely close the performance gap between synchronous and asynchronous training in language modeling, and the resulting method significantly surpasses DiLoCo in perplexity versus wall-clock time.

The authors also evaluate several existing asynchronous Local-SGD approaches, finding that the Async. Buffer method substantially narrows the gap between synchronous and asynchronous training, though it introduces instability early in training. They conclude that while momentum is beneficial in asynchronous Local-SGD for language modeling, its effect is more pronounced in synchronous settings. The proposed DN+DyLU method is consistently effective across model sizes and outperforms synchronous DiLoCo in wall-clock time.
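The summary does not spell out the DN update rule, so the sketch below shows one plausible reading: the server applies each incoming pseudo-gradient immediately, but refreshes its Nesterov momentum only from the averaged recent pseudo-gradients every few server updates, so that stale contributions do not feed the momentum buffer on every step. The class name, hyperparameters, and exact update form are assumptions, not the paper's algorithm.

```python
# Hedged sketch of a delayed Nesterov-style outer optimizer.
import numpy as np

class DelayedNesterov:
    """Applies pseudo-gradients immediately but refreshes the Nesterov
    momentum only every `delay` server updates (illustrative reading of DN)."""

    def __init__(self, lr=0.7, beta=0.9, delay=4):
        self.lr, self.beta, self.delay = lr, beta, delay
        self.momentum = None   # refreshed only every `delay` server updates
        self.buffer = None     # accumulates pseudo-gradients between refreshes
        self.step_count = 0

    def update(self, params, pseudo_grad):
        if self.momentum is None:
            self.momentum = np.zeros_like(params)
            self.buffer = np.zeros_like(params)
        self.buffer = self.buffer + pseudo_grad
        self.step_count += 1
        if self.step_count % self.delay == 0:
            # Fold the averaged buffered pseudo-gradients into momentum, then
            # take a Nesterov-style step with the refreshed momentum.
            avg = self.buffer / self.delay
            self.momentum = self.beta * self.momentum + avg
            self.buffer = np.zeros_like(params)
            return params - self.lr * (self.beta * self.momentum + avg)
        # Between refreshes: apply the incoming pseudo-gradient without
        # touching the (potentially stale) momentum buffer.
        return params - self.lr * pseudo_grad
```

With delay=1, this sketch reduces to a standard Nesterov outer update of the kind DiLoCo applies synchronously.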
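DyLU is described only as adjusting each worker's number of local steps according to its computation speed. Under that reading, a minimal sketch assigns local step budgets proportional to measured per-worker throughput, so fast and slow workers finish their local phases in roughly the same wall-clock time. The function name and the proportional rule are illustrative assumptions, not the paper's exact recipe.

```python
# Hedged sketch of Dynamic Local Updates (DyLU): scale each worker's local
# step budget by its measured steps-per-second.
def dynamic_local_updates(steps_per_second, base_steps=64):
    """Return a per-worker list of local step counts.

    The fastest worker gets `base_steps`; slower workers get proportionally
    fewer steps (at least one) so their local phases finish at similar times.
    """
    fastest = max(steps_per_second)
    return [max(1, round(base_steps * s / fastest)) for s in steps_per_second]

# Example: the second worker runs at half the speed of the first, so it is
# assigned roughly half as many local steps.
print(dynamic_local_updates([100.0, 50.0, 80.0]))  # -> [64, 32, 51]
```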