DataStates-LLM: Lazy Asynchronous Checkpointing for Large Language Models
Avinash Maurya, Robert Underwood, M. Mustafa Rafique, Franck Cappello, and Bogdan Nicolae. 2024. DataStates-LLM: Lazy Asynchronous Checkpointing for Large Language Models. In The 33rd International Symposium on High-Performance Parallel and Distributed Computing (HPDC'24), June 3–7, 2024, Pisa, Italy. ACM, New York, NY, USA, 13 pages. https://doi.org/10.1145/3625549.3658685
Large language models (LLMs) are increasingly adopted across many domains. Training them requires high-performance computing (HPC) infrastructure and massive input datasets. At this scale, unexpected events such as hardware failures, software bugs, or communication issues occur frequently and can derail training, so the model must be checkpointed often enough that it can be rolled back to a stable state and subsequently fine-tuned. Traditional checkpointing, which writes model parameters and optimizer state directly to persistent storage, incurs significant I/O overheads. To address this, DataStates-LLM introduces a lazy asynchronous multi-level approach that leverages the immutability of model and optimizer state shards during training iterations, allowing these shards to be copied in the background without interfering with the training process. The approach is evaluated at scales of up to 180 GPUs across different model sizes, parallelism settings, and checkpointing frequencies, and achieves up to 48× faster checkpointing and 2.2× faster end-to-end training runtime compared with state-of-the-art methods.
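The lazy asynchronous idea can be illustrated with a small PyTorch sketch, assuming CUDA is available. The `LazyCheckpointer` class and its methods below are hypothetical and are not the DataStates-LLM implementation: the checkpoint request returns almost immediately, device-to-host copies run on a side CUDA stream so they can overlap with training compute, and a background thread flushes the host copies to persistent storage.

```python
import threading
import torch


class LazyCheckpointer:
    """Hypothetical helper (not the DataStates-LLM API): snapshot immutable GPU
    shards to pinned host memory on a side CUDA stream, then flush the host
    copies to persistent storage from a background thread."""

    def __init__(self):
        self._copy_stream = torch.cuda.Stream()   # side stream for device->host copies
        self._flush_thread = None

    def checkpoint(self, shards: dict, path: str) -> None:
        # Ensure the last optimizer update of these shards has finished
        # before the side stream starts reading them.
        self._copy_stream.wait_stream(torch.cuda.current_stream())

        host_buffers = {}
        with torch.cuda.stream(self._copy_stream):
            for name, tensor in shards.items():
                pinned = torch.empty(tensor.shape, dtype=tensor.dtype,
                                     device="cpu", pin_memory=True)
                pinned.copy_(tensor, non_blocking=True)  # overlaps with training compute
                host_buffers[name] = pinned

        def _flush():
            # Wait for the device->host copies to land, then write to the slower tier.
            self._copy_stream.synchronize()
            torch.save(host_buffers, path)

        self._flush_thread = threading.Thread(target=_flush, daemon=True)
        self._flush_thread.start()

    def wait(self) -> None:
        """Block until the most recent checkpoint has been durably written."""
        if self._flush_thread is not None:
            self._flush_thread.join()
```

In this sketch, training can proceed while the copies drain, and `wait()` is the durability barrier a real system would place before declaring the checkpoint complete; it does not enforce that each shard is copied before it is next mutated, which is the coordination the lazy scheme must provide.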
LLMs are trained using a combination of data, pipeline, and tensor parallelism, which shards model parameters and optimizer state across many GPUs. Synchronous checkpointing blocks training until the model state has been written to stable storage, leading to high runtime overheads. Asynchronous checkpointing, such as that offered by DeepSpeed, reduces these overheads by copying model state to fast memory tiers and flushing it to slower tiers in the background, but it is constrained by limited GPU memory and by contention for the PCIe links between GPUs and host memory. DataStates-LLM addresses these challenges by leveraging the immutability of model and optimizer state shards during training iterations, copying the shards asynchronously without blocking the training process. The approach is implemented as a modular extension to the DeepSpeed runtime and achieves significantly higher checkpointing throughput and faster end-to-end training runtime than other state-of-the-art methods.
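The key ordering constraint behind the lazy scheme is that each shard's device-to-host copy must finish before the optimizer mutates that shard again; the shards stay unchanged for the rest of the iteration, so the copies can overlap with computation over the PCIe link. Below is a minimal sketch of that per-shard synchronization in PyTorch; the function names and hook points are hypothetical and are not DeepSpeed's or DataStates-LLM's actual interfaces.

```python
import torch

# Dedicated stream and per-shard completion events for checkpoint copies
# (illustrative names only; not DeepSpeed or DataStates-LLM identifiers).
copy_stream = torch.cuda.Stream()
pending_copies: dict[str, torch.cuda.Event] = {}


def snapshot_shard(name: str, shard: torch.Tensor, host_buf: torch.Tensor) -> None:
    """Start copying an immutable shard to pinned host memory in the background."""
    copy_stream.wait_stream(torch.cuda.current_stream())  # shard values are final
    with torch.cuda.stream(copy_stream):
        host_buf.copy_(shard, non_blocking=True)
        done = torch.cuda.Event()
        done.record()            # records on copy_stream (current in this context)
    pending_copies[name] = done


def before_update(name: str) -> None:
    """Call right before the optimizer mutates a shard: stall only if that
    shard's checkpoint copy has not completed yet (the lazy synchronization point)."""
    done = pending_copies.pop(name, None)
    if done is not None:
        done.wait()              # current stream waits on the copy; no global barrier
```

A complete multi-level scheme would additionally drain the host buffers to slower tiers such as local storage or a parallel file system in the background, as in the flush thread of the earlier sketch.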