DataStates-LLM: Lazy Asynchronous Checkpointing for Large Language Models
Avinash Maurya, Robert Underwood, M. Mustafa Rafique, Franck Cappello, and Bogdan Nicolae. 2024. DataStates-LLM: Lazy Asynchronous Checkpointing for Large Language Models. In The 33rd International Symposium on High-Performance Parallel and Distributed Computing (HPDC'24), June 3–7, 2024, Pisa, Italy. ACM, New York, NY, USA, 13 pages. https://doi.org/10.1145/3625549.3658685
Large language models (LLMs) are increasingly adopted across many domains. Training them requires high-performance computing (HPC) infrastructure and massive input datasets. At this scale, unexpected events such as hardware failures, software bugs, or communication issues occur frequently and can derail training, so the model must be checkpointed often enough that it can be rolled back to a stable state and subsequently fine-tuned. Traditional checkpointing, which writes model parameters and optimizer state directly to persistent storage, incurs significant I/O overheads. To address this, DataStates-LLM introduces a lazy asynchronous multi-level approach that leverages the immutability of model and optimizer state shards during training iterations, allowing these shards to be copied in the background without interfering with the training process. The approach is evaluated at scales of up to 180 GPUs across different model sizes, parallelism settings, and checkpointing frequencies, and achieves up to 48× faster checkpointing and 2.2× faster end-to-end training runtime compared with state-of-the-art methods.
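The lazy asynchronous idea can be illustrated with a small PyTorch sketch, assuming CUDA is available. The `LazyCheckpointer` class and its methods below are hypothetical and are not the DataStates-LLM implementation: the checkpoint request returns almost immediately, device-to-host copies run on a side CUDA stream so they can overlap with training compute, and a background thread flushes the host copies to persistent storage.

```python
import threading
import torch


class LazyCheckpointer:
    """Hypothetical helper (not the DataStates-LLM API): snapshot immutable GPU
    shards to pinned host memory on a side CUDA stream, then flush the host
    copies to persistent storage from a background thread."""

    def __init__(self):
        self._copy_stream = torch.cuda.Stream()   # side stream for device->host copies
        self._flush_thread = None

    def checkpoint(self, shards: dict, path: str) -> None:
        # Ensure the last optimizer update of these shards has finished
        # before the side stream starts reading them.
        self._copy_stream.wait_stream(torch.cuda.current_stream())

        host_buffers = {}
        with torch.cuda.stream(self._copy_stream):
            for name, tensor in shards.items():
                pinned = torch.empty(tensor.shape, dtype=tensor.dtype,
                                     device="cpu", pin_memory=True)
                pinned.copy_(tensor, non_blocking=True)  # overlaps with training compute
                host_buffers[name] = pinned

        def _flush():
            # Wait for the device->host copies to land, then write to the slower tier.
            self._copy_stream.synchronize()
            torch.save(host_buffers, path)

        self._flush_thread = threading.Thread(target=_flush, daemon=True)
        self._flush_thread.start()

    def wait(self) -> None:
        """Block until the most recent checkpoint has been durably written."""
        if self._flush_thread is not None:
            self._flush_thread.join()
```

In this sketch, training can proceed while the copies drain, and `wait()` is the durability barrier a real system would place before declaring the checkpoint complete; it does not enforce that each shard is copied before it is next mutated, which is the coordination the lazy scheme must provide.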
LLMs are trained using a combination of data, pipeline, and tensor parallelism, which shards model parameters and optimizer state across many GPUs. Synchronous checkpointing blocks training until the model state has been written to stable storage, leading to high runtime overheads. Asynchronous checkpointing, such as that offered by DeepSpeed, reduces these overheads by copying model state to fast memory tiers and flushing it to slower tiers in the background, but it is constrained by limited GPU memory and by contention for the PCIe links between GPUs and host memory. DataStates-LLM addresses these challenges by leveraging the immutability of model and optimizer state shards during training iterations, copying the shards asynchronously without blocking the training process. The approach is implemented as a modular extension to the DeepSpeed runtime and achieves significantly higher checkpointing throughput and faster end-to-end training runtime than other state-of-the-art methods.
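The key ordering constraint behind the lazy scheme is that each shard's device-to-host copy must finish before the optimizer mutates that shard again; the shards stay unchanged for the rest of the iteration, so the copies can overlap with computation over the PCIe link. Below is a minimal sketch of that per-shard synchronization in PyTorch; the function names and hook points are hypothetical and are not DeepSpeed's or DataStates-LLM's actual interfaces.

```python
import torch

# Dedicated stream and per-shard completion events for checkpoint copies
# (illustrative names only; not DeepSpeed or DataStates-LLM identifiers).
copy_stream = torch.cuda.Stream()
pending_copies: dict[str, torch.cuda.Event] = {}


def snapshot_shard(name: str, shard: torch.Tensor, host_buf: torch.Tensor) -> None:
    """Start copying an immutable shard to pinned host memory in the background."""
    copy_stream.wait_stream(torch.cuda.current_stream())  # shard values are final
    with torch.cuda.stream(copy_stream):
        host_buf.copy_(shard, non_blocking=True)
        done = torch.cuda.Event()
        done.record()            # records on copy_stream (current in this context)
    pending_copies[name] = done


def before_update(name: str) -> None:
    """Call right before the optimizer mutates a shard: stall only if that
    shard's checkpoint copy has not completed yet (the lazy synchronization point)."""
    done = pending_copies.pop(name, None)
    if done is not None:
        done.wait()              # current stream waits on the copy; no global barrier
```

A complete multi-level scheme would additionally drain the host buffers to slower tiers such as local storage or a parallel file system in the background, as in the flush thread of the earlier sketch.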