2 Apr 2024 | David Raposo, Sam Ritter, Blake Richards, Timothy Lillicrap, Peter Conway Humphreys and Adam Santoro
This paper introduces Mixture-of-Depths (MoD), a method for dynamically allocating compute in transformer-based language models. Unlike standard transformers, which spend the same amount of compute on every token, MoD lets the model decide which tokens in a sequence receive compute at each layer. A total compute budget is enforced by capping the number of tokens that can participate in a given layer's self-attention and MLP computations; the remaining tokens bypass the block. Tokens are chosen with a top-k routing mechanism: a router assigns a scalar weight to each token, and the tokens with the top-k weights are processed by the block. Because k is fixed, the computation graph is static with known tensor sizes, which keeps the method efficient and compatible with current hardware constraints.

Models trained with MoD match baseline performance for equivalent FLOPs and wall-clock training time, yet require only a fraction of the FLOPs per forward pass and can be up to 50% faster during post-training sampling. This exposes a tunable trade-off between performance and speed: models that route fewer tokens per layer use fewer FLOPs per forward pass and are faster to step. MoD is effective in both training and auto-regressive sampling, with models achieving comparable or better performance than isoFLOP-optimal baselines at lower per-pass cost.
MoD can also be integrated with Mixture-of-Experts (MoE) models to create MoDE models, which further improve performance. Overall, the results demonstrate that MoD transformers achieve better performance with fewer FLOPs per forward pass, making them more efficient and better suited for deployment on hardware with limited compute resources.
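The routing step is simple enough to sketch. The snippet below is a minimal, illustrative PyTorch implementation of a single MoD-style block, not the authors' code: the names `MoDBlock`, `inner_block`, `router`, and `capacity` are my own, and the inner block is a stand-in for a full attention + MLP layer.

```python
import torch
import torch.nn as nn


class MoDBlock(nn.Module):
    """Illustrative Mixture-of-Depths wrapper: only the top-k tokens
    (by router weight) pass through the inner block; the rest are
    carried forward unchanged on the residual stream."""

    def __init__(self, d_model: int, inner_block: nn.Module, capacity: int):
        super().__init__()
        self.inner_block = inner_block       # e.g. attention + MLP sub-block
        self.router = nn.Linear(d_model, 1)  # scalar routing weight per token
        self.capacity = capacity             # k: tokens processed per sequence

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        b, s, d = x.shape
        weights = self.router(x).squeeze(-1)               # (b, s)
        k = min(self.capacity, s)
        topk_w, topk_idx = torch.topk(weights, k, dim=-1)  # (b, k)

        # Gather the selected tokens into a smaller tensor.
        gather_idx = topk_idx.unsqueeze(-1).expand(-1, -1, d)  # (b, k, d)
        selected = torch.gather(x, 1, gather_idx)              # (b, k, d)

        # Run the expensive computation only on the selected tokens.
        # Scaling by the router weight keeps the router on the gradient
        # path; the exact normalization is a design choice in this sketch.
        delta = self.inner_block(selected) * topk_w.unsqueeze(-1)

        # Add results back at the selected positions; unselected tokens
        # keep their original residual value.
        return x.scatter_add(1, gather_idx, delta)
```

A sequence of such blocks (typically alternating with ordinary full-capacity blocks) would give the per-layer compute budget described above. Note that top-k over the whole sequence is non-causal, so auto-regressive sampling needs a causal approximation of the routing decision, such as a small auxiliary predictor; this sketch omits that detail.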