2 Apr 2024 | David Raposo, Sam Ritter, Blake Richards, Timothy Lillicrap, Peter Conway Humphreys and Adam Santoro
The paper introduces Mixture-of-Depths (MoD), a method for dynamically allocating computational resources in transformer-based language models. Unlike traditional transformers, which spend FLOPs uniformly across all positions in the input sequence, MoD lets the model allocate FLOPs to specific positions, improving efficiency. The method enforces a total compute budget by capping the number of tokens that participate in the self-attention and MLP computations at each layer, with participation decided by a top-$k$ routing mechanism. Because $k$ is fixed, the computation graph remains static with known tensor sizes, while the allocation of FLOPs stays dynamic and context-sensitive at the token level. Models trained with MoD match baseline performance while requiring fewer FLOPs per forward pass, leading to faster inference. The paper also discusses implementation details, including the learned routing mechanism and the integration of MoD with Mixture-of-Experts (MoE) models. The results show that MoD can improve speed, and in some configurations quality, without sacrificing overall model performance, making it a valuable tool for optimizing transformer-based language models.
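To make the top-$k$ routing idea concrete, the sketch below shows one way a MoD-style block could look in PyTorch. It is an illustrative approximation under stated assumptions, not the paper's reference implementation: the class and parameter names (`MoDBlock`, `capacity`) are hypothetical, the gate uses a sigmoid of the router score for simplicity, and causal masking and the ordering of routed tokens are omitted for brevity. A learned scalar router scores every token, only the top-$k$ tokens per sequence pass through the block's attention and MLP, and the remaining tokens flow around the block via the residual stream.

```python
import torch
import torch.nn as nn


class MoDBlock(nn.Module):
    """Illustrative Mixture-of-Depths-style transformer block (hypothetical sketch)."""

    def __init__(self, d_model: int, n_heads: int, capacity: float = 0.125):
        super().__init__()
        self.capacity = capacity                      # fraction of tokens processed per block
        self.router = nn.Linear(d_model, 1)           # learned scalar routing score per token
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        b, s, d = x.shape
        k = max(1, int(self.capacity * s))            # fixed k -> static shapes, known tensor sizes

        scores = self.router(x).squeeze(-1)           # (b, s) routing scores
        weights, idx = torch.topk(scores, k, dim=-1)  # top-k tokens per sequence
        gather_idx = idx.unsqueeze(-1).expand(b, k, d)

        routed = torch.gather(x, 1, gather_idx)       # (b, k, d): only routed tokens

        # Expensive computations run only on the k routed tokens.
        h = self.norm1(routed)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        h = routed + attn_out
        h = h + self.mlp(self.norm2(h))

        # Gate the block's update by the router score so routing gets gradients
        # (sigmoid gate is an assumption here), then scatter the update back.
        # Non-routed tokens pass through unchanged via the residual stream.
        update = torch.sigmoid(weights).unsqueeze(-1) * (h - routed)
        return x.scatter_add(1, gather_idx, update)
```

Keeping $k$ constant is what preserves a static computation graph: the attention and MLP always operate on tensors of shape `(batch, k, d_model)`, so hardware utilization stays predictable even though which tokens occupy those slots changes per sequence and per layer.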