This paper investigates deferral rules for language model (LM) cascades, focusing on how to measure uncertainty at the token level so as to improve the cost-quality tradeoff. While deferral rules for classification tasks are well understood, the right analogues for generative LM tasks are less clear. The authors observe that the commonly used sequence-level uncertainty scores, Chow-Sum (the summed token log-probabilities) and Chow-Average (their length-normalized mean), collapse the per-token uncertainty distribution into a single number, and they show that such scores suffer from a length bias: longer sequences are more likely to be deferred regardless of their quality. To address this, they score sequences by quantiles of the token-level uncertainty distribution, which preserve finer-grained information about how uncertainty is spread across individual tokens; this outperforms the simple aggregation rules on a range of NLP benchmarks.

The authors additionally propose a post-hoc deferral rule, a lightweight model trained on the quantile features, which further improves the cost-quality tradeoff, and they explore augmenting its inputs with embeddings from the smaller and larger models. The results demonstrate that token-level uncertainty and learned deferral rules substantially improve the effectiveness of LM cascades, particularly when output quality is hard to predict in advance. The study highlights the value of finer-grained uncertainty measures in generative LM tasks and provides a systematic approach to designing deferral rules that balance cost and quality; minimal code sketches of the main ideas follow below.
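To make the scoring rules concrete, here is a minimal sketch (not the authors' code) of how Chow-Sum, Chow-Average, and the quantile-based score can be computed from the smaller model's per-token log-probabilities. The function names, the sign convention (higher score means more uncertain), and the thresholding step are illustrative assumptions; the paper defines these scores mathematically rather than as an API.

```python
import numpy as np

def chow_sum(token_logprobs):
    # Chow-Sum: negated sum of token log-probabilities.
    # Higher score = more uncertain = defer to the larger model.
    # Note the score grows with sequence length, which is the
    # source of the length bias the paper identifies.
    return -np.sum(token_logprobs)

def chow_average(token_logprobs):
    # Chow-Average: length-normalized variant, which mitigates
    # (but per the paper does not fully fix) the length bias.
    return -np.mean(token_logprobs)

def chow_quantile(token_logprobs, q):
    # Quantile score: the q-th quantile of per-token uncertainty
    # (-log p), exposing the shape of the token-level uncertainty
    # distribution rather than collapsing it into a sum or mean.
    return np.quantile(-np.asarray(token_logprobs), q)

def should_defer(score, threshold):
    # Defer to the large model when the uncertainty score exceeds
    # a threshold, tuned on validation data for a target deferral
    # rate (i.e., a target point on the cost-quality curve).
    return score > threshold
```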
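The post-hoc deferral rule can be approximated by fitting a small classifier that predicts, from the quantile features (optionally concatenated with model embeddings), whether deferring to the larger model would help. The sketch below is an assumption-laden stand-in: the quantile grid, the use of logistic regression, the length feature, and the labeling scheme are my illustrative choices, not the paper's exact recipe.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

QUANTILES = [0.1, 0.25, 0.5, 0.75, 0.9, 1.0]  # assumed grid

def quantile_features(token_logprobs):
    # Summarize the per-token uncertainty distribution by a fixed
    # vector of quantiles, plus sequence length as an extra cue.
    u = -np.asarray(token_logprobs)
    return np.concatenate([np.quantile(u, QUANTILES), [len(u)]])

def build_features(per_example_logprobs, embeddings=None):
    feats = np.stack([quantile_features(lp) for lp in per_example_logprobs])
    if embeddings is not None:
        # Optionally append embeddings from the smaller and/or
        # larger model, as the paper explores.
        feats = np.hstack([feats, embeddings])
    return feats

def train_deferral_rule(per_example_logprobs, labels, embeddings=None):
    # labels[i] = 1 if the large model's output beats the small
    # model's on the task metric for example i (deferring helps).
    X = build_features(per_example_logprobs, embeddings)
    return LogisticRegression(max_iter=1000).fit(X, labels)

def defer(clf, token_logprobs, embedding=None, threshold=0.5):
    x = quantile_features(token_logprobs)
    if embedding is not None:
        x = np.concatenate([x, embedding])
    return clf.predict_proba(x[None, :])[0, 1] > threshold
```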
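Finally, the cost-quality tradeoff that all of these rules target is typically visualized by sweeping the deferral threshold and plotting average quality against the fraction of examples routed to the larger model. The generic evaluation sketch below is my construction for illustration, assuming per-example quality scores for both models are available on a validation set.

```python
import numpy as np

def deferral_curve(scores, small_quality, large_quality):
    # Sweep the deferral rate from 0 (always use the small model)
    # to 1 (always defer); at each rate, the highest-scoring
    # (most uncertain) examples are routed to the large model.
    order = np.argsort(-np.asarray(scores))  # most uncertain first
    small_q = np.asarray(small_quality)[order]
    large_q = np.asarray(large_quality)[order]
    n = len(order)
    rates, qualities = [], []
    for k in range(n + 1):
        # First k examples deferred; the rest are answered by
        # the small model.
        avg = (large_q[:k].sum() + small_q[k:].sum()) / n
        rates.append(k / n)
        qualities.append(avg)
    return np.array(rates), np.array(qualities)
```

A better deferral rule dominates a worse one when its curve lies above at every deferral rate, which is how the paper's quantile-based and learned rules are compared against Chow-Sum and Chow-Average.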