This paper explores cascading strategies for language models (LMs) to improve cost-quality trade-offs: a small model handles easy instances, while hard instances are deferred to a larger model. The authors address the challenge of designing effective deferral rules for generative LM tasks, where simple sequence-level confidence measures such as Chow-Sum and Chow-Average suffer from length bias. They propose instead using token-level uncertainty, specifically quantiles over the sequence of per-token uncertainty values, to capture finer-grained information about where the small model is unsure. Experiments on a range of NLP benchmarks with FLAN-T5 models show that learned post-hoc deferral rules based on these quantiles outperform simple aggregation strategies, and that incorporating embeddings from both the smaller and larger models further improves performance. The paper also discusses the limitations of standard deferral rules and the benefits of using intermediate embeddings from the larger model. Overall, the work provides a systematic study of deferral rules for LM cascades and demonstrates significant improvements in cost-quality trade-offs.
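The deferral scores discussed above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes we already have the small model's per-token log-probabilities for a generated sequence, and all function names and the threshold convention are illustrative.

```python
import numpy as np

def chow_sum(token_logprobs):
    # Chow-Sum: sequence log-likelihood, the sum of token log-probs.
    # Longer sequences accumulate more negative terms, hence length bias.
    return float(np.sum(token_logprobs))

def chow_average(token_logprobs):
    # Chow-Average: length-normalized confidence (mean token log-prob).
    # Normalization removes one bias but can over-favor long outputs
    # with many high-confidence filler tokens.
    return float(np.mean(token_logprobs))

def quantile_features(token_logprobs, qs=(0.1, 0.25, 0.5, 0.75, 0.9)):
    # Quantiles of the token-level uncertainty distribution; these serve
    # as input features to a learned post-hoc deferral rule, capturing
    # finer-grained information than a single sum or mean.
    return np.quantile(np.asarray(token_logprobs), qs)

def should_defer(confidence_score, threshold):
    # Defer to the larger model when the small model's confidence falls
    # below a threshold chosen to meet a given cost budget.
    return confidence_score < threshold
```

Sweeping the threshold over a validation set traces out the cost-quality trade-off curve: a higher threshold defers more instances to the large model, raising both quality and cost.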