17 Oct 2022 | Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V. Le, Ed H. Chi, Denny Zhou, Jason Wei
This paper explores the capabilities and limitations of large language models on a subset of 23 especially challenging tasks from the BIG-Bench benchmark, called BIG-Bench Hard (BBH); these are tasks on which prior language-model evaluations failed to outperform the average human rater. The authors compare two few-shot prompting techniques: standard "answer-only" prompting and chain-of-thought (CoT) prompting. They find that CoT prompting substantially improves performance, enabling the Codex model to surpass average human-rater performance on 17 of the 23 tasks, versus 5 tasks with answer-only prompting. The study also shows that the benefit of CoT prompting depends on model scale, with larger models gaining more. Moreover, on several tasks whose answer-only scaling curves are flat, CoT prompting unlocks emergent performance once the model is sufficiently large. The paper closes by discussing the limitations of CoT prompting and the challenges and opportunities it raises for future research.
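To make the comparison concrete, here is a minimal sketch contrasting the two prompt formats. The task and exemplar wording below are illustrative only, not the paper's actual BBH prompts (the official 3-shot exemplars ship with the BBH release); the sketch just shows the structural difference the paper studies.

```python
# Illustrative (not from the paper): the same few-shot question under
# answer-only prompting versus chain-of-thought prompting.

QUESTION = "Take the last letters of the words in 'Elon Musk' and concatenate them."

# Answer-only prompting: the exemplar maps the question directly to an answer.
answer_only_prompt = (
    "Q: Take the last letters of the words in 'Bill Gates' and concatenate them.\n"
    "A: ls\n\n"
    f"Q: {QUESTION}\n"
    "A:"
)

# Chain-of-thought prompting: the exemplar spells out intermediate reasoning
# steps before the final answer, prompting the model to do the same.
cot_prompt = (
    "Q: Take the last letters of the words in 'Bill Gates' and concatenate them.\n"
    "A: The last letter of 'Bill' is 'l'. The last letter of 'Gates' is 's'. "
    "Concatenating them gives 'ls'. The answer is ls.\n\n"
    f"Q: {QUESTION}\n"
    "A:"
)

print(answer_only_prompt)
print("---")
print(cot_prompt)
```

Both prompts end with an unanswered "A:", so the model's completion is either a bare answer or a reasoning trace followed by an answer; the paper's central finding is how much that difference matters on BBH.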