Challenging BIG-Bench tasks and whether chain-of-thought can solve them


17 Oct 2022 | Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V. Le, Ed H. Chi, Denny Zhou, Jason Wei
This paper investigates the performance of language models on challenging tasks from the BIG-Bench benchmark, focusing on whether chain-of-thought (CoT) prompting can improve performance on tasks where prior language-model evaluations did not outperform the average human rater. The authors identify 23 such particularly difficult tasks, which they call BIG-Bench Hard (BBH). They find that applying CoT prompting to BBH tasks enables PaLM to surpass average human-rater performance on 10 of the 23 tasks, and Codex (code-davinci-002) to surpass it on 17 of the 23 tasks.

The study shows that CoT prompting substantially improves performance on tasks requiring multi-step reasoning, and that the gains from CoT prompting emerge only at sufficient model scale: on several BBH tasks that exhibit flat scaling curves under standard answer-only prompting, CoT prompting enables emergent task performance. However, CoT prompting does not unlock emergent performance on all such tasks; some still require more capable models or better prompting techniques. The findings highlight the importance of considering model scale and prompting strategy when evaluating language models, and suggest that CoT prompting is a key technique for unlocking emergent capabilities in large language models. The authors release the data and prompts used in the work, as well as the outputs from the Codex models, to facilitate further research.
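The paper's central comparison is between standard few-shot (answer-only) prompting and few-shot CoT prompting, where each exemplar shows intermediate reasoning before the final answer. Below is a minimal, illustrative sketch of how such prompts can be assembled for a BBH-style task. The exemplar text, the task instruction, and the `build_prompt` helper are assumptions for illustration only and are not taken from the paper's released prompts.

```python
# Minimal sketch (not the paper's released code) contrasting answer-only and
# chain-of-thought (CoT) few-shot prompt construction for a BBH-style task.
# Exemplar content below is hypothetical and chosen only to show the format.

FEW_SHOT_EXEMPLARS = [
    {
        "question": 'Is the following sentence plausible? "The runner stole third base."',
        "reasoning": (
            "Stealing third base is a common action in baseball, and a runner is a "
            "baseball role, so the sentence is plausible. So the answer is yes."
        ),
        "answer": "yes",
    },
    # ... more exemplars would follow here (the paper uses 3 per task) ...
]


def build_prompt(task_instruction: str, test_question: str, use_cot: bool) -> str:
    """Concatenate a task instruction, few-shot exemplars, and the test question."""
    lines = [task_instruction, ""]
    for ex in FEW_SHOT_EXEMPLARS:
        lines.append(f"Q: {ex['question']}")
        if use_cot:
            # CoT exemplars include the intermediate reasoning before the answer.
            lines.append(f"A: Let's think step by step. {ex['reasoning']}")
        else:
            # Answer-only exemplars map the question directly to the final answer.
            lines.append(f"A: {ex['answer']}")
        lines.append("")
    lines.append(f"Q: {test_question}")
    # For CoT, the model is cued to generate reasoning before its answer.
    lines.append("A:" + (" Let's think step by step." if use_cot else ""))
    return "\n".join(lines)


if __name__ == "__main__":
    instruction = "Determine whether a sentence about sports is plausible."
    question = 'Is the following sentence plausible? "The goalie hit a home run."'
    print(build_prompt(instruction, question, use_cot=True))
```

In the paper's setup, the resulting prompt string would be sent to a model such as PaLM or code-davinci-002, and the text generated after the final "A:" would be parsed for the answer; under answer-only prompting only the answer is expected, while under CoT prompting the reasoning is generated first.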