Auxiliary task demands mask the capabilities of smaller language models

29 Jul 2024 | Jennifer Hu, Michael C. Frank
The paper examines how task demands affect the evaluation of language models (LMs), particularly when comparing models of differing capability. In developmental psychology, task demands are auxiliary challenges that can mask a child's underlying cognitive abilities; the authors argue that the same logic applies to LM evaluations. Higher task demands can depress measured performance, especially for less capable models, a phenomenon they term the "demand gap." They investigate this gap across several tasks (analogical reasoning, reflective reasoning, word prediction, and grammaticality judgments) and a range of open-source LMs.

The results show that the demand gap is more pronounced for models with fewer parameters and less training data, and that it shrinks as parameter count grows. Analyzing the interaction between training time and task demands, the authors further show that lower-demand evaluation methods can reveal abilities earlier in training. The findings suggest that LM performance should be interpreted as a reflection of capacities as seen through researchers' design choices, rather than as a direct readout of intelligence. The paper concludes by emphasizing the importance of accounting for task demands in LM evaluations to support valid inferences about cognitive capacities.
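To make the contrast concrete, the sketch below implements the two styles of evaluation in minimal form using the HuggingFace transformers API. The checkpoint name ("gpt2") and the analogy prompt are illustrative stand-ins, not the paper's actual models or stimuli; the general pattern (a low-demand probe that scores answer options by log-probability versus a high-demand probe that requires free generation) is the point.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# "gpt2" is an illustrative stand-in; any open-source causal LM checkpoint works.
MODEL_NAME = "gpt2"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()


def option_logprob(prompt: str, option: str) -> float:
    """Sum the log-probabilities the model assigns to the option tokens,
    conditioned on the prompt (the low-demand probe)."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # logits[:, i] predicts token i + 1, so drop the last position.
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    # Score only the option's tokens. This assumes the prompt's tokenization
    # is a prefix of the concatenation's, which holds for prompts ending in
    # punctuation and options starting with a space (as here).
    return sum(
        log_probs[0, pos - 1, tok].item()
        for pos, tok in enumerate(full_ids[0, prompt_len:], start=prompt_len)
    )


prompt = "Question: Hot is to cold as up is to\nAnswer:"  # illustrative item
options = [" down", " left"]

# Low-demand evaluation: compare the options' probabilities directly.
scores = {opt: option_logprob(prompt, opt) for opt in options}
low_demand_answer = max(scores, key=scores.get)

# High-demand evaluation: ask the model to produce the answer as free text,
# which layers production demands on top of the underlying knowledge.
prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
gen_ids = model.generate(
    prompt_ids,
    max_new_tokens=5,
    do_sample=False,
    pad_token_id=tokenizer.eos_token_id,
)
high_demand_answer = tokenizer.decode(
    gen_ids[0, prompt_ids.shape[1]:], skip_special_tokens=True
)

print(f"low-demand choice:  {low_demand_answer!r}")
print(f"high-demand output: {high_demand_answer!r}")
```

Run over a benchmark, per-model accuracy under each probe can then be compared; the demand gap is, roughly, the performance difference between the lower-demand and higher-demand versions of the same task, and the paper's finding is that this difference is largest for small, lightly trained models.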