Auxiliary task demands mask the capabilities of smaller language models

2024 | Jennifer Hu, Michael C. Frank
The paper explores how task demands affect the performance of language models (LMs), similar to how they affect children's cognitive development. It argues that evaluations with higher task demands can mask the true abilities of smaller or less capable models. The study shows that for tasks like analogical reasoning, reflective reasoning, word prediction, and grammaticality judgments, models with fewer parameters and less training data perform worse under high-demand evaluations. This "demand gap" is more pronounced in less capable models. The research highlights that LM performance should not be interpreted as a direct measure of intelligence but rather as a reflection of their capabilities under specific evaluation conditions. The study uses various evaluation methods, including production vs. forced choice and metalinguistic judgment vs. probability measurement, to demonstrate how task demands influence model performance. Results show that as models become larger and have more training data, the demand gap decreases. The findings suggest that evaluation methods significantly impact how we interpret LM abilities, and that researchers should consider these factors when designing evaluations. The study also emphasizes the importance of valid evaluation methods in understanding LM capabilities and aligns with broader efforts in cognitive science and NLP to develop robust and fair evaluation practices.
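To make the contrast between low-demand and high-demand evaluations concrete, the following is a minimal sketch (not the authors' code) of a grammaticality-judgment comparison: a low-demand evaluation reads off sentence probabilities directly, while a high-demand evaluation asks the model for a metalinguistic yes/no judgment. The model choice ("gpt2"), prompt wording, and example sentences are illustrative assumptions.

```python
# Sketch: low-demand (probability comparison) vs. high-demand (metalinguistic prompt)
# evaluation of a grammaticality contrast. Assumes the HuggingFace transformers library;
# "gpt2" and the prompt text are illustrative choices, not the paper's exact setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def sentence_logprob(sentence: str) -> float:
    """Sum of token log-probabilities under the model (the low-demand measure)."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Score each token given its left context.
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = ids[:, 1:]
    return log_probs.gather(-1, targets.unsqueeze(-1)).sum().item()

grammatical = "The keys to the cabinet are on the table."
ungrammatical = "The keys to the cabinet is on the table."

# Low-demand evaluation: the model "knows" the contrast if it assigns
# higher probability to the grammatical sentence.
prefers_grammatical = sentence_logprob(grammatical) > sentence_logprob(ungrammatical)
print("Probability comparison favors grammatical sentence:", prefers_grammatical)

# High-demand evaluation: a metalinguistic prompt layers extra task demands
# (instruction following, answer formatting) on top of the same knowledge.
prompt = (
    "Is the following sentence grammatical? Answer Yes or No.\n"
    f'Sentence: "{ungrammatical}"\nAnswer:'
)
prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
with torch.no_grad():
    next_logits = model(prompt_ids).logits[0, -1]
yes_id = tokenizer(" Yes").input_ids[0]
no_id = tokenizer(" No").input_ids[0]
print("Metalinguistic judgment says ungrammatical:", bool(next_logits[no_id] > next_logits[yes_id]))
```

Under this kind of setup, the paper's "demand gap" corresponds to smaller models passing the probability comparison while failing the metalinguistic version of the same item, with the gap shrinking as model scale and training data increase.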