Language models align with human judgments on key grammatical constructions

February 6, 2024 | Jennifer Hu, Kyle Mahowald, Gary Lupyan, Anna Ivanova, Roger Levy
Large language models (LLMs) align with human judgments on key grammatical constructions. The study re-evaluates LLM performance on grammaticality judgments, challenging previous findings suggesting that LLMs exhibit a "yes-response bias" and fail to distinguish grammatical from ungrammatical sentences.

The researchers used minimal-pair analysis, which compares the probabilities a model assigns to a sentence and a minimally different counterpart, to assess LLMs' linguistic knowledge. This method revealed that LLMs, particularly davinci2 and davinci3, perform at or near ceiling on most grammatical constructions; the main exception is center embedding, where humans also perform near chance. Surprisal differences between sentences and their minimal-pair counterparts predict human responses: less surprising sentences are more likely to be judged grammatical. The study also found systematic variation in human acceptability judgments, indicating that variability in acceptability, rather than performance factors, better explains the observed response patterns.

Furthermore, the task in earlier work differed subtly between models and humans: models were prompted for open-ended responses, while humans provided binary judgments. When both were evaluated with the same prompt, LLMs showed strong and human-like grammatical generalization, with GPT-3.5 Turbo and GPT-4 outperforming humans under DGL's normative grammaticality coding. Overall, the study concludes that LLMs demonstrate strong and human-like grammatical generalization capabilities.
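To make the minimal-pair method concrete, below is a minimal sketch of surprisal scoring. It uses GPT-2 as a freely available stand-in for the davinci models scored in the study, and the example sentence pair is illustrative, not drawn from the paper's materials: the model is taken to "prefer" whichever member of the pair it assigns lower total surprisal.

```python
# Minimal-pair surprisal scoring: a sketch using GPT-2 as a stand-in
# for the GPT-3 models (davinci2, davinci3) evaluated in the study.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def surprisal(sentence: str) -> float:
    """Total surprisal (negative log probability, in nats) of a string."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=input_ids, the model returns the mean negative
        # log-likelihood per predicted token (labels shifted internally).
        loss = model(ids, labels=ids).loss
    return loss.item() * (ids.shape[1] - 1)  # mean NLL -> total NLL

# Hypothetical minimal pair differing only in subject-verb agreement.
grammatical = "The keys to the cabinet are on the table."
ungrammatical = "The keys to the cabinet is on the table."

s_good, s_bad = surprisal(grammatical), surprisal(ungrammatical)
print(f"grammatical:   {s_good:.2f} nats")
print(f"ungrammatical: {s_bad:.2f} nats")
# The model distinguishes the pair if the grammatical variant is less surprising.
print("model prefers grammatical variant:", s_good < s_bad)
```

Because this scores string probabilities directly rather than prompting for a yes/no judgment, it sidesteps the response-bias issue the study raises with metalinguistic prompts.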