February 6, 2024 | Jennifer Hu, Kyle Mahowald, Gary Lupyan, Anna Ivanova, Roger Levy
A recent study challenges the findings of Dentella et al. (2023), which claimed that large language models (LLMs) exhibit a "yes-response bias" and fail to distinguish grammatical from ungrammatical sentences. Re-evaluating LLM performance with established methods, the researchers argue that Dentella et al.'s own data show LLMs aligning well with human linguistic judgments. They emphasize that children can produce grammatical sentences without being able to articulate the underlying rules, a distinction between linguistic ability and metalinguistic judgment that Dentella et al. overlook.

Rather than relying on models' metalinguistic skills, the study measures the probabilities models assign to strings. Using minimal pairs, sentences that differ only in a single linguistic feature, the researchers show that models such as davinci2 and davinci3 perform at or near ceiling, except for center embedding, where both humans and models struggle. Minimal-pair surprisal differences also correlate with human judgments, indicating that the models capture human-like linguistic generalizations.

The study further highlights variability in human acceptability judgments, suggesting that some sentences labeled ungrammatical by Dentella et al. may be acceptable to some speakers. When models are evaluated with the same prompt given to humans, the "yes" bias disappears for all models except davinci2, and GPT-3.5 Turbo and GPT-4 even outperform humans under Dentella et al.'s (DGL's) own response coding. Overall, the study concludes that LLMs demonstrate strong, human-like grammatical generalization.
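To make the probability-based evaluation concrete, here is a minimal sketch of a minimal-pair surprisal comparison of the kind described above. It uses an open Hugging Face causal LM (gpt2) as a stand-in for the davinci models queried in the study, and the example sentence pair is an illustrative assumption, not one of the study's actual stimuli.

```python
# Minimal-pair surprisal comparison: does the model assign lower surprisal
# (i.e., higher probability) to the grammatical member of a minimal pair?
# gpt2 is a stand-in here; the study evaluated OpenAI davinci-series models.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def sentence_surprisal(sentence: str) -> float:
    """Total surprisal (negative log probability, in nats) of a sentence."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        # Passing labels=ids returns the mean cross-entropy over predicted
        # tokens; multiply by the number of predictions to get the sum.
        loss = model(ids, labels=ids).loss
    return loss.item() * (ids.size(1) - 1)

# Illustrative minimal pair (hypothetical, not from the study's materials).
grammatical = "The keys to the cabinet are on the table."
ungrammatical = "The keys to the cabinet is on the table."

s_good = sentence_surprisal(grammatical)
s_bad = sentence_surprisal(ungrammatical)
print(f"grammatical:   {s_good:.2f} nats")
print(f"ungrammatical: {s_bad:.2f} nats")
print("model prefers grammatical sentence:", s_good < s_bad)
```

A minimal-pair evaluation then scores a pair as correct when the grammatical sentence receives lower surprisal; this probability comparison is the kind of measure the re-evaluation uses in place of yes/no metalinguistic prompts.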