[slides] Can large language models help augment English psycholinguistic datasets?

Can large language models help augment English psycholinguistic datasets?

Sean Trott

Abstract: Research on language and cognition relies extensively on psycholinguistic datasets or "norms". These datasets contain judgments of lexical properties like concreteness and age of acquisition, and can be used to norm experimental stimuli, discover empirical relationships in the lexicon, and stress-test computational models. However, collecting human judgments at scale is both time-consuming and expensive. This issue of scale is compounded for multi-dimensional norms and those incorporating context. The current work asks whether large language models (LLMs) can be leveraged to augment the creation of large, psycholinguistic datasets in English. I use GPT-4 to collect multiple kinds of semantic judgments (e.g., word similarity, contextualized sensorimotor associations, iconicity) for English words and compare these judgments against the human "gold standard". For each dataset, I find that GPT-4’s judgments are positively correlated with human judgments, in some cases rivaling or even exceeding the average inter-annotator agreement displayed by humans. I then identify several ways in which LLM-generated norms differ from human-generated norms systematically. I also perform several "substitution analyses", which demonstrate that replacing human-generated norms with LLM-generated norms in a statistical model does not change the sign of parameter estimates (though in select cases, there are significant changes to their magnitude). I conclude by discussing the considerations and limitations associated with LLM-generated norms in general, including concerns of data contamination, the choice of LLM, external validity, construct validity, and data quality. Additionally, all of GPT-4’s judgments (over 30,000 in total) are made available online for further analysis.
Keywords: Dataset · Psycholinguistic resource · Large language models · ChatGPT

Introduction: Research on language and cognition relies extensively on large, psycholinguistic datasets, sometimes called "norms". These datasets contain information about various properties of words and sentences, including concreteness, sensorimotor associations, affect, semantic similarity, iconicity, and more. Building these datasets is often time-consuming and expensive. One possible solution is to augment the construction of psycholinguistic datasets using computational tools, such as Large Language Models (LLMs), to reduce this difficulty; this approach is seeing growing popularity in related fields. However, the empirical question of whether and to what extent LLMs can reliably capture psycholinguistic judgments remains unanswered. In this paper, I apply LLMs to several major psycholinguistic datasets, quantify their performance, and discuss the advantages and disadvantages of this approach.

Why do psycholinguists need psycholinguistic norms? These norms have multiple uses. First, experimentalists can use this information to normalize (or "norm") their stimuli. For example, a researcher designing a lexical decision task might ensure that the words in each condition are "matched"

23 January 2024 | Sean Trott