Elements of World Knowledge (EWoK): A cognition-inspired framework for evaluating basic world knowledge in language models

15 May 2024 | Anna A. Ivanova, Aalok Sathe, Benjamin Lipkin, Unnathi Kumar, Setayesh Radkani, Thomas H. Clark, Carina Kauf, Jennifer Hu, R. T. Pramod, Gabriel Grand, Vivian Paulun, Maria Ryskina, Ekin Akyürek, Ethan Wilcox, Nafisa Rashid, Leshem Choshen, Roger Levy, Evelina Fedorenko, Joshua Tenenbaum, Jacob Andreas
The paper introduces Elements of World Knowledge (EWoK), a framework for evaluating the world-modeling capabilities of large language models (LLMs). EWoK tests whether a model can use knowledge of specific concepts to match a target text with a plausible rather than an implausible context. The framework draws concepts from multiple knowledge domains, including social interactions and spatial relations, and uses minimal pairs of contexts and targets to generate controlled datasets. The authors release EWoK-CORE-1.0, a dataset of 4,374 items covering 11 world knowledge domains, and evaluate 20 LLMs (ranging from 1.3B to 70B parameters) under several evaluation paradigms, including LOGPROBS, LIKERT, and CHOICE. All tested models perform worse than humans, with substantial variation across domains.
The study highlights the need for targeted research to improve LLMs' world-modeling capabilities and provides a flexible framework for future experiments and interpretability research.
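The minimal-pair setup in the LOGPROBS paradigm can be sketched as follows. A model is scored as correct on an item when it assigns a higher log-probability to the target text under the plausible context than under the implausible one. The scorer below is a toy stand-in (a smoothed word-overlap pseudo-log-probability), not the paper's actual method; in practice the score function would query an LLM for conditional token log-probabilities, and the item shown is a made-up illustration, not drawn from EWoK-CORE-1.0.

```python
import math


def toy_logprob(context: str, target: str) -> float:
    """Toy pseudo-log-probability: log of smoothed word overlap between
    context and target. A real evaluation would replace this with an LLM's
    log P(target | context)."""
    ctx_words = {w.strip(".,").lower() for w in context.split()}
    tgt_words = [w.strip(".,").lower() for w in target.split()]
    overlap = sum(1 for w in tgt_words if w in ctx_words)
    return math.log((overlap + 1) / (len(tgt_words) + 1))


def item_correct(plausible_ctx: str, implausible_ctx: str, target: str,
                 score_fn=toy_logprob) -> bool:
    # Correct iff the target is more likely under the plausible context.
    return score_fn(plausible_ctx, target) > score_fn(implausible_ctx, target)


# Hypothetical minimal-pair item: two contexts differing in one concept,
# plus a target compatible with only one of them.
items = [
    ("The cup is on the table.",       # plausible context
     "The cup is under the table.",    # implausible context
     "The cup is resting on the table surface."),  # target
]

accuracy = sum(item_correct(p, i, t) for p, i, t in items) / len(items)
print(accuracy)
```

Because each item is a controlled minimal pair, chance accuracy is 0.5, which makes the gap between model and human performance directly interpretable.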