Language Models as Science Tutors

2024 | Alexis Chevalier, Jiayi Geng, Alexander Wettig, Howard Chen, Sebastian Mizera, Toni Annala, Max Jameson Aragon, Arturo Rodríguez Fanlo, Simon Frieder, Simon Machado, Akshara Prabhakar, Ellie Thieu, Jiachen T. Wang, Zirui Wang, Xindi Wu, Mengzhou Xia, Wenhan Xia, Jiatong Yu, Jun-Jie Zhu, Zhiyong Jason Ren, Sanjeev Arora, Danqi Chen
This paper introduces TUTOREVAL and TUTORCHAT, two new resources for evaluating and training language models (LMs) as science tutors. TUTOREVAL is a question-answering benchmark consisting of questions about long chapters from STEM textbooks, designed to measure the real-life usability of LMs as scientific assistants. TUTORCHAT is a large synthetic dataset of 80,000 long dialogues grounded in textbooks, created to improve performance on TUTOREVAL. The paper shows that fine-tuning base models with existing dialogue datasets leads to poor performance on TUTOREVAL, whereas TUTORCHAT is a rich resource for domain-specific fine-tuning. The authors also introduce two long-context LMs, Llemma-7B-32K-MathMix and Llemma-34B-MathMix, which excel at TUTOREVAL and perform strongly on GSM8K and MATH. The paper further discusses the importance of training and fine-tuning with scientific texts, and shows that TUTORCHAT helps mitigate sycophancy in dialogues. The authors release their models, data, and evaluations publicly.