2024 | Alexis Chevalier, Jiayi Geng, Alexander Wettig, Howard Chen, Sebastian Mizera, Toni Annala, Max Jameson Aragon, Arturo Rodriguez Fanlo, Simon Frieder, Simon Machado, Akshara Prabhakar, Ellie Thieu, Jiachen T. Wang, Zirui Wang, Xindi Wu, Mengzhou Xia, Wenhan Xia, Jiatong Yu, Jun-Jie Zhu, Zhiyong Jason Ren, Sanjeev Arora, Danqi Chen
The paper introduces TUTOREVAL and TUTORCHAT, two datasets designed to evaluate and improve the scientific problem-solving capabilities of language models (LMs). TUTOREVAL is a diverse question-answering benchmark with expert-written questions about long chapters from STEM textbooks. It measures the real-life usability of LMs as scientific assistants and is the first benchmark to combine long contexts, free-form generation, and multidisciplinary scientific knowledge. TUTORCHAT is a synthetic dialogue dataset of 80,000 long conversations about textbook chapters, covering STEM topics, the humanities, and the social sciences. The paper shows that fine-tuning base models on existing dialogue datasets yields poor performance on TUTOREVAL; fine-tuning LMs with 7B and 34B parameters on TUTORCHAT instead produces specialized LM tutors that excel on TUTOREVAL and perform strongly on math and general knowledge benchmarks. The paper also introduces MathMix, a data mixture combining STEM dialogues with math data, which leads to well-rounded LM tutors with strong math problem-solving skills. The authors release their models, data, and evaluations publicly to advance NLP and encourage the development of LMs as useful scientific assistants.
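To make the MathMix idea concrete, here is a minimal sketch of building a fine-tuning mixture that interleaves STEM dialogues with math data, using the Hugging Face datasets library. The dataset identifiers, the "text" column, and the 50/50 mixing ratio are illustrative assumptions, not the paper's published paths or settings.

    # Hypothetical sketch of a MathMix-style data mixture.
    # Dataset paths and the mixing ratio are placeholders, not the authors' settings.
    from datasets import load_dataset, interleave_datasets

    # Illustrative sources: TutorChat-style STEM dialogues plus a math corpus.
    stem_dialogues = load_dataset("example/tutorchat", split="train")
    math_data = load_dataset("example/math-corpus", split="train")

    # interleave_datasets requires matching features, so keep a shared field
    # (assuming both sets expose a "text" column).
    stem_dialogues = stem_dialogues.select_columns(["text"])
    math_data = math_data.select_columns(["text"])

    # Sample from both sources so each training batch mixes chat and math;
    # the paper's actual proportions may differ.
    mathmix = interleave_datasets(
        [stem_dialogues, math_data],
        probabilities=[0.5, 0.5],
        seed=42,
    )
    print(mathmix)

The resulting mixed dataset can then be passed to a standard fine-tuning loop; the point of the sketch is only that MathMix combines two data sources at a fixed sampling ratio rather than training on them sequentially.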