25 Jun 2024 | Jiangshu Du, Yibo Wang, Wenting Zhao, Zhongfen Deng, Shuaiqi Liu, Renze Lou, Henry Peng Zou, Pranav Narayanan Venkit, Nan Zhang, Mukund Srinath, Haoran Ranran Zhang, Vipul Gupta, Yinghui Li, Tao Li, Fei Wang, Qin Liu, Tianlin Liu, Pengzhi Gao, Congying Xia, Chen Xing, Jiayang Cheng, Zhaowei Wang, Ying Su, Raj Sanjay Shah, Ruohao Guo, Jing Gu, Haoran Li, Kangda Wei, Zihao Wang, Lu Cheng, Surangika Ranathunga, Meng Fang, Jie Fu, Fei Liu, Ruihong Huang, Eduardo Blanco, Yixin Cao, Rui Zhang, Philip S. Yu, Wenpeng Yin
This paper explores the potential of large language models (LLMs) to assist researchers, particularly in paper (meta-)reviewing. The authors pursue two goals: enabling better recognition of instances where LLMs are implicitly used for reviewing activities, and raising community awareness that LLMs remain inadequate for tasks demanding a high level of expertise and nuanced judgment.
To achieve these goals, the authors constructed the ReviewCritique dataset, which includes NLP papers with both human-written and LLM-generated reviews, along with detailed deficiency labels and explanations annotated by experts. The dataset is used to explore two research questions: "LLMs as Reviewers" and "LLMs as Metareviewers."
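Based on the description above, a single ReviewCritique annotation might conceptually look like the record sketched below. This is a minimal sketch; the field and class names (`ReviewSegment`, `AnnotatedReview`, etc.) and the example values are purely illustrative assumptions, not the dataset's actual schema.

```python
# Hypothetical record structure for ReviewCritique (names are assumptions, not
# the released schema): each review is split into segments, and each segment
# carries an expert deficiency judgment plus a free-text explanation.
from dataclasses import dataclass


@dataclass
class ReviewSegment:
    text: str            # one segment (e.g., a sentence) of the review
    is_deficient: bool   # expert label: does this segment have a flaw?
    explanation: str     # expert's explanation of the deficiency (empty if none)


@dataclass
class AnnotatedReview:
    paper_id: str                  # which NLP paper the review is about
    source: str                    # "human" or "llm"
    segments: list[ReviewSegment]  # segment-level expert annotations


example = AnnotatedReview(
    paper_id="paper_0001",
    source="llm",
    segments=[
        ReviewSegment(
            text="The method is not novel.",
            is_deficient=True,
            explanation="Claim is unsupported; no prior work is cited.",
        )
    ],
)
```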
1. **LLMs as Reviewers**: The study compares the quality and distinguishability of LLM-generated and human-written reviews, and proposes a novel metric to measure the diversity of LLM-generated reviews (see the illustrative sketch after this list). The findings indicate that LLMs generate more deficient review segments than humans and often produce reviews that lack paper-specific detail, diversity, and constructive feedback.
2. **LLMs as Metareviewers**: The study evaluates LLMs' ability to identify deficient segments in human-written reviews, in contrast to prior work that treats meta-review generation as a text summarization task. The results show that even top-tier LLMs struggle to match human experts in assessing individual reviews, highlighting the importance of human expertise in meta-reviewing.
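The summary mentions a novel diversity metric for LLM-generated reviews but does not specify it. Below is a minimal, hypothetical sketch of one common way to quantify review diversity, via mean pairwise cosine similarity of sentence embeddings; the `review_diversity` function, the model name `all-MiniLM-L6-v2`, and the example reviews are illustrative assumptions, not the paper's actual method.

```python
# Illustrative proxy for review diversity: 1 minus the mean pairwise cosine
# similarity of review embeddings (higher = more diverse). Not the paper's metric.
from itertools import combinations

import numpy as np
from sentence_transformers import SentenceTransformer


def review_diversity(reviews: list[str]) -> float:
    """Return 1 - mean pairwise cosine similarity over all review pairs."""
    model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed general-purpose encoder
    embeddings = model.encode(reviews, normalize_embeddings=True)  # unit vectors
    sims = [
        float(np.dot(embeddings[i], embeddings[j]))
        for i, j in combinations(range(len(reviews)), 2)
    ]
    return 1.0 - float(np.mean(sims))


if __name__ == "__main__":
    llm_reviews = [
        "The paper is well written but the experiments lack baselines.",
        "Well written paper; however, the experimental comparison is limited.",
        "Clear writing, but baseline comparisons are insufficient.",
    ]
    print(f"diversity score: {review_diversity(llm_reviews):.3f}")
```

Under this kind of proxy, near-identical reviews score close to zero, which is consistent with the finding summarized above that LLM-generated reviews tend to lack diversity.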
The paper concludes by discussing the limitations of the current study, such as its focus on the NLP domain and the pre-rebuttal phase of the peer-review process, and suggests future directions for expanding the dataset and exploring LLMs' capabilities in other research areas.