LLMs Assist NLP Researchers: Critique Paper (Meta-)Reviewing

25 Jun 2024 | Jiangshu Du, Yibo Wang, Wenting Zhao, Zhongfen Deng, Shuaiqi Liu, Renze Lou, Henry Peng Zou, Pranav Narayanan Venkit, Nan Zhang, Mukund Srinath, Haoran Ranran Zhang, Vipul Gupta, Yinghui Li, Tao Li, Fei Wang, Qin Liu, Tianlin Liu, Pengzhi Gao, Congying Xia, Chen Xing, Jiayang Cheng, Zhaowei Wang, Ying Su, Raj Sanjay Shah, Ruohao Guo, Jing Gu, Haoran Li, Kangda Wei, Zihao Wang, Lu Cheng, Surangika Ranathunga, Meng Fang, Jie Fu, Fei Liu, Ruihong Huang, Eduardo Blanco, Yixin Cao, Rui Zhang, Philip S. Yu, Wenpeng Yin
This paper investigates the potential of large language models (LLMs) to assist natural language processing (NLP) researchers, focusing on paper (meta-)reviewing. The authors argue that LLMs are not yet capable of tasks that demand deep expertise and nuanced judgment, and they therefore present a comparative analysis aimed at distinguishing the reviewing activities LLMs can handle from those that still require human experts.

The study introduces the ReviewCritique dataset, which pairs human-written and LLM-generated reviews of NLP papers with detailed, sentence-level annotations of review deficiencies. The dataset was curated by selecting NLP papers submitted to top-tier AI venues, collecting both human-written and LLM-generated reviews of them, and having experts annotate the result: each review was segmented into sentences, and each sentence was labeled as deficient or not, together with an explanation. This enables a fine-grained analysis of LLMs' performance both as reviewers and as meta-reviewers.

The study addresses two research questions: (i) How do LLM-generated reviews compare with human-written reviews in terms of quality and distinguishability? (ii) How effectively can LLMs identify deficient segments in human-written reviews?
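To make the annotation and evaluation setup concrete, here is a minimal sketch of how sentence-level deficiency labels could be represented and how a model's flagged segments could be scored as binary classification. The field names and the precision/recall/F1 scoring below are illustrative assumptions for this summary, not the paper's exact schema or official evaluation code.

```python
# Illustrative sketch (not the official ReviewCritique format or code):
# each review is segmented into sentences, and each sentence carries a
# binary "deficient" label plus a free-text explanation from expert annotators.
from dataclasses import dataclass

@dataclass
class ReviewSegment:
    sentence: str
    deficient: bool          # expert label: is this segment a review deficiency?
    explanation: str = ""    # annotator's justification (empty if not deficient)

def score_deficiency_detection(gold, predicted):
    """Score a model that flags deficient segments, treated as sentence-level
    binary classification. `gold` and `predicted` are parallel lists of booleans,
    one entry per review sentence."""
    tp = sum(g and p for g, p in zip(gold, predicted))
    fp = sum((not g) and p for g, p in zip(gold, predicted))
    fn = sum(g and (not p) for g, p in zip(gold, predicted))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Toy usage with hypothetical labels:
gold = [False, True, False, True]
pred = [False, True, True, True]
print(score_deficiency_detection(gold, pred))  # precision ~0.67, recall 1.0, f1 0.8
```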
The results show that LLMs often produce deficient and paper-unspecific reviews that lack diversity and constructive feedback, and that even state-of-the-art LLMs struggle to assess review deficiencies reliably. To quantify specificity, the paper introduces a novel metric, ITF-IDF, which measures how specific and diverse reviews are across different papers; by this measure, human-written reviews are more diverse and paper-specific than LLM-generated ones. The study also evaluates several LLMs at identifying deficient segments in human-written reviews and finds that, while some models perform reasonably well, they still miss certain deficiency types, such as inaccurate summaries and superficial reviews.

Overall, the study highlights the current limitations of LLMs in automating peer review and underscores the continued importance of human expertise in this task. The ReviewCritique dataset provides a valuable resource for future research on AI-assisted peer review and for LLM benchmarking.
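The summary above does not reproduce the ITF-IDF formula, so the following is only an IDF-flavored sketch of the intuition behind such a specificity score: wording that recurs across reviews of different papers is down-weighted, so boilerplate reviews score low and paper-specific reviews score high. The function name, tokenization, and scoring below are assumptions for illustration, not the paper's actual metric.

```python
# Loose illustration of the idea behind a cross-paper specificity score.
# NOTE: this is NOT the paper's ITF-IDF formula, just an IDF-flavored proxy.
import math
from collections import Counter

def specificity(reviews):
    """reviews: list of reviews, each given as a list of lowercase tokens.
    Returns one score per review: the mean inverse 'review frequency' of its
    tokens (tokens that are rare across reviews make a review more specific)."""
    n = len(reviews)
    review_freq = Counter()
    for tokens in reviews:
        review_freq.update(set(tokens))          # count each token once per review
    scores = []
    for tokens in reviews:
        if not tokens:
            scores.append(0.0)
            continue
        idf = [math.log(n / review_freq[t]) for t in tokens]
        scores.append(sum(idf) / len(idf))
    return scores

# Toy usage: the first two reviews share generic wording, the third is specific.
reviews = [
    "the paper is well written but lacks novelty".split(),
    "the paper is well written but lacks experiments".split(),
    "table three omits the ablation on the retrieval module".split(),
]
print(specificity(reviews))   # the third (paper-specific) review scores highest
```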