8 Jan 2024 | Mike D'Arcy, Tom Hope, Larry Birnbaum, Doug Downey
The paper "MARG: Multi-Agent Review Generation for Scientific Papers" addresses the challenge of generating actionable peer-review feedback for scientific papers using large language models (LLMs). The authors propose a method called Multi-Agent Review Generation (MARG), which involves distributing the paper text across multiple LLM instances (agents) that engage in internal discussion. This approach allows the system to consume the full text of papers beyond the input length limitations of individual LLMs and improves the helpfulness and specificity of feedback by specializing agents for different comment types (experiments, clarity, impact).
In a user study, the baseline methods using GPT-4 were rated as producing generic or very generic comments more than half the time, with only 1.7 comments per paper rated as good overall. In contrast, the MARG-S system, which uses specialized agents, generated 3.7 good comments per paper, a 2.2x improvement over the baseline. A majority (71%) of MARG-S's comments were rated as specific, whereas the baselines' comments were most often rated generic.
The authors also analyze the weaknesses of MARG-S, including high costs and internal communication errors, and suggest future directions for improvement. The contributions of the paper include a novel method for generating high-quality peer-review feedback, an evaluation of the quality of generated feedback using both automatic metrics and a user study, and a thorough analysis of the generated feedback.