27 Jun 2024 | Yuxia Wang, Jonibek Mansurov, Petar Ivanov, Jinyan Su, Artem Shelmanov, Akim Tsvigun, Osama Mohammed Afzal, Tarek Mahmoud, Giovanni Puccetti, Thomas Arnold, Alham Fikri Aji, Nizar Habash, Iryna Gurevych, Preslav Nakov
M4GT-Bench is a new benchmark for detecting machine-generated text (MGT) in a multilingual, multi-domain, and multi-generator setting. It covers three tasks: (1) binary classification of human-written vs. machine-generated text, (2) multi-way detection of the specific generator responsible for a text, and (3) detection of the boundary between human-written and machine-generated text in mixed content. The benchmark is built on a diverse corpus spanning nine languages, six domains, and nine LLM generators, including GPT-4 and LLaMA-2.
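As a rough illustration of how the three tasks differ, the record layouts below sketch one plausible way to represent their examples. All class and field names here are assumptions for this sketch, not the benchmark's actual schema; consult the repository for the real data format.

```python
from dataclasses import dataclass

# Hypothetical record layouts for the three M4GT-Bench tasks.

@dataclass
class BinaryExample:        # Task 1: human vs. machine
    text: str
    label: int              # 0 = human-written, 1 = machine-generated

@dataclass
class MultiWayExample:      # Task 2: which generator produced the text?
    text: str
    generator: str          # e.g. "human", "gpt-4", "llama-2"

@dataclass
class BoundaryExample:      # Task 3: where does machine text begin?
    text: str
    boundary: int           # word index of the human-to-machine transition
```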
The benchmark evaluates a range of MGT detection baselines as well as human performance. Results show that good detection performance typically requires access to training data from the same domain and the same generators; human annotators, meanwhile, are only marginally better than random chance at distinguishing MGT from human-written text.
The benchmark includes a multilingual dataset with new languages such as German, Italian, and Arabic, and is designed to evaluate the generalization ability of detectors across different domains and generators. For the multi-way detection task, the benchmark includes six generators and evaluates the ability to identify the specific generator responsible for the text.
For the boundary detection task, the benchmark includes mixed human-machine texts produced with different LLMs and evaluates the ability to locate the transition point between the human-written and machine-generated portions. This setup is designed to reflect real-world scenarios in which a text is partially written by a human and partially generated by a machine.
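A minimal baseline for this task, assuming one already has a per-word score that tends to be higher for machine-generated words (e.g. a token classifier's probability, itself an assumption here), is to pick the split index that best separates low-scoring from high-scoring words and to report the mean absolute error against the gold boundary:

```python
from typing import Sequence

def predict_boundary(word_scores: Sequence[float]) -> int:
    """Pick the split index that best separates low-scoring (human) words
    from high-scoring (machine) words, assuming a single human-to-machine
    transition per document."""
    n = len(word_scores)
    best_split, best_gap = 0, float("-inf")
    for k in range(1, n):
        left = sum(word_scores[:k]) / k           # mean score before split
        right = sum(word_scores[k:]) / (n - k)    # mean score after split
        if right - left > best_gap:               # machine part should score higher
            best_split, best_gap = k, right - left
    return best_split

def boundary_mae(pred: Sequence[int], gold: Sequence[int]) -> float:
    """Mean absolute error in word positions across a set of documents."""
    return sum(abs(p - g) for p, g in zip(pred, gold)) / len(pred)
```

Real detectors would fine-tune a sequence tagger rather than threshold precomputed scores, but the changepoint view captures what the task asks for: a single transition index per document.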
Performance is measured with metrics such as accuracy, precision, recall, and F1-score. Detectors based on GLTR features perform well in certain settings, while NELA features are less effective. The results also highlight how hard it is to detect MGT from unseen domains and generators, exposing the limitations of current detection methods.
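GLTR-style features (Gehrmann et al., 2019) are straightforward to reproduce: score each token with a language model, record the rank the model assigned to it, and bucket the ranks into a small histogram. The sketch below uses GPT-2 as the scoring model and four rank buckets; both choices are illustrative assumptions, not necessarily the paper's exact configuration.

```python
import numpy as np
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# GLTR-style featurizer: for each token, compute the rank the language
# model assigned to it, then bucket ranks into top-10 / top-100 /
# top-1000 / rest. GPT-2 as the scoring model is an assumption.
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

BUCKETS = (10, 100, 1000)

@torch.no_grad()
def gltr_features(text: str) -> np.ndarray:
    ids = tokenizer(text, return_tensors="pt",
                    truncation=True, max_length=512).input_ids
    logits = model(ids).logits[0, :-1]            # position t predicts token t+1
    targets = ids[0, 1:]
    # rank = 1 + number of vocabulary items the model scored higher
    target_logits = logits.gather(-1, targets.unsqueeze(-1))
    ranks = (logits > target_logits).sum(dim=-1) + 1
    hist = np.zeros(len(BUCKETS) + 1)
    for r in ranks.tolist():
        for i, b in enumerate(BUCKETS):
            if r <= b:
                hist[i] += 1
                break
        else:
            hist[-1] += 1                         # rank beyond the last bucket
    return hist / max(len(ranks), 1)              # normalized 4-bin histogram
```

The resulting per-document histograms can then be fed to a shallow classifier such as scikit-learn's LogisticRegression to obtain accuracy, precision, recall, and F1 numbers of the kind the benchmark reports.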
Overall, the benchmark provides a comprehensive evaluation of MGT detection methods and highlights the need for further research in this area. The benchmark is available at https://github.com/mbzuai-nlp/M4GT-Bench.