This paper investigates the robustness of machine-generated text (MGT) detectors against various attacks, revealing significant vulnerabilities. The study evaluates 8 popular MGT detectors under 12 realistic attack scenarios, covering editing, paraphrasing, prompting, and co-generating. The results show that almost none of the detectors remain robust against all attacks, with performance, averaged across all attacks, dropping by up to 35%. Watermarking is found to be the most robust method, followed by model-based detectors. The study also proposes initial out-of-the-box patches to improve robustness.
The paper introduces a suite of attacks under realistic scenarios, with a focus on the impact of different perturbation levels (budgets) on detector performance. Editing attacks, such as typo insertion and homoglyph alteration, significantly degrade the performance of metric-based detectors. Paraphrasing attacks, including synonym substitution and sentence-level paraphrasing, also pose challenges, with metric-based detectors showing particular weakness. Prompting attacks, such as prompt paraphrasing and in-context learning, can severely impact fine-tuned detectors. Co-generating attacks, which introduce typos or emojis during generation, also affect detector performance.
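To make the editing-attack family concrete, the sketch below shows how typo insertion and homoglyph alteration can be applied under a simple perturbation budget (the fraction of words or characters perturbed). The function names, the homoglyph map, and the budget definition are illustrative assumptions for this summary, not the paper's exact implementation.

import random

# Map a few Latin characters to visually similar Cyrillic homoglyphs (assumed set).
HOMOGLYPHS = {"a": "\u0430", "e": "\u0435", "o": "\u043e", "p": "\u0440", "c": "\u0441"}

def insert_typos(text: str, budget: float, seed: int = 0) -> str:
    """Swap adjacent characters in roughly `budget` of the words (typo insertion)."""
    rng = random.Random(seed)
    words = text.split()
    n_perturb = min(len(words), max(1, int(budget * len(words))))
    for i in rng.sample(range(len(words)), n_perturb):
        w = words[i]
        if len(w) > 3:
            j = rng.randrange(len(w) - 1)
            words[i] = w[:j] + w[j + 1] + w[j] + w[j + 2:]
    return " ".join(words)

def substitute_homoglyphs(text: str, budget: float, seed: int = 0) -> str:
    """Replace roughly `budget` of eligible characters with homoglyphs (homoglyph alteration)."""
    rng = random.Random(seed)
    chars = list(text)
    eligible = [i for i, c in enumerate(chars) if c in HOMOGLYPHS]
    n_perturb = int(budget * len(eligible))
    for i in rng.sample(eligible, min(n_perturb, len(eligible))):
        chars[i] = HOMOGLYPHS[chars[i]]
    return "".join(chars)

if __name__ == "__main__":
    sample = "Machine generated text can often be detected by statistical cues."
    print(insert_typos(sample, budget=0.2))
    print(substitute_homoglyphs(sample, budget=0.3))

In this kind of setup, raising the budget strengthens the attack but makes the perturbations more visible to a human reader, which is why budget-dependent evaluation matters for judging detector robustness.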
The study highlights the need for more robust MGT detection methods and calls for greater awareness of diverse attacks. It also emphasizes the importance of evaluating detectors under realistic scenarios and of further research into defense mechanisms. Overall, the paper provides a comprehensive analysis of the vulnerabilities of current MGT detectors and proposes initial patches to improve their robustness, underscoring the need for more resilient detection methods to counteract malicious attacks on machine-generated text.