30 Mar 2024 | Qihui Zhang, Chujie Gao, Dongping Chen, Yue Huang, Yixin Huang, Zhenyang Sun, Shilin Zhang, Weiye Li, Zhengyan Fu, Yao Wan, Lichao Sun
The paper "LLM-as-a-Coauthor: Can Mixed Human-Written and Machine-Generated Text Be Detected?" addresses the challenge of detecting mixed text, which involves both human-written (HWT) and machine-generated (MGT) content. The authors define *mixtext* as a form of mixed text and introduce MIXSET, a dataset specifically designed to study these mixtext scenarios. MIXSET includes 3.6k instances of mixtext, generated through various operations such as polishing, rewriting, and adapting. The dataset is constructed to reflect real-world usage patterns and is manually reviewed to ensure quality.
The paper evaluates existing detectors on MIXSET, covering both metric-based and model-based methods. The results show that current detectors struggle to identify mixtext, especially when the modifications are subtle or when the model adapts to a human writing style, underscoring the need for finer-grained detectors tailored to mixtext scenarios. The authors also examine the effect of retraining detectors on MIXSET and how well these detectors generalize across different mixtext subsets and LLMs. They find that current detectors generalize poorly, and that while enlarging the training set improves detection performance, adding pure HWT or MGT samples can hurt it.
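For context, a typical metric-based detector scores a passage by its average log-likelihood under a proxy language model and thresholds that score. The sketch below uses GPT-2 via Hugging Face transformers; the threshold value is an illustrative assumption, since real detectors calibrate it on held-out data.

```python
# Minimal sketch of a metric-based detector: average per-token log-likelihood
# under GPT-2, thresholded to label text as machine-generated.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def avg_log_likelihood(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt", truncation=True).input_ids
    # Passing labels makes the model return the mean per-token
    # negative log-likelihood as its loss.
    loss = model(ids, labels=ids).loss
    return -loss.item()

def is_machine_generated(text: str, threshold: float = -3.0) -> bool:
    # MGT tends to be more predictable (higher likelihood) under the proxy LM.
    # The threshold here is an illustrative assumption.
    return avg_log_likelihood(text) > threshold

print(is_machine_generated("The quick brown fox jumps over the lazy dog."))
```

Averaging over the whole passage also suggests why such detectors struggle on mixtext: a few machine-edited sentences barely move the global score.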
The paper concludes by emphasizing the importance of developing more robust and fine-grained detection methods to address the challenges posed by mixtext. It also discusses ethical considerations, such as the misuse of mixtext to evade detection, and the need for responsible use of these technologies.