Bypassing LLM Watermarks with Color-Aware Substitutions

19 Mar 2024 | Qilong Wu, Varun Chandrasekaran
This paper introduces Self Color Testing-based Substitution (SCTS), an attack that bypasses watermarking in large language models (LLMs). The authors present it as the first "color-aware" attack: it obtains color information from the watermarked LLM's own output and uses it to replace green tokens with non-green (red) ones, thereby evading detection. The key idea is to prompt the LLM to generate text, infer each token's color from how frequently it appears in that output, and then substitute green tokens with red ones. Existing color-agnostic methods, by contrast, fail to evade detection once the watermarked text is sufficiently long.
The paper evaluates SCTS under various edit distance budgets and shows that it evades watermark detection with fewer edits than previous methods. A theoretical analysis of its efficacy and costs shows that SCTS can remove watermarks from arbitrarily long text segments within a reasonable edit distance budget. In experiments comparing SCTS with existing attacks on two LLMs and two watermarking schemes, SCTS reduces the area under the receiver operating characteristic curve (AUROC), a proxy for detection success, to below 0.5. The paper also argues that a model's alignment and instruction fine-tuning belong in the threat model, since these factors influence how effective attacks are; the advantage of SCTS over existing methods is most pronounced for aligned, instruction fine-tuned models. Finally, the paper discusses the limitations of SCTS, including its efficiency and the accuracy of the color information, and suggests directions for future research.
Overall, the study provides a comprehensive analysis of the effectiveness of SCTS in bypassing LLM watermarking and highlights the importance of developing more robust watermarking techniques.
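
The color-testing idea summarized above can be illustrated with a toy simulation. This is only a sketch under stated assumptions, not the authors' implementation: the summary does not name the watermarking schemes or the prompts used, so the code assumes a green/red-list watermark with a fixed green list, a logit bias `DELTA`, and a standard z-score detector. The watermarked model is simulated by `watermarked_choice`, and all names (`_GREEN`, `looks_greener`, `substitute`, the synonym table) are hypothetical.

```python
import math
import random

GAMMA = 0.5  # assumed fraction of the vocabulary that is green
DELTA = 2.0  # assumed logit bias the watermark adds to green tokens

# Hypothetical secret green list. Only the simulated model and the
# detector read it; the attack code below never does.
_GREEN = {"big", "quick", "happy", "begin"}


def z_score(tokens):
    """Detector side: one-proportion z-test on the green-token count."""
    t = len(tokens)
    greens = sum(tok in _GREEN for tok in tokens)
    return (greens - GAMMA * t) / math.sqrt(t * GAMMA * (1 - GAMMA))


def watermarked_choice(a, b, rng):
    """Simulated watermarked model asked to output a or b at random.

    Without a watermark both would be equally likely; the green bias
    skews the choice toward whichever token is green.
    """
    wa = math.exp(DELTA) if a in _GREEN else 1.0
    wb = math.exp(DELTA) if b in _GREEN else 1.0
    return a if rng.random() < wa / (wa + wb) else b


def looks_greener(a, b, rng, n=200, margin=0.15):
    """Attacker side: frequency test using only model outputs."""
    count_a = sum(watermarked_choice(a, b, rng) == a for _ in range(n))
    return count_a / n > 0.5 + margin


def substitute(tokens, synonyms, rng, budget):
    """Swap a token for its synonym when the token tests as greener."""
    out, edits = [], 0
    for tok in tokens:
        alt = synonyms.get(tok)
        if alt is not None and edits < budget and looks_greener(tok, alt, rng):
            out.append(alt)
            edits += 1
        else:
            out.append(tok)
    return out, edits


if __name__ == "__main__":
    rng = random.Random(0)
    text = ["big", "quick", "happy", "begin"]
    synonyms = {"big": "large", "quick": "fast", "happy": "glad", "begin": "start"}
    attacked, edits = substitute(text, synonyms, rng, budget=4)
    print(z_score(text), z_score(attacked), edits)
```

In this toy setting every token in `text` is green and every synonym is red, so a successful attack moves the green count from 4 to 0. A real attack would face noisier frequency estimates and a limited set of meaning-preserving substitutes, which corresponds to the efficiency and color-accuracy limitations the summary mentions.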