Bypassing LLM Watermarks with Color-Aware Substitutions


19 Mar 2024 | Qilong Wu, Varun Chandrasekaran
The paper "Bypassing LLM Watermarks with Color-Aware Substitutions" by Qilong Wu and Varun Chandrasekaran from the University of Illinois Urbana-Champaign addresses the issue of detecting text generated by large language models (LLMs) versus human-written content. The authors propose a new attack method called Self Color Testing-based Substitution (SCTS), which is the first "color-aware" attack designed to evade watermarking techniques that bias LLMs to generate specific ("green") tokens. **Key Contributions:** 1. **Theoretical Analysis:** The paper theoretically shows that existing color-agnostic methods can only dilute watermarks and fail to evade detection for sufficiently long text segments. 2. **SCTS Method:** SCTS introduces new text fragments with fewer green tokens, determined by prompting the watermarked LLM to generate random strings and analyzing the frequency of words. This allows for color-aware substitution, effectively neutralizing the higher number of green tokens in the preserved fragments. 3. **Empirical Evaluation:** Extensive experiments on vicuna-7b-v1.5-16k and Llama-2-7b-chat-hf models demonstrate that SCTS is more effective than previous methods in reducing AUROC (a proxy for detection success) to less than 0.5 across various settings. **Background and Related Work:** - **Watermarking Approaches:** Kirchenbauer et al. (2023a) proposed a watermarking strategy that biases LLMs to generate more "green" tokens, which are then used to detect LLM-generated content. - **Attack Methods:** Previous attacks, such as paraphrasing and prompting, are color-agnostic and often fail to effectively remove watermarks, especially for long text segments. **Threat Model and Assumptions:** - The attack assumes access to the watermarked model, knowledge of the context size, and no knowledge of other watermarking hyperparameters. - The threat model focuses on realistic constraints, such as a reasonable number of edits, and does not assume access to an unwatermarked LLM. **Theoretical Results:** - The paper provides theoretical analysis showing that existing methods can only dilute watermarks and fail for sufficiently long text segments. - The probability of detection failure converges exponentially to zero as the text length increases. **Experimental Results:** - SCTS outperforms other baseline methods in reducing AUROC and maintaining semantic similarity. - The attack is effective even with a low normalized edit distance of 0.25-0.35. - The success of SCTS is linked to the ability of the model to follow instructions, with more aligned and instruction fine-tuned models performing better. **Discussion and Future Work:** - The paper discusses limitations and open questions, such as improving efficiency, increasing accuracy, and handling unknown context sizes. - The implications of the attack on misinformation and dual-use of LLMs are highlightedThe paper "Bypassing LLM Watermarks with Color-Aware Substitutions" by Qilong Wu and Varun Chandrasekaran from the University of Illinois Urbana-Champaign addresses the issue of detecting text generated by large language models (LLMs) versus human-written content. The authors propose a new attack method called Self Color Testing-based Substitution (SCTS), which is the first "color-aware" attack designed to evade watermarking techniques that bias LLMs to generate specific ("green") tokens. **Key Contributions:** 1. 
**Threat Model and Assumptions:**
- The attacker has access to the watermarked model and knows the context size, but knows no other watermarking hyperparameters.
- The threat model imposes realistic constraints, such as a bounded number of edits, and does not assume access to an unwatermarked LLM.

**Theoretical Results:**
- Existing color-agnostic methods can only dilute the watermark and fail for sufficiently long text segments.
- The probability that detection fails converges exponentially to zero as the text length increases.

**Experimental Results:**
- SCTS outperforms the baseline methods at reducing AUROC while maintaining semantic similarity.
- The attack succeeds with a modest normalized edit distance of 0.25-0.35.
- The success of SCTS is linked to the model's instruction-following ability: more aligned, instruction fine-tuned models perform better.

**Discussion and Future Work:**
- The paper discusses limitations and open questions, such as improving efficiency, increasing accuracy, and handling unknown context sizes.
- It also highlights the implications of the attack for misinformation and the dual-use nature of LLMs.
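The core of SCTS is "self color testing": because the watermark up-weights green tokens, repeatedly asking the watermarked model to choose among candidate words at a fixed context reveals which candidates are green (over-represented) and which are red. The sketch below is a heavily simplified, hypothetical rendering of that idea; `watermarked_generate`, the prompt wording, and the frequency threshold are illustrative assumptions rather than the paper's exact prompting or decision procedure.

```python
from collections import Counter
from typing import Callable


def self_color_test(watermarked_generate: Callable[[str], str], context: str,
                    candidates: list[str], num_samples: int = 200) -> dict[str, str]:
    """Ask the watermarked model to pick 'uniformly at random' among candidates.
    The watermark biases sampling toward green tokens, so candidates that appear
    noticeably more often than the uniform rate are labeled green."""
    prompt = (f"{context}\nPick one word uniformly at random from the following list "
              f"and output only that word: {', '.join(candidates)}")
    counts = Counter(watermarked_generate(prompt) for _ in range(num_samples))
    uniform_rate = num_samples / len(candidates)
    return {c: "green" if counts[c] > uniform_rate else "red" for c in candidates}


def color_aware_substitute(token: str, synonyms: list[str],
                           colors: dict[str, str]) -> str:
    """Swap a (likely green) token for a red-labeled synonym when one exists, so the
    edit actively removes green tokens instead of merely diluting them."""
    for alt in synonyms:
        if colors.get(alt) == "red":
            return alt
    return token
```

Because the green/red split depends on the preceding context, the test has to be run at the position where the substitution will actually occur, which is why the attack assumes knowledge of the context size.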
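The headline metric is the AUROC of the watermark detector over attacked LLM-generated text versus human text; values below 0.5 mean the detector's scores no longer separate the two classes better than chance. A minimal way to compute it is shown below; scikit-learn is a tooling assumption and the numbers are made up for illustration.

```python
from sklearn.metrics import roc_auc_score

# Label 1 = attacked LLM-generated text, 0 = human-written; scores are detector z-scores.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
z_scores = [0.2, -0.4, 0.6, 0.1, 0.9, 1.3, 0.5, 0.7]  # illustrative values only
print(roc_auc_score(labels, z_scores))  # < 0.5 here: detection does worse than chance
```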