2024 | Nikola Jovanović, Robin Staab, Martin Vechev
This paper presents a critical analysis of the vulnerability of large language model (LLM) watermarking schemes to a new type of attack called watermark stealing. The authors demonstrate that by querying the API of a watermarked LLM, an attacker can reverse-engineer the watermarking rules and use this knowledge to perform two downstream attacks: spoofing and scrubbing. Spoofing involves generating text that is detected as watermarked without knowing the secret key, while scrubbing involves removing the watermark from text. The authors show that these attacks can be carried out with minimal cost and high success rates, challenging the assumption that current watermarking schemes are secure.
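The intuition behind watermark stealing can be illustrated with a toy sketch. In KGW-style schemes, the "green" token subset depends on the preceding context, so tokens that are systematically over-represented after a given previous token in watermarked output (relative to ordinary text) betray the hidden rules. The function below is a deliberately simplified illustration, not the paper's actual algorithm; the function name, the word-level tokenization, and the over-representation threshold are all assumptions made for this example.

```python
from collections import Counter, defaultdict

def estimate_green_lists(watermarked_texts, baseline_counts):
    """Toy estimate of per-context 'green' tokens: a token is flagged as
    likely green after `prev` if it follows `prev` far more often in
    watermarked text than in a non-watermarked baseline corpus.
    (Hypothetical simplification of a KGW-style scheme where the green
    list is seeded by the previous token.)"""
    wm_counts = defaultdict(Counter)
    for text in watermarked_texts:
        tokens = text.split()  # word-level tokens, for illustration only
        for prev, tok in zip(tokens, tokens[1:]):
            wm_counts[prev][tok] += 1

    green = {}
    for prev, counter in wm_counts.items():
        total = sum(counter.values())
        base = baseline_counts.get(prev, Counter())
        base_total = sum(base.values()) or 1
        # Flag tokens over-represented in watermarked text relative to
        # the baseline (add-one smoothing on the baseline count).
        green[prev] = {
            tok for tok, n in counter.items()
            if n / total > 2 * (base.get(tok, 0) + 1) / base_total
        }
    return green
```

Once an attacker holds such estimated green lists, spoofing amounts to biasing generation toward the estimated green tokens, and scrubbing amounts to paraphrasing away from them.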
The paper introduces an automated watermark stealing algorithm that applies in realistic scenarios and successfully mounts both spoofing and scrubbing attacks on several state-of-the-art watermarking schemes. The authors demonstrate that for under $50 in query costs, an attacker can reliably produce texts that are detected as watermarked, with a success rate above 80%. They also show that stolen watermarking rules significantly boost scrubbing attacks, making it possible to remove watermarks from text with similarly high success rates.
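To see why a spoofed text passes detection, it helps to look at the standard KGW-style detection statistic: the detector counts green tokens and computes a z-score against the expected green fraction γ. A text biased toward (correctly guessed) green tokens drives this score well past typical detection thresholds. The sketch below assumes γ = 0.25, a common setting; the function name is chosen for this example.

```python
import math

def watermark_z_score(num_green, num_tokens, gamma=0.25):
    """z-statistic used by KGW-style detectors: how far the observed
    green-token count deviates from the expected fraction gamma under
    the null hypothesis of non-watermarked text."""
    expected = gamma * num_tokens
    std = math.sqrt(num_tokens * gamma * (1 - gamma))
    return (num_green - expected) / std
```

For example, 50 green tokens out of 100 (versus an expected 25) yields a z-score near 5.8, far above a typical detection threshold of 4, which is why even imperfectly estimated green lists suffice for spoofing.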
The study highlights the need for more robust watermarking schemes and more thorough evaluations, as current schemes may be more vulnerable than previously thought. The authors argue that watermark stealing is a first-class threat to LLM watermarking and that future work should focus on developing more secure schemes. The paper also discusses the implications of these findings for the deployment of LLM watermarks, emphasizing the importance of addressing the vulnerabilities identified in this work.