2024 | Nikola Jovanović, Robin Staab, Martin Vechev
This paper presents a critical analysis of the vulnerability of large language model (LLM) watermarking schemes to a new type of attack called watermark stealing. The authors demonstrate that by querying the API of a watermarked LLM, an attacker can reverse-engineer the watermarking rules and use this knowledge to perform two downstream attacks: spoofing and scrubbing. Spoofing involves generating text that is detected as watermarked without knowing the secret key, while scrubbing involves removing the watermark from text. The authors show that these attacks can be carried out with minimal cost and high success rates, challenging the assumption that current watermarking schemes are secure.
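The intuition behind watermark stealing can be illustrated with a toy sketch. In KGW-style schemes, the "green" token subset depends on the preceding context, so tokens that are systematically over-represented after a given previous token in watermarked output (relative to ordinary text) betray the hidden rules. The function below is a deliberately simplified illustration, not the paper's actual algorithm; the function name, the word-level tokenization, and the over-representation threshold are all assumptions made for this example.

```python
from collections import Counter, defaultdict

def estimate_green_lists(watermarked_texts, baseline_counts):
    """Toy estimate of per-context 'green' tokens: a token is flagged as
    likely green after `prev` if it follows `prev` far more often in
    watermarked text than in a non-watermarked baseline corpus.
    (Hypothetical simplification of a KGW-style scheme where the green
    list is seeded by the previous token.)"""
    wm_counts = defaultdict(Counter)
    for text in watermarked_texts:
        tokens = text.split()  # word-level tokens, for illustration only
        for prev, tok in zip(tokens, tokens[1:]):
            wm_counts[prev][tok] += 1

    green = {}
    for prev, counter in wm_counts.items():
        total = sum(counter.values())
        base = baseline_counts.get(prev, Counter())
        base_total = sum(base.values()) or 1
        # Flag tokens over-represented in watermarked text relative to
        # the baseline (add-one smoothing on the baseline count).
        green[prev] = {
            tok for tok, n in counter.items()
            if n / total > 2 * (base.get(tok, 0) + 1) / base_total
        }
    return green
```

Once an attacker holds such estimated green lists, spoofing amounts to biasing generation toward the estimated green tokens, and scrubbing amounts to paraphrasing away from them.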
The paper introduces an automated watermark stealing algorithm that applies in realistic scenarios and successfully mounts both spoofing and scrubbing attacks on several state-of-the-art watermarking schemes. The authors demonstrate that for under $50 in query costs, an attacker can reliably produce texts that are detected as watermarked, with a success rate above 80%. They also show that stolen watermarking rules significantly boost scrubbing attacks, making it possible to remove watermarks from text with similarly high success rates.
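To see why a spoofed text passes detection, it helps to look at the standard KGW-style detection statistic: the detector counts green tokens and computes a z-score against the expected green fraction γ. A text biased toward (correctly guessed) green tokens drives this score well past typical detection thresholds. The sketch below assumes γ = 0.25, a common setting; the function name is chosen for this example.

```python
import math

def watermark_z_score(num_green, num_tokens, gamma=0.25):
    """z-statistic used by KGW-style detectors: how far the observed
    green-token count deviates from the expected fraction gamma under
    the null hypothesis of non-watermarked text."""
    expected = gamma * num_tokens
    std = math.sqrt(num_tokens * gamma * (1 - gamma))
    return (num_green - expected) / std
```

For example, 50 green tokens out of 100 (versus an expected 25) yields a z-score near 5.8, far above a typical detection threshold of 4, which is why even imperfectly estimated green lists suffice for spoofing.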
The study highlights the need for more robust watermarking schemes and more thorough evaluations, as current schemes may be more vulnerable than previously thought. The authors argue that watermark stealing is a first-class threat to LLM watermarking and that future work should focus on developing more secure schemes. The paper also discusses the implications of these findings for the deployment of LLM watermarks, emphasizing the importance of addressing the vulnerabilities identified in this work.