Proving membership in LLM pretraining data via data watermarks

17 Aug 2024 | Johnny Tian-Zheng Wei*, Ryan Yixiang Wang*, Robin Jia
This paper proposes data watermarks for detecting whether a large language model (LLM) was trained on a specific dataset. A dataset owner inserts a randomly generated watermark into the data before release; after a model is trained, a hypothesis test on how strongly the model has learned the watermark yields a statistical guarantee (a p-value) of whether the model saw the watermarked data.
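The detection step can be sketched as a randomness-based test: score the true (secret) watermark under the model and compare its loss against a null distribution of losses for random watermarks the model never saw. This is a minimal sketch with simulated loss values (the numbers and the Gaussian null are illustrative assumptions, not figures from the paper):

```python
import random

def watermark_p_value(model_loss, candidate_losses):
    """One-sided p-value: fraction of random candidate watermarks that the
    model scores at least as well as (i.e., with loss no higher than) the
    true watermark. Low p-value = evidence the watermark was trained on."""
    k = sum(1 for loss in candidate_losses if loss <= model_loss)
    return (k + 1) / (len(candidate_losses) + 1)

# Toy demo with simulated losses (hypothetical values for illustration):
random.seed(0)
# Losses for 999 random watermarks the model never saw (the null distribution).
null_losses = [random.gauss(5.0, 0.3) for _ in range(999)]
# A watermark that was in the training data would be assigned much lower loss.
memorized_loss = 3.2
p = watermark_p_value(memorized_loss, null_losses)
```

In practice the losses would come from querying the suspect LLM; the statistical guarantee holds because the watermark was chosen at random, so under the null hypothesis its loss is exchangeable with the candidates'.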
Two watermark designs are studied: inserting random character sequences, and substituting characters with visually identical Unicode lookalikes. Detection strength depends on the watermark's length, the number of duplicated watermarked documents, and the level of interference, and it grows with model size and training-set size. As a real-world case study, the paper shows that SHA hashes occurring naturally in pretraining data can serve as watermarks: hashes appearing at least 90 times in the training data can be reliably detected. The results suggest that data watermarks are a promising, statistically grounded tool for proving data usage by LLMs, and motivate further research into watermark design and effectiveness.
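The Unicode-lookalike design can be illustrated with a small sketch: a secret seed deterministically decides which characters to swap for visually identical counterparts, so the owner can later reproduce the exact watermark for verification. The lookalike map below (Latin letters to confusable Cyrillic ones) and the substitution rate are illustrative assumptions, not the paper's exact configuration:

```python
import random

# Hypothetical lookalike map: ASCII letters -> visually identical Cyrillic letters.
LOOKALIKES = {"a": "\u0430", "e": "\u0435", "o": "\u043e", "p": "\u0440", "c": "\u0441"}

def substitute_watermark(text, seed, rate=0.5):
    """Randomly swap eligible characters for Unicode lookalikes.
    The seed is the dataset owner's secret: the same seed reproduces
    the same watermark, which is needed for the later hypothesis test."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        if ch in LOOKALIKES and rng.random() < rate:
            out.append(LOOKALIKES[ch])
        else:
            out.append(ch)
    return "".join(out)

# The watermarked text renders identically but differs at the byte level.
marked = substitute_watermark("a piece of prose", seed=42)
```

Because the substitutions are invisible to human readers but change the token sequence, a model trained on the watermarked copy assigns it distinctly different probabilities than an untrained model would.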