23 May 2024 | Domenic Rosati, Jan Wehner, Kai Williams, Lukasz Bartoszcze, David Atanasov, Robie Gonzales, Subhabrata Majumdar, Carsten Maple, Hassan Sajjad, Frank Rudzicz
The paper introduces Representation Noising (RepNoise), a defense against harmful fine-tuning attacks (HFAs) on large language models (LLMs). HFAs can be carried out even without openly released model weights, so closed models are vulnerable as well. RepNoise works by removing information about harmful representations, which makes that information difficult to recover during fine-tuning. The defense is designed to hold even after an attacker has gained access to the weights and the defender no longer controls the model. The authors provide theoretical and empirical evidence that RepNoise prevents HFAs while preserving the model's ability to be trained on harmless tasks. They attribute its effectiveness to "depth": the degree to which harmful information is removed across all layers of the LLM rather than only some of them. The paper also discusses the limitations of RepNoise and suggests directions for future research, including stronger attack settings and cross-domain defense.
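
The "depth" argument can be made concrete with a toy sketch. This is not the authors' implementation, and the summary does not spell out the training objective. The sketch assumes that each layer's activations are short vectors, that "noising" a layer means replacing its harmful activations with Gaussian noise, and that the squared distance between class means is a crude stand-in for how much harmful information remains readable. All names here (`noise_layers`, `max_separation`, and so on) are hypothetical.

```python
import random


def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def separation(harmful, harmless):
    """Squared distance between the two class means (toy information proxy)."""
    mean_harmful = mean_vector(harmful)
    mean_harmless = mean_vector(harmless)
    return sum((a - b) ** 2 for a, b in zip(mean_harmful, mean_harmless))


def sample(rng, center, count, dim):
    """Draw `count` toy activation vectors around `center` with unit variance."""
    return [[rng.gauss(center, 1.0) for _ in range(dim)] for _ in range(count)]


def noise_layers(harmful_by_layer, layers_to_noise, rng):
    """Replace harmful activations with Gaussian noise in the chosen layers only."""
    result = []
    for index, activations in enumerate(harmful_by_layer):
        if index in layers_to_noise:
            activations = [[rng.gauss(0.0, 1.0) for _ in vec] for vec in activations]
        result.append(activations)
    return result


def max_separation(harmful_by_layer, harmless_by_layer):
    """Largest separation found in any layer: what the 'best' probe could exploit."""
    return max(
        separation(h, s) for h, s in zip(harmful_by_layer, harmless_by_layer)
    )


rng = random.Random(0)  # fixed seed so the toy run is repeatable
NUM_LAYERS, COUNT, DIM = 3, 200, 4

# Assumption: harmless activations cluster near 0, harmful ones near 3.
harmless = [sample(rng, 0.0, COUNT, DIM) for _ in range(NUM_LAYERS)]
harmful = [sample(rng, 3.0, COUNT, DIM) for _ in range(NUM_LAYERS)]

baseline = max_separation(harmful, harmless)
# Shallow defense: only the last layer is noised.
shallow = max_separation(noise_layers(harmful, {NUM_LAYERS - 1}, rng), harmless)
# Deep defense: every layer is noised.
deep = max_separation(noise_layers(harmful, set(range(NUM_LAYERS)), rng), harmless)

print(f"baseline={baseline:.2f} shallow={shallow:.2f} deep={deep:.2f}")
```

In this toy setting, noising only the last layer leaves the earlier layers as separable as before, while noising every layer collapses the separation everywhere. That mirrors the summary's claim about depth only loosely, since real fine-tuning attacks and real LLM representations are far more complex than class means of synthetic vectors.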