1 Nov 2024 | Sophie Xhonneux, Alessandro Sordoni, Stephan Günnemann, Gauthier Gidel, Leo Schwinn
The paper addresses the challenge of adversarial training in large language models (LLMs) to enhance their robustness against discrete attacks. Current methods for adversarial training are computationally expensive due to the need for discrete attacks at each training iteration. The authors propose a continuous adversarial training (CAT) algorithm that calculates attacks in the continuous embedding space, which is significantly more efficient. CAT consists of two losses: one to make the model robust against continuous embedding attacks and another to fine-tune the model on utility data. They also introduce CAPO, an adversarial variant of identity preference optimization (IPO) that does not require utility data for adversarial alignment. Empirical evaluations on five LLMs (Gemma, Phi3, Mistral, Zephyr, Llama2) across different scales (2B, 3.8B, 7B) show that both algorithms significantly enhance robustness against discrete attacks while maintaining utility. The results demonstrate that robustness to continuous perturbations extrapolates to discrete threat models, providing a scalable approach to adversarial training for LLMs. The paper also highlights the importance of careful evaluation protocols to avoid overfitting and misleading assessments of robustness and utility.
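To make the two-loss setup concrete, here is a minimal sketch of a CAT-style training step: a projected-gradient attack in the continuous embedding space, followed by an update that combines a robustness loss on the attacked inputs with a standard fine-tuning loss on utility data. This is an illustration, not the authors' implementation; it assumes a HuggingFace-style causal LM that accepts `inputs_embeds`, and the batch fields (`refusal_labels`) and hyperparameters (`eps`, `alpha`, `num_steps`, `lambda_utility`) are hypothetical choices for the example.

```python
# Sketch of continuous adversarial training (CAT) as summarized above.
# Assumptions (not from the paper): HuggingFace-style causal LM accepting
# `inputs_embeds`; batch field names and hyperparameters are illustrative.
import torch


def embedding_attack(model, input_embeds, labels, eps=0.1, alpha=0.02, num_steps=10):
    """Projected-gradient ascent on a perturbation in the continuous embedding space."""
    delta = torch.zeros_like(input_embeds, requires_grad=True)
    for _ in range(num_steps):
        out = model(inputs_embeds=input_embeds + delta, labels=labels)
        out.loss.backward()  # gradient of the (refusal) loss w.r.t. delta
        with torch.no_grad():
            delta += alpha * delta.grad.sign()  # ascent step: push away from refusal
            delta.clamp_(-eps, eps)             # project back onto the epsilon-ball
        delta.grad.zero_()
    return delta.detach()


def cat_step(model, adv_batch, utility_batch, optimizer, lambda_utility=1.0):
    """One CAT-style update: robustness loss on attacked embeddings + utility loss."""
    embeds = model.get_input_embeddings()(adv_batch["input_ids"])
    delta = embedding_attack(model, embeds.detach(), adv_batch["refusal_labels"])

    # Loss 1: the model should still produce the safe (refusal) target under attack.
    robust_loss = model(inputs_embeds=embeds + delta,
                        labels=adv_batch["refusal_labels"]).loss
    # Loss 2: ordinary fine-tuning loss on utility data to preserve helpfulness.
    utility_loss = model(input_ids=utility_batch["input_ids"],
                         labels=utility_batch["labels"]).loss

    loss = robust_loss + lambda_utility * utility_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

CAPO, as described, replaces the explicit utility loss with an IPO-style preference objective over attacked inputs, so a corresponding sketch would swap the two cross-entropy terms for a preference loss; the embedding-space attack itself stays the same.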