4 Jul 2024 | Vyas Raina*, Adian Liusie*, Mark Gales
This paper investigates the robustness of large language models (LLMs) used as judges in zero-shot assessment tasks, focusing on their vulnerability to universal adversarial attacks. The authors show that a short universal adversarial phrase, when concatenated to the text being assessed, can deceive judge-LLMs into predicting inflated scores regardless of the text's actual quality. They propose a surrogate attack in which the adversarial phrase is first learned on a surrogate model and then transferred to unknown judge-LLMs. The results show that these universal attack phrases can substantially inflate scores on unseen models, particularly under absolute scoring. Comparative assessment proves more robust to such transferred attacks than absolute scoring, although direct attacks on the surrogate model can still yield inflated scores. The paper also explores detection methods, such as perplexity-based filtering, to identify adversarial examples. The findings highlight the need to address these vulnerabilities before judge-LLMs are deployed in high-stakes, real-world scenarios, and underscore the importance of adversarial robustness for the reliability and fairness of LLM-as-a-judge systems.
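To make the surrogate attack concrete, below is a minimal sketch of one plausible realisation: a greedy, word-by-word search for a universal phrase that maximises the surrogate judge's average score when appended to every response in a small training set. This is not the authors' released code; `score_with_surrogate`, the candidate vocabulary, and the four-word phrase length are illustrative assumptions.

```python
# Hedged sketch of a greedy universal-phrase search against a surrogate judge.
# `score_with_surrogate` is a hypothetical helper returning the surrogate
# judge-LLM's absolute quality score for a (context, response) pair.

from typing import Callable, List, Sequence, Tuple


def greedy_universal_phrase(
    dataset: Sequence[Tuple[str, str]],            # (context, candidate response) pairs
    vocab: Sequence[str],                          # candidate attack words
    score_with_surrogate: Callable[[str, str], float],
    num_words: int = 4,                            # illustrative phrase length
) -> List[str]:
    """Greedily extend the attack phrase one word at a time, keeping the word
    that most inflates the surrogate's average score across the dataset."""
    phrase: List[str] = []
    for _ in range(num_words):
        best_word, best_score = None, float("-inf")
        for word in vocab:
            candidate = " ".join(phrase + [word])
            avg = sum(
                score_with_surrogate(ctx, f"{resp} {candidate}")
                for ctx, resp in dataset
            ) / len(dataset)
            if avg > best_score:
                best_word, best_score = word, avg
        phrase.append(best_word)
    return phrase
```

Once learned, the phrase is simply concatenated to any new response before it is sent to an unseen judge-LLM, which is what makes the attack "universal" and transferable.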
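The perplexity-based detection mentioned above can be sketched as follows, assuming a reference language model (GPT-2 via Hugging Face transformers is an illustrative choice, not necessarily the one used in the paper) and a threshold tuned on held-out clean texts; responses carrying an unnatural adversarial suffix tend to have noticeably higher perplexity.

```python
# Hedged sketch: flag texts whose perplexity under a reference LM is unusually
# high, which is symptomatic of an appended adversarial phrase.

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

THRESHOLD = 80.0  # illustrative value; calibrate on clean assessment data


def perplexity(text: str) -> float:
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    # out.loss is the mean token-level negative log-likelihood
    return torch.exp(out.loss).item()


def is_suspicious(candidate_text: str) -> bool:
    return perplexity(candidate_text) > THRESHOLD
```

A threshold-based detector like this is cheap to run before judging, but its effectiveness depends on how natural-sounding the learned adversarial phrase is.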