4 Jul 2024 | Vyas Raina*, Adian Liusie*, Mark Gales
This paper investigates the robustness of zero-shot Large Language Models (LLMs) used for assessment tasks, such as evaluating written exams and benchmarking systems. The authors demonstrate that short universal adversarial phrases can be concatenated to input texts to deceive LLMs into predicting inflated scores. They propose a surrogate attack method in which an attack phrase is learned on a smaller model and then transferred to larger, unknown judge-LLMs. The study finds that LLMs are more susceptible to these attacks when used for absolute scoring than for comparative assessment. The results highlight significant vulnerabilities in LLM-as-a-judge methods and emphasize the need to address them before such systems are deployed in high-stakes, real-world scenarios. The paper also explores initial defense strategies, such as using perplexity scores to detect adversarially manipulated inputs, and suggests that comparative assessment may offer greater robustness against adversarial attacks.
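To make the surrogate attack concrete, here is a minimal sketch of a greedy universal-phrase search. It assumes a hypothetical `surrogate_score()` callable that returns the score a smaller surrogate judge-LLM assigns to a text; the function names, candidate vocabulary, and phrase length are illustrative assumptions, not the authors' exact setup.

```python
# Sketch: greedily build a short universal phrase that inflates the score a
# surrogate judge assigns, averaged over a set of training texts. The learned
# phrase is then transferred as-is to larger, unseen judge-LLMs.
# NOTE: surrogate_score, vocab, and max_words are hypothetical placeholders.

from typing import Callable, List

def learn_attack_phrase(
    train_texts: List[str],
    surrogate_score: Callable[[str], float],  # hypothetical surrogate judge
    vocab: List[str],                         # candidate attack words
    max_words: int = 4,                       # universal phrases are short
) -> str:
    phrase = ""
    for _ in range(max_words):
        best_word, best_avg = None, float("-inf")
        for word in vocab:
            candidate = f"{phrase} {word}".strip()
            # Average inflated score across all training texts, so the
            # phrase is universal rather than input-specific.
            avg = sum(
                surrogate_score(f"{t} {candidate}") for t in train_texts
            ) / len(train_texts)
            if avg > best_avg:
                best_word, best_avg = word, avg
        phrase = f"{phrase} {best_word}".strip()
    return phrase
```

The key design point is universality: because the phrase is optimized over many texts rather than one, a single short suffix can be concatenated to any response at attack time, with no access to the deployed judge-LLM.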
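The perplexity-based detection defense can also be sketched briefly: appended adversarial phrases tend to be unnatural text, so inputs whose language-model perplexity is unusually high can be flagged. The choice of GPT-2 as the scoring model and the threshold value below are illustrative assumptions, not necessarily the paper's exact configuration.

```python
# Sketch: flag adversarially manipulated inputs via language-model perplexity.
# GPT-2 and the threshold are assumed choices; a deployed detector would
# calibrate the threshold on clean, in-domain responses.

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def perplexity(text: str) -> float:
    ids = tokenizer(
        text, return_tensors="pt", truncation=True, max_length=1024
    ).input_ids
    with torch.no_grad():
        # Passing input_ids as labels yields the mean token negative
        # log-likelihood; perplexity is its exponential.
        loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()

def looks_adversarial(text: str, threshold: float = 100.0) -> bool:
    return perplexity(text) > threshold
```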