Does Refusal Training in LLMs Generalize to the Past Tense?

19 Jul 2024 | Maksym Andriushchenko, Nicolas Flammarion
The paper "Does Refusal Training in LLMs Generalize to the Past Tense?" by Maksym Andriushchenko and Nicolas Flammarion examines how well refusal training in large language models (LLMs) holds up against harmful requests. The authors find that simply reformulating a harmful request from the present tense to the past tense is often enough to bypass the refusal mechanisms of state-of-the-art LLMs. They systematically evaluate this attack on Llama-3 8B, Claude-3.5 Sonnet, GPT-3.5 Turbo, Gemma-2 9B, Phi-3-Mini, GPT-4o-mini, GPT-4o, and R2D2, using GPT-3.5 Turbo as the reformulation model. The attack's success rate increases significantly with multiple reformulation attempts, reaching 88% against GPT-4o on harmful requests from JailbreakBench.

The authors also observe that future-tense reformulations are less effective, suggesting that models treat questions about past events as more benign than hypothetical future ones. They further demonstrate that defending against past-tense reformulations is feasible by explicitly including past-tense examples in the fine-tuning data, although overrefusals must be carefully controlled. The paper highlights the limitations of current alignment techniques such as supervised fine-tuning (SFT), reinforcement learning from human feedback (RLHF), and adversarial training, which can be brittle and do not always generalize as intended. The findings underscore the need for more robust methods to ensure that LLMs refuse harmful requests across different tenses and contexts.
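To make the attack procedure concrete, here is a minimal sketch in Python, assuming the OpenAI Python SDK (openai>=1.0) and a configured API key. The prompt template, function name, and default parameters are illustrative assumptions, not the authors' exact setup; the key idea is simply to rewrite the request in the past tense with GPT-3.5 Turbo and retry several times.

```python
import openai  # assumes the OpenAI Python SDK with an API key configured

# Illustrative paraphrase of a reformulation instruction; not the paper's exact prompt.
REFORMULATE_PROMPT = (
    "Reformulate the following request into the past tense, as a question "
    "about how something was done historically:\n\n{request}"
)

def past_tense_attack(harmful_request: str, target_model: str = "gpt-4o",
                      n_attempts: int = 20) -> list[str]:
    """Try several past-tense reformulations of a request against a target model."""
    client = openai.OpenAI()
    responses = []
    for _ in range(n_attempts):
        # Step 1: use GPT-3.5 Turbo to rewrite the request in the past tense.
        rewrite = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user",
                       "content": REFORMULATE_PROMPT.format(request=harmful_request)}],
            temperature=1.0,  # sampling diversity is what makes repeated attempts pay off
        ).choices[0].message.content

        # Step 2: send the reformulated request to the target model.
        answer = client.chat.completions.create(
            model=target_model,
            messages=[{"role": "user", "content": rewrite}],
        ).choices[0].message.content
        responses.append(answer)
    return responses
```

The multi-attempt loop matters: as the paper reports, a single reformulation often fails, but success rates climb sharply as more sampled rewrites are tried.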
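The defense the authors describe, adding past-tense examples to the refusal fine-tuning data, can likewise be sketched as simple data augmentation. The function below is a hypothetical illustration, assuming a dataset of (request, refusal) pairs and a reformulation callable like the one above; it is not the paper's exact pipeline.

```python
def augment_with_past_tense(dataset, reformulate):
    """Augment refusal fine-tuning data with past-tense variants.

    dataset: list of (harmful_request, refusal) pairs.
    reformulate: callable that rewrites a request into the past tense,
    e.g. via GPT-3.5 Turbo as in the attack sketch above.
    """
    augmented = list(dataset)
    for request, refusal in dataset:
        past = reformulate(request)
        augmented.append((past, refusal))  # teach the model to refuse past-tense forms too
    # Caveat from the paper: benign past-tense questions with helpful answers should
    # also be included, or the model may start over-refusing harmless history queries.
    return augmented
```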