Does Refusal Training in LLMs Generalize to the Past Tense?

2024 | Maksym Andriushchenko, Nicolas Flammarion
This study investigates whether refusal training in large language models (LLMs) generalizes to requests phrased in the past tense. The authors find that simply reformulating a harmful request in the past tense is often enough to bypass the refusal training of state-of-the-art LLMs such as GPT-4o, with attack success rates as high as 88%. Future tense reformulations are less effective, which suggests that refusal guardrails tend to treat questions about the past as more benign than hypothetical questions about the future. The study also shows that fine-tuning GPT-3.5 Turbo on past tense examples improves its ability to refuse such requests, although over-refusal must be carefully controlled. The findings indicate that widely used alignment techniques such as SFT, RLHF, and adversarial training can be brittle and may not generalize as intended. Code and jailbreak artifacts are available at https://github.com/tml-epfl/llm-past-tense.