Stress-Testing Capability Elicitation With Password-Locked Models

Stress-Testing Capability Elicitation With Password-Locked Models

29 May 2024 | Ryan Greenblatt, Fabien Roger, Dmitrii Krasheninnikov, David Krueger
This paper introduces password-locked models, which are large language models (LLMs) fine-tuned to exhibit certain capabilities only when a specific password is present in the prompt, and otherwise imitate a weaker LLM. The goal is to evaluate the effectiveness of different methods for eliciting hidden capabilities of LLMs. The authors find that a few high-quality demonstrations are often sufficient to fully elicit password-locked capabilities. Moreover, fine-tuning can elicit other capabilities that have been locked using the same password, or even different passwords. When only evaluations are available, approaches like reinforcement learning (RL) are still often able to elicit capabilities. The study shows that fine-tuning is an effective method for eliciting hidden capabilities of current models, but may be unreliable when high-quality demonstrations are not available. The paper also discusses the limitations of password-locked models, including the fact that they may not provide a useful indication of the harms an LLM might cause with other post-training enhancements. Overall, the findings suggest that password-locked models can be a useful tool for studying the elicitation of hidden capabilities in LLMs.This paper introduces password-locked models, which are large language models (LLMs) fine-tuned to exhibit certain capabilities only when a specific password is present in the prompt, and otherwise imitate a weaker LLM. The goal is to evaluate the effectiveness of different methods for eliciting hidden capabilities of LLMs. The authors find that a few high-quality demonstrations are often sufficient to fully elicit password-locked capabilities. Moreover, fine-tuning can elicit other capabilities that have been locked using the same password, or even different passwords. When only evaluations are available, approaches like reinforcement learning (RL) are still often able to elicit capabilities. The study shows that fine-tuning is an effective method for eliciting hidden capabilities of current models, but may be unreliable when high-quality demonstrations are not available. The paper also discusses the limitations of password-locked models, including the fact that they may not provide a useful indication of the harms an LLM might cause with other post-training enhancements. Overall, the findings suggest that password-locked models can be a useful tool for studying the elicitation of hidden capabilities in LLMs.
Reach us at info@study.space
[slides and audio] Stress-Testing Capability Elicitation With Password-Locked Models