[slides and audio] Stress-Testing Capability Elicitation With Password-Locked Models

The paper "Stress-Testing Capability Elicitation With Password-Locked Models" by Ryan Greenblatt explores methods to robustly elicit hidden capabilities of large language models (LLMs). The authors introduce *password-locked models*, which are LLMs trained to exhibit specific capabilities only when a password is present in the prompt, and to imitate weaker models otherwise. This setup allows for a novel evaluation of capability elicitation methods by testing whether these password-locked capabilities can be elicited without the password. The study focuses on five tasks: code generation, math, multiple-choice question answering, code critiques, and a toy function composition task. The authors find that fine-tuning on a few high-quality demonstrations is often sufficient to fully elicit password-locked capabilities, even when these demonstrations come from a restricted domain. Additionally, fine-tuning can elicit capabilities locked with different passwords or when only low-quality demonstrations are available, provided the quality is high enough. The paper also examines the effectiveness of reinforcement learning (RL) in recovering performance on password-locked tasks, finding that it can be successful when high-quality demonstrations are not available but the model's generations can be judged for correctness. The authors conclude that fine-tuning is an effective method for eliciting hidden capabilities, but its reliability depends on the availability of high-quality demonstrations. Overall, the study provides insights into the limitations and capabilities of fine-tuning-based capability elicitation methods, highlighting the importance of high-quality demonstrations and the potential of RL in scenarios where such demonstrations are lacking.The paper "Stress-Testing Capability Elicitation With Password-Locked Models" by Ryan Greenblatt explores methods to robustly elicit hidden capabilities of large language models (LLMs). The authors introduce *password-locked models*, which are LLMs trained to exhibit specific capabilities only when a password is present in the prompt, and to imitate weaker models otherwise. This setup allows for a novel evaluation of capability elicitation methods by testing whether these password-locked capabilities can be elicited without the password. The study focuses on five tasks: code generation, math, multiple-choice question answering, code critiques, and a toy function composition task. The authors find that fine-tuning on a few high-quality demonstrations is often sufficient to fully elicit password-locked capabilities, even when these demonstrations come from a restricted domain. Additionally, fine-tuning can elicit capabilities locked with different passwords or when only low-quality demonstrations are available, provided the quality is high enough. The paper also examines the effectiveness of reinforcement learning (RL) in recovering performance on password-locked tasks, finding that it can be successful when high-quality demonstrations are not available but the model's generations can be judged for correctness. The authors conclude that fine-tuning is an effective method for eliciting hidden capabilities, but its reliability depends on the availability of high-quality demonstrations. Overall, the study provides insights into the limitations and capabilities of fine-tuning-based capability elicitation methods, highlighting the importance of high-quality demonstrations and the potential of RL in scenarios where such demonstrations are lacking.

Stress-Testing Capability Elicitation With Password-Locked Models

29 May 2024 | Ryan Greenblatt*, Fabien Roger*, Dmitrii Krasheninnikov, David Krueger

29 May 2024 | Ryan Greenblatt, Fabien Roger, Dmitrii Krasheninnikov, David Krueger