The Art of Saying No: Contextual Noncompliance in Language Models
This paper introduces a comprehensive taxonomy of contextual noncompliance for language models, expanding the scope beyond safety concerns to include incomplete, unsupported, indeterminate, and humanizing requests. The authors develop a new evaluation suite of 1,000 prompts that warrant noncompliance to test models' ability to handle such requests. They find that many existing models, including GPT-4, show high compliance rates in certain categories, particularly incomplete and unsupported requests. To address these gaps, they explore different training strategies using a synthetically generated training set of requests paired with expected noncompliant responses. Their experiments show that parameter-efficient methods such as low-rank adapters (LoRA) help balance appropriate noncompliance against preserving other capabilities.
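To make the LoRA strategy concrete, below is a minimal sketch of attaching low-rank adapters to a causal language model with the Hugging Face peft library; the base model name, target modules, and hyperparameters are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch of parameter-efficient fine-tuning with low-rank adapters (LoRA),
# in the spirit of the "continued fine-tuning with LoRA" strategy described above.
# Model name and hyperparameters are placeholders chosen for illustration.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_model_name = "meta-llama/Llama-2-7b-hf"  # placeholder base model
model = AutoModelForCausalLM.from_pretrained(base_model_name)
tokenizer = AutoTokenizer.from_pretrained(base_model_name)

# LoRA injects small trainable low-rank matrices into selected projection layers
# while the original weights stay frozen, which limits drift on general capabilities.
lora_config = LoraConfig(
    r=16,                                  # rank of the low-rank update (assumed value)
    lora_alpha=32,                         # scaling factor for the update
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # typical targets for Llama-style models
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of parameters is trainable
```

Because only the adapter weights are updated, noncompliance training of this kind is less likely to degrade the base model's general capabilities, which is the trade-off the paper highlights.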
The paper also introduces CoCoNot, a resource for both training and evaluating noncompliance. It pairs a set of queries that warrant noncompliance with a contrast set of queries that should be complied with. Evaluating a range of models on CoCoNot, the authors find that larger models and preference-tuned models show lower compliance rates on requests that should be refused. They also explore training strategies for improving noncompliance, finding that continued fine-tuning with parameter-efficient methods such as LoRA is effective, while preference tuning on the contrast data helps reduce over-refusals of benign requests.
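As a rough illustration of how such an evaluation can be scored, the sketch below computes per-category compliance rates from model responses; the record schema and the is_refusal heuristic are assumptions made for illustration (the paper relies on model-based judging rather than string matching).

```python
# Sketch of the two headline measurements: compliance rate on requests that warrant
# noncompliance (lower is better) and, on the contrast set of acceptable requests,
# the same rate read as desired compliance (higher is better).
from collections import defaultdict

def is_refusal(response: str) -> bool:
    """Toy heuristic judge; a real harness would use a stronger, model-based judge."""
    markers = ("i can't", "i cannot", "i'm unable", "i am unable", "i won't")
    return response.strip().lower().startswith(markers)

def compliance_by_category(records):
    """records: iterable of dicts with 'category' and 'response' keys (assumed schema)."""
    totals, complied = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r["category"]] += 1
        if not is_refusal(r["response"]):
            complied[r["category"]] += 1
    return {c: complied[c] / totals[c] for c in totals}

# Usage: on the noncompliance set, low per-category rates are desirable;
# on the contrast set, high rates indicate few over-refusals.
rates = compliance_by_category([
    {"category": "incomplete", "response": "Sure, here is the answer..."},
    {"category": "unsupported", "response": "I can't access real-time data, but..."},
])
print(rates)
```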
The paper highlights the importance of noncompliance in chat-based language models, emphasizing the need for models to handle a wide range of requests beyond safety concerns. It also discusses the ethical considerations of training models for noncompliance, noting that while training can mitigate many risks, it cannot guarantee complete safety. The authors conclude that much future research remains to be done on improving user experience and increasing user trust in language models.