Aligner: Efficient Alignment by Learning to Correct

24 Jun 2024 | Jiaming Ji, Boyuan Chen, Hantao Lou, Donghai Hong, Borong Zhang, Xuehai Pan, Juntao Dai, Tianyi Qiu, Yaodong Yang
Aligner is an efficient alignment method that learns to correct answers: a small model is trained on the correctional residuals between preferred and dispreferred responses and is then stacked on an upstream model to rewrite its outputs. Because it is model-agnostic and plug-and-play, a single trained Aligner can be applied to a range of open-source and API-based models with minimal training and without access to their parameters, and its corrected responses can also serve as synthetic data to iteratively improve the upstream model itself.

Experiments show that Aligner substantially improves the 3H metrics (helpfulness, harmlessness, honesty) across multiple models: Aligner-7B achieves a 68.9% improvement in helpfulness and 23.8% in harmlessness, reduces hallucination, and enhances the performance of strong models such as GPT-4. Training an Aligner requires far fewer resources than methods such as DPO and RLHF, and the module is interpretable, since the residual correction patterns it learns can be analyzed across its layers.

Aligner can also be used in multi-round reinforcement learning from human feedback (RLHF) pipelines, where it improves model performance and reduces reward collapse, and it is effective for aligning models with human values and safety standards. The accompanying Aligner dataset, released under a CC BY-NC 4.0 license, includes revised answers that meet the 3H standards. Overall, Aligner is a promising, resource-efficient approach for aligning large language models with human intentions and values.
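To make the plug-and-play usage concrete, the sketch below shows how an Aligner-style corrector could sit on top of an upstream model at inference time: the upstream model drafts an answer, and the corrector conditions on the query and the draft to produce a revised answer. This is a minimal sketch using Hugging Face Transformers; the model identifiers and the correction prompt template are illustrative assumptions, not the authors' released artifacts.

```python
# Minimal sketch of an Aligner-style corrector stacked on a frozen upstream model.
# UPSTREAM_ID, ALIGNER_ID, and the prompt format below are placeholders/assumptions;
# substitute the actual checkpoints and the template used by the released Aligner.
from transformers import AutoModelForCausalLM, AutoTokenizer

UPSTREAM_ID = "meta-llama/Llama-2-7b-chat-hf"   # assumed upstream chat model
ALIGNER_ID = "path/to/aligner-7b"               # assumed Aligner checkpoint


def generate(model, tokenizer, prompt, max_new_tokens=512):
    """Greedy decoding helper; decoding settings here are arbitrary."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Drop the prompt tokens and return only the newly generated continuation.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)


def corrected_answer(query: str) -> str:
    upstream_tok = AutoTokenizer.from_pretrained(UPSTREAM_ID)
    upstream = AutoModelForCausalLM.from_pretrained(UPSTREAM_ID, device_map="auto")
    aligner_tok = AutoTokenizer.from_pretrained(ALIGNER_ID)
    aligner = AutoModelForCausalLM.from_pretrained(ALIGNER_ID, device_map="auto")

    # 1. Draft answer from the (frozen, possibly API-based) upstream model.
    draft = generate(upstream, upstream_tok, query)

    # 2. The corrector conditions on (query, draft) and emits a revised answer.
    #    The exact wording of this instruction is an assumption for illustration.
    correction_prompt = (
        "Edit the following Question-Answer pair to make it more helpful "
        f"and harmless:\nQuestion: {query}\nAnswer: {draft}\nCorrected answer:"
    )
    return generate(aligner, aligner_tok, correction_prompt)
```

In this arrangement the upstream model never needs to be modified or even locally hosted; only its text output is consumed, which is what makes the approach applicable to API-based models.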