Improving the Robustness of Large Language Models via Consistency Alignment

22 Mar 2024 | Yukun Zhao, Lingyong Yan, Weiwei Sun, Guoliang Xing, Chong Meng, Shuaiqiang Wang, Zhicong Cheng, Zhaochun Ren, Dawei Yin
This paper proposes a two-stage training framework that improves the robustness of large language models (LLMs) by addressing the inconsistency of their responses when instructions are paraphrased. The framework consists of instruction-augmented supervised fine-tuning (SFT) and consistency alignment training (CAT). In the first stage, the model is fine-tuned on paraphrased instructions to strengthen its instruction-following ability. In the second stage, the model learns to distinguish subtle differences between similar responses and to align toward the better ones, using self-rewards inferred from the model itself rather than external human preference data.

The framework is applied to publicly available LLMs, including Vicuna-7B, Vicuna-13B, Llama2-7B, and Llama2-13B, on instruction-following tasks. Robustness is measured with the consistency rate (CR) and maximum consistency rate (MCR), and response quality with ROUGE scores. The experiments show that the proposed framework significantly improves the robustness and consistency of LLM responses, outperforming standard SFT and other baselines, with the largest gains on Vicuna-13B. The study also highlights the importance of choosing an appropriate base model for further training and shows that the method produces better-aligned and more accurate responses. The paper concludes that the proposed training framework is a promising approach for improving the robustness of LLMs.
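As a rough illustration of the second stage, the sketch below assembles (chosen, rejected) response pairs from a self-computed reward. Here the reward of a sampled response is assumed to be its average similarity to the other samples for the same instruction; this scoring rule, the `margin` parameter, and the token-overlap similarity are illustrative assumptions, not the paper's exact procedure.

```python
# Hypothetical sketch of self-rewarded preference-pair construction for a
# consistency-alignment stage: each response sampled for an instruction is
# scored by its average similarity to the other samples, and pairs whose
# reward gap exceeds a margin become (chosen, rejected) training examples.
from statistics import mean


def self_reward(response, others, similarity):
    """Score a response by its average similarity to the other samples."""
    return mean(similarity(response, o) for o in others) if others else 0.0


def build_preference_pairs(instruction, responses, similarity, margin=0.05):
    """Return (instruction, chosen, rejected) triples whose reward gap is at least `margin`."""
    scored = [
        (r, self_reward(r, [o for o in responses if o is not r], similarity))
        for r in responses
    ]
    pairs = []
    for chosen, reward_c in scored:
        for rejected, reward_r in scored:
            if reward_c - reward_r >= margin:
                pairs.append((instruction, chosen, rejected))
    return pairs


# Toy demo with a token-overlap (Jaccard) similarity as a stand-in scorer.
def jaccard(a, b):
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


samples = [
    "Paris is the capital of France.",
    "The capital of France is Paris.",
    "France's capital city is Marseille.",
]
for _, chosen, rejected in build_preference_pairs(
    "Name the capital of France.", samples, jaccard
):
    print("chosen:", chosen, "| rejected:", rejected)
```

In this toy run, the two mutually consistent responses receive higher self-rewards than the outlier, so the outlier ends up as the rejected member of each pair.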
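The consistency rate can be pictured as the fraction of response pairs, among responses to paraphrases of the same instruction, that are judged consistent with each other. The sketch below uses ROUGE-L F1 with a fixed threshold as the consistency judgment; this is an assumption for illustration and not necessarily the paper's exact criterion.

```python
# Minimal sketch of a consistency-rate (CR) style metric: given several
# responses produced for paraphrases of one instruction, count the fraction
# of response pairs whose similarity meets a threshold.
from itertools import combinations


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [0] * (len(b) + 1)
    for x in a:
        prev = 0
        for j, y in enumerate(b, start=1):
            cur = dp[j]
            dp[j] = prev + 1 if x == y else max(dp[j], dp[j - 1])
            prev = cur
    return dp[-1]


def rouge_l_f1(hyp, ref):
    """ROUGE-L F1 between two whitespace-tokenized strings."""
    h, r = hyp.split(), ref.split()
    if not h or not r:
        return 0.0
    lcs = lcs_length(h, r)
    precision, recall = lcs / len(h), lcs / len(r)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def consistency_rate(responses, threshold=0.5):
    """Fraction of response pairs with similarity at or above the threshold."""
    pairs = list(combinations(responses, 2))
    if not pairs:
        return 1.0
    consistent = sum(rouge_l_f1(a, b) >= threshold for a, b in pairs)
    return consistent / len(pairs)


# Responses a model might give to three paraphrases of one instruction.
responses = [
    "The capital of France is Paris.",
    "The capital city of France is Paris.",
    "I am not sure, but it might be Lyon or Marseille.",
]
print(f"CR over {len(responses)} responses: {consistency_rate(responses):.2f}")
```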