Improving the Robustness of Large Language Models via Consistency Alignment

22 Mar 2024 | Yukun Zhao, Lingyong Yan, Weiwei Sun, Guoliang Xing, Chong Meng, Shuaiqiang Wang, Zhicong Cheng, Zhaochun Ren, Dawei Yin
This paper addresses the inconsistency of responses generated by large language models (LLMs) when verbalized instructions are changed only slightly. The authors propose a two-stage training framework to improve the robustness of LLMs in following instructions. The first stage is instruction-augmented supervised fine-tuning, in which similar instructions are generated so the model generalizes better across instruction wordings. The second stage is consistency alignment training, which teaches the model to differentiate subtle differences between similar responses and to generate outputs that align more closely with human expectations. The training process uses self-rewards inferred from the trained model itself, without relying on external human preference resources. Extensive experiments on publicly available LLMs, including Vicuna-7B, Vicuna-13B, Llama2-7B, and Llama2-13B, demonstrate that the proposed framework improves both the robustness and the accuracy of LLMs on instruction-following tasks. Human evaluations further validate the improvements in response quality and consistency.
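The summary describes the two stages only at a high level. The sketch below shows one way the pieces could fit together; it is not the authors' implementation. All names (paraphrase_instruction, self_reward, stage1_instruction_augmented_sft, stage2_consistency_alignment) are illustrative assumptions, the self-reward here is a random placeholder, and the commented-out fine_tune/align calls stand in for real SFT and preference-style alignment updates.

```python
# Minimal sketch of the two-stage consistency-alignment recipe described above.
# All model-related calls are hypothetical stubs; a real pipeline would back
# them with an actual LLM (e.g., Vicuna or Llama2) and a training library.

import random
from dataclasses import dataclass


@dataclass
class Example:
    instruction: str
    response: str


def paraphrase_instruction(instruction: str, n: int = 3) -> list[str]:
    """Hypothetical: produce n similar verbalizations of one instruction
    (in the paper this kind of augmentation is done with an LLM)."""
    return [f"{instruction} (variant {i})" for i in range(n)]


def stage1_instruction_augmented_sft(data: list[Example]) -> list[Example]:
    """Stage 1: pair each target response with several similar instructions,
    then fine-tune on the augmented set."""
    augmented = []
    for ex in data:
        for variant in [ex.instruction, *paraphrase_instruction(ex.instruction)]:
            augmented.append(Example(variant, ex.response))
    # fine_tune(model, augmented)  # standard supervised fine-tuning step
    return augmented


def self_reward(instruction: str, response: str) -> float:
    """Hypothetical self-reward: the trained model itself scores how well a
    response follows the instruction (no external human preference data)."""
    return random.random()  # placeholder score in [0, 1]


def stage2_consistency_alignment(model_generate, instructions: list[str]):
    """Stage 2: for groups of similar instructions, sample responses, score
    them with self-rewards, and keep (preferred, dispreferred) pairs."""
    pairs = []
    for instr in instructions:
        candidates = [model_generate(v) for v in paraphrase_instruction(instr)]
        ranked = sorted(candidates, key=lambda r: self_reward(instr, r), reverse=True)
        pairs.append((instr, ranked[0], ranked[-1]))  # best vs. worst response
    # align(model, pairs)  # preference-style alignment update
    return pairs


if __name__ == "__main__":
    data = [Example("Summarize the article in one sentence.", "The article says ...")]
    augmented = stage1_instruction_augmented_sft(data)
    pairs = stage2_consistency_alignment(lambda p: f"response to: {p}",
                                         [d.instruction for d in data])
    print(len(augmented), "augmented examples;", len(pairs), "preference pairs")
```

In an actual training run, the stubbed scoring and the commented-out update steps would be replaced with the model's own reward inference and gradient-based fine-tuning, which is the part of the method the experiments on Vicuna and Llama2 evaluate.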