29 Feb 2024 | Chen Qian, Jie Zhang, Wei Yao, Dongrui Liu, Zhenfei Yin, Yu Qiao, Yong Liu, Jing Shao
This paper investigates the trustworthiness dynamics of large language models (LLMs) during pre-training, focusing on five key dimensions: reliability, privacy, toxicity, fairness, and robustness. The study is an early exploration of LLM trustworthiness during the pre-training period. Applying linear probing, the researchers find that even early pre-training checkpoints exhibit linearly separable patterns for trustworthiness concepts. They further extract steering vectors from pre-training checkpoints to enhance LLM trustworthiness, and use mutual information estimation to trace trustworthiness dynamics across pre-training, observing a two-phase trend of fitting followed by compression, similar to findings in traditional DNNs.
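As a rough illustration of the linear-probing setup, the sketch below fits a logistic-regression probe on last-token activations from an intermediate checkpoint to separate trustworthy from untrustworthy statements. The checkpoint name, probed layer, and `load_labeled_statements` loader are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal linear-probing sketch (checkpoint, layer, and data loader are placeholders).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

CKPT = "LLM360/Amber"   # assumption: any intermediate pre-training checkpoint
LAYER = 16              # assumption: probe a middle layer

tok = AutoTokenizer.from_pretrained(CKPT)
model = AutoModelForCausalLM.from_pretrained(CKPT, output_hidden_states=True)
model.eval()

def last_token_activation(text: str) -> torch.Tensor:
    """Hidden state of the final token at the chosen layer."""
    inputs = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    return out.hidden_states[LAYER][0, -1]  # shape: (hidden_dim,)

# texts: statements; labels: 1 = trustworthy (e.g., truthful), 0 = not.
texts, labels = load_labeled_statements()   # hypothetical loader
X = torch.stack([last_token_activation(t) for t in texts]).float().numpy()
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.2, random_state=0)

# A linear probe: if it classifies held-out activations well, the concept is
# (approximately) linearly separable in that checkpoint's latent space.
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("probe accuracy:", probe.score(X_te, y_te))
```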
The study also explores activation intervention techniques for improving LLM trustworthiness. By extracting steering vectors from pre-training checkpoints, the researchers show that these vectors can substantially enhance the trustworthiness of the supervised fine-tuned (SFT) model, AmberChat, across several dimensions. Steering vectors derived from pre-training checkpoints achieve performance comparable to, or better than, those derived directly from the SFT model itself. However, the study also highlights trade-offs between trustworthiness dimensions, for example between truthfulness and fairness.
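A minimal sketch of this kind of intervention is shown below: the steering vector is taken as the mean activation difference between contrasting prompt sets at one layer, then added to the residual stream of the target model through a forward hook. The layer index, scaling factor, helper names, and LLaMA-style module path (`model.model.layers`) are assumptions for illustration, not the paper's exact procedure.

```python
# Sketch of activation intervention with a steering vector (illustrative settings).
import torch

@torch.no_grad()
def steering_vector(model, tok, pos_texts, neg_texts, layer):
    """Mean last-token activation difference between contrasting prompt sets."""
    def mean_act(texts):
        acts = []
        for t in texts:
            out = model(**tok(t, return_tensors="pt"), output_hidden_states=True)
            acts.append(out.hidden_states[layer][0, -1])
        return torch.stack(acts).mean(dim=0)
    return mean_act(pos_texts) - mean_act(neg_texts)

def add_steering_hook(model, layer, vec, scale=4.0):
    """Add the scaled steering vector to one decoder layer's output (LLaMA-style)."""
    def hook(module, inputs, output):
        if isinstance(output, tuple):
            hidden = output[0] + scale * vec.to(output[0].dtype)
            return (hidden,) + output[1:]
        return output + scale * vec.to(output.dtype)
    return model.model.layers[layer].register_forward_hook(hook)

# Usage idea: extract the vector from a pre-training checkpoint, then steer the SFT model.
# vec = steering_vector(ckpt_model, tok, truthful_prompts, untruthful_prompts, layer=16)
# handle = add_steering_hook(sft_model, layer=16, vec=vec)
# ... generate with sft_model ...
# handle.remove()
```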
The research provides insight into the dynamics of LLM pre-training, showing that trustworthiness concepts are linearly represented in the latent space of LLMs, consistent with the linear representation hypothesis and related empirical studies. The findings suggest that pre-training checkpoints hold significant untapped potential for enhancing LLM trustworthiness. The study further shows that its mutual information estimate is bounded in terms of linear probing accuracy, and the estimated dynamics reveal a phase transition from fitting to compression during pre-training.
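One standard way such a bound can arise (an illustrative reconstruction, not necessarily the paper's exact proposition) is Fano's inequality, which converts a probe's error rate into a lower bound on the mutual information between a representation and the label:

```latex
% Fano's inequality: probe error P_e bounds the conditional entropy of label Y given representation T
H(Y \mid T) \;\le\; H_b(P_e) + P_e \log\bigl(\lvert\mathcal{Y}\rvert - 1\bigr)
% hence mutual information is lower-bounded in terms of probing accuracy 1 - P_e
I(T; Y) \;=\; H(Y) - H(Y \mid T) \;\ge\; H(Y) - H_b(P_e) - P_e \log\bigl(\lvert\mathcal{Y}\rvert - 1\bigr)
```

Here H_b is the binary entropy and |Y| the number of classes; as probing accuracy rises (P_e falls), the lower bound on I(T; Y) approaches H(Y), which is why probing results and mutual information curves move together.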
The paper concludes that the pre-training period of LLMs is crucial for understanding and improving their trustworthiness. The research provides a foundation for further exploration into the dynamics of LLM pre-training and offers new insights into enhancing LLM trustworthiness. The code and datasets used in this study are publicly available for further research.