Towards Tracing Trustworthiness Dynamics: Revisiting Pre-training Period of Large Language Models

29 Feb 2024 | Chen Qian, Jie Zhang, Wei Yao, Dongrui Liu, Zhenfei Yin, Yu Qiao, Yong Liu, Jing Shao
This paper explores the trustworthiness dynamics of large language models (LLMs) during their pre-training period, focusing on five key dimensions: reliability, privacy, toxicity, fairness, and robustness. The authors use linear probing to analyze the representations of LLMs at 360 pre-training checkpoints, finding that early pre-training already yields high probing accuracy, indicating that LLMs can distinguish concepts in each trustworthiness dimension. They then extract steering vectors from these checkpoints to enhance the LLMs' trustworthiness through activation intervention techniques. The results show that steering vectors derived from pre-training checkpoints can significantly improve the LLMs' performance in various trustworthiness dimensions, sometimes even outperforming those from fully fine-tuned models.

Additionally, the authors investigate the dynamics of mutual information during pre-training, observing a two-phase phenomenon: fitting and compression, similar to traditional DNNs. This research provides insights into leveraging pre-training checkpoints to enhance LLM trustworthiness and offers a new perspective on activation engineering.
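The two techniques the abstract names can be sketched in a few lines. Below is a minimal, hedged illustration: a linear probe fit on hidden representations, and a mean-difference steering vector applied as an activation intervention. The synthetic activations, dimensions, and steering strength are all assumptions for demonstration; they stand in for real LLM hidden states at a pre-training checkpoint and are not the paper's actual setup.

```python
# Hedged sketch: linear probing and mean-difference steering vectors.
# Synthetic Gaussian "activations" stand in for LLM hidden states (assumption);
# the probe and steering recipe follow common activation-engineering practice,
# not necessarily the paper's exact implementation.
import numpy as np

rng = np.random.default_rng(0)
d = 64  # hidden size (arbitrary choice for the demo)

# Ground-truth concept direction along which the two classes differ,
# mimicking a linearly separable trustworthiness concept.
concept = rng.normal(size=d)
concept /= np.linalg.norm(concept)
pos = rng.normal(size=(200, d)) + 2.0 * concept  # "trustworthy" examples
neg = rng.normal(size=(200, d)) - 2.0 * concept  # "untrustworthy" examples

# Linear probe: a least-squares linear classifier on the representations.
X = np.vstack([pos, neg])
y = np.concatenate([np.ones(200), -np.ones(200)])
w, *_ = np.linalg.lstsq(X, y, rcond=None)
acc = np.mean(np.sign(X @ w) == y)

# Steering vector: difference of class-mean activations (one common recipe).
steer = pos.mean(axis=0) - neg.mean(axis=0)

# Activation intervention: shift a hidden state toward the trustworthy side.
h = neg[0]
h_steered = h + 1.0 * steer  # strength 1.0 is an arbitrary choice

print(f"probe accuracy: {acc:.2f}")
print(f"projection onto concept before: {h @ concept:.2f}, "
      f"after: {h_steered @ concept:.2f}")
```

When the concept is linearly represented, the probe reaches high accuracy even on noisy activations, and adding the steering vector moves an activation's projection onto the concept direction toward the positive class, which is the intuition behind intervening with checkpoint-derived steering vectors.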