Entropy Law: The Story Behind Data Compression and LLM Performance


11 Jul 2024 | Mingjia Yin†, Chuhan Wu†‡, Yufei Wang†, Hao Wang†‡, Wei Guo, Yasheng Wang, Yong Liu, Ruiming Tang†, Defu Lian, Enhong Chen†
This paper explores the relationship between large language model (LLM) performance and data selection, proposing an "entropy law" that connects LLM performance with the data compression ratio and the first-epoch training loss. The authors argue that while the quality of individual samples is crucial, the combinatorial effects among samples are often overlooked. They introduce ZIP, an efficient and universal data selection method that prioritizes data subsets with low compression ratios, aiming to maximize the amount of effective information available for LLM learning. Through theoretical analysis and empirical experiments, they show that model performance is negatively correlated with the compression ratio of the training data, and that a higher compression ratio typically yields a lower training loss. ZIP is evaluated across different LLM backbones and alignment stages and outperforms various quality-based baselines. The paper also applies the entropy law to detect potential performance risks at the beginning of model training, reducing computational overhead in LLM development.
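
To make the idea behind compression-ratio-based selection concrete, below is a minimal, hypothetical Python sketch. It estimates a subset's compression ratio with zlib and greedily adds the candidate sample that keeps that ratio lowest. The function names compression_ratio and zip_style_select are illustrative assumptions, not the paper's code, and the actual ZIP algorithm uses a more efficient multi-stage greedy strategy rather than this brute-force loop.

```python
import zlib


def compression_ratio(texts):
    """Compression ratio of a set of texts: compressed bytes / raw bytes.

    Lower values indicate less redundancy, i.e. more effective information.
    """
    raw = "\n".join(texts).encode("utf-8")
    if not raw:
        return 1.0
    return len(zlib.compress(raw, level=9)) / len(raw)


def zip_style_select(candidates, budget):
    """Greedily pick samples whose addition keeps the selected subset's
    compression ratio as low as possible (simplified single-stage variant)."""
    selected = []
    pool = list(candidates)
    while pool and len(selected) < budget:
        best_idx, best_ratio = None, float("inf")
        for i, sample in enumerate(pool):
            ratio = compression_ratio(selected + [sample])
            if ratio < best_ratio:
                best_idx, best_ratio = i, ratio
        selected.append(pool.pop(best_idx))
    return selected


if __name__ == "__main__":
    corpus = [
        "How do I sort a list in Python?",
        "How do I sort a list in Python??",  # near-duplicate, highly compressible
        "Explain the difference between TCP and UDP.",
        "Write a haiku about autumn leaves.",
    ]
    # Selecting 2 samples tends to skip the near-duplicate question.
    print(zip_style_select(corpus, budget=2))
```

This sketch is quadratic in the candidate pool size because it recompresses the subset for every candidate; the paper's multi-stage design exists precisely to avoid that cost at scale.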