This paper explores the relationship between data compression and large language model (LLM) performance, proposing an "entropy law" that connects LLM performance with the compression ratio of the training data and the first-epoch training loss. The study highlights that combining individually high-quality data samples does not always yield optimal results, because the samples may be homogeneous or mutually contradictory. The entropy law states that model performance is negatively correlated with the compression ratio of the training data, and that a lower compression ratio usually corresponds to a lower training loss. Building on this law, the authors propose ZIP, an efficient and universal data selection method for training LLMs. ZIP prioritizes data subsets with low compression ratios, aiming to maximize the effective information content available for LLM learning. It uses a multi-stage greedy algorithm to select diverse samples, balancing data diversity against information density. Extensive experiments validate the entropy law and the effectiveness of ZIP across different LLM backbones and alignment stages. The study also shows how the entropy law can be applied to detect potential performance risks during model training. The results demonstrate that ZIP outperforms existing data selection methods in both efficiency and effectiveness. Overall, the entropy law provides a theoretical foundation for understanding the relationship between data compression and LLM performance, offering a new perspective on data selection for LLM development.
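Since the summary describes ZIP only at a high level, the following is a minimal sketch of what a compression-ratio-guided, multi-stage greedy selection could look like. The zlib-based ratio proxy, the stage sizes (`k1`, `k2`, `batch`), and the function names are illustrative assumptions for this sketch, not the paper's reference implementation.

```python
import zlib

def compression_ratio(texts):
    """Compressed bytes / raw bytes for a pool of samples; lower means denser information."""
    raw = "\n".join(texts).encode("utf-8")
    return len(zlib.compress(raw)) / max(len(raw), 1)

def zip_style_selection(corpus, budget, k1=1000, k2=200, batch=50):
    """Greedy, compression-guided selection loop (illustrative stage sizes)."""
    selected = []
    remaining = list(corpus)
    while remaining and len(selected) < budget:
        # Stage 1 (global): shortlist k1 samples with the lowest individual compression ratio.
        stage1 = sorted(remaining, key=lambda s: compression_ratio([s]))[:k1]
        # Stage 2 (coarse): keep the k2 shortlisted samples whose addition keeps
        # the selected pool's compression ratio lowest.
        stage2 = sorted(stage1, key=lambda s: compression_ratio(selected + [s]))[:k2]
        # Stage 3 (fine): greedily commit a small batch, re-evaluating the pool each step.
        for _ in range(min(batch, budget - len(selected), len(stage2))):
            best = min(stage2, key=lambda s: compression_ratio(selected + [s]))
            selected.append(best)
            stage2.remove(best)
            remaining.remove(best)
    return selected
```

Using an off-the-shelf compressor as the information-density proxy keeps the selection model-agnostic and cheap relative to LLM-scored alternatives, which matches the efficiency claim in the summary; the staged shortlisting merely limits how many marginal-ratio evaluations the greedy loop performs per step.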