The paper "Can Perplexity Reflect Large Language Model’s Ability in Long Text Understanding?" by Yutong Hu, Quzhe Huang, Mingxu Tao, Chen Zhang, and Yansong Feng explores the relationship between perplexity (PPL) and the long-text understanding ability of Large Language Models (LLMs). The authors find that PPL, a common metric for evaluating LLMs' performance in language modeling, does not correlate with their ability to understand long texts. They argue that PPL primarily reflects the model's ability to model local information rather than capturing long-range dependencies. This conclusion is supported by experiments on three LLM variants, where models with lower PPL in language modeling did not perform well on downstream tasks such as question answering and document summarization. The authors also demonstrate that models like LLaMA2, which can only handle short context windows, can achieve low PPL by focusing on local information, further validating their hypothesis. The paper concludes that PPL is an effective metric for language modeling but should not be used to assess long-text understanding capabilities, and calls for more diverse evaluation metrics to better evaluate long-text processing abilities.The paper "Can Perplexity Reflect Large Language Model’s Ability in Long Text Understanding?" by Yutong Hu, Quzhe Huang, Mingxu Tao, Chen Zhang, and Yansong Feng explores the relationship between perplexity (PPL) and the long-text understanding ability of Large Language Models (LLMs). The authors find that PPL, a common metric for evaluating LLMs' performance in language modeling, does not correlate with their ability to understand long texts. They argue that PPL primarily reflects the model's ability to model local information rather than capturing long-range dependencies. This conclusion is supported by experiments on three LLM variants, where models with lower PPL in language modeling did not perform well on downstream tasks such as question answering and document summarization. The authors also demonstrate that models like LLaMA2, which can only handle short context windows, can achieve low PPL by focusing on local information, further validating their hypothesis. The paper concludes that PPL is an effective metric for language modeling but should not be used to assess long-text understanding capabilities, and calls for more diverse evaluation metrics to better evaluate long-text processing abilities.