CAN PERPLEXITY REFLECT LARGE LANGUAGE MODEL'S ABILITY IN LONG TEXT UNDERSTANDING?

2024 | Yutong Hu, Quzhe Huang, Mingxu Tao, Chen Zhang, Yansong Feng
Can perplexity (PPL) reflect a large language model's (LLM) ability to understand long text? This study investigates whether PPL, the most common metric for evaluating language modeling, accurately reflects an LLM's long-text understanding ability. The central finding is that PPL may reflect only a model's ability to model local information, not its ability to understand long text.

Experiments on three LLM variants with long context windows show that models with lower PPL do not necessarily perform better on downstream tasks such as question answering and summarization. For example, YaRN, which achieves the lowest PPL, does not outperform LongLoRA on downstream tasks, suggesting that PPL is not a good indicator of long-text understanding ability. Conversely, LLaMA2, which has a short context window, can still achieve low PPL on long text, showing that a model without long-text understanding ability can nonetheless score well on this metric.

This local-information view also explains why position embedding methods such as ALiBi enable models to extrapolate to inference sequences longer than those seen in training while keeping PPL low: ALiBi biases attention toward nearby tokens, and modeling local context well is exactly what PPL rewards.

The authors conclude that PPL can be an effective evaluation metric for long-text language modeling, but not for long-text understanding, and that more diversified evaluation metrics are therefore needed to assess long-text processing ability.
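For context, PPL over a token sequence x_1, ..., x_N is exp(-(1/N) Σ_i log p(x_i | x_{<i})), the exponentiated average negative log-likelihood per token. Below is a minimal, hypothetical sketch of the standard sliding-window procedure commonly used to score texts longer than a model's context window; it is not the paper's code, and the model name, window size, and stride are illustrative assumptions. It makes the paper's point concrete: every token is conditioned on at most one window of preceding tokens, so a model can earn a low PPL from purely local context.

```python
# Minimal sliding-window perplexity sketch (illustrative; not the paper's code).
# Assumptions: Hugging Face transformers + torch; "gpt2" is a hypothetical
# stand-in for any causal LM. PPL = exp(mean negative log-likelihood per token).
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # hypothetical stand-in for the evaluated long-context LLMs
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def sliding_window_ppl(text: str, window: int = 1024, stride: int = 512) -> float:
    """Score a long text in overlapping windows of at most `window` tokens.

    Each token is predicted from at most `window - 1` preceding tokens, which
    is why PPL chiefly measures *local* modeling ability: context beyond the
    window can never influence the prediction.
    """
    ids = tokenizer(text, return_tensors="pt").input_ids
    seq_len = ids.size(1)
    nll_sum, n_scored, prev_end = 0.0, 0, 0
    for begin in range(0, seq_len, stride):
        end = min(begin + window, seq_len)
        trg_len = end - prev_end               # tokens not yet scored
        chunk = ids[:, begin:end]
        labels = chunk.clone()
        labels[:, :-trg_len] = -100            # -100 labels are ignored by the loss
        with torch.no_grad():
            loss = model(chunk, labels=labels).loss  # mean NLL over scored tokens
        nll_sum += loss.item() * trg_len       # approximate token-weighted sum
        n_scored += trg_len
        prev_end = end
        if end == seq_len:
            break
    return math.exp(nll_sum / n_scored)
```

Nothing in this loop rewards using information beyond the window, which matches the paper's observation that a short-context model such as LLaMA2 can still obtain a low PPL on long documents.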