LONGEmbed: Extending Embedding Models for Long Context Retrieval
25 Apr 2024 | Dawei Zhu, Liang Wang, Nan Yang, Yifan Song, Wenhao Wu, Furu Wei, Sujian Li
This paper explores extending the context window of existing embedding models to improve long-context retrieval. Current embedding models are confined to a narrow context window of at most 8k tokens, restricting their use in scenarios that require long inputs. The authors introduce the LONGEMBED benchmark, which combines synthetic and real-world tasks to evaluate embedding models on long-context retrieval. They demonstrate that training-free strategies such as position interpolation can substantially extend the context window of existing models. For models using absolute position embedding (APE), further fine-tuning can enhance long-input performance while preserving behavior on short inputs. For models using rotary position embedding (RoPE), RoPE-specific methods such as NTK-aware scaling and SelfExtend yield significant improvements. The authors release E5-Base-4k and E5-RoPE-Base, along with the LONGEMBED benchmark, to facilitate future research.

Results show that extending the context window greatly improves performance on long-context retrieval tasks, and that RoPE-based models extend more readily than APE-based ones. The paper also surveys context window extension strategies, including divide-and-conquer, position reorganization, and position interpolation, and evaluates their effectiveness across different embedding models. Overall, the findings suggest that training-free context window extension can significantly enhance the long-context retrieval capabilities of existing embedding models.
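To make the two extension families concrete, the sketch below illustrates them in PyTorch: plain position interpolation for an APE model (stretching a learned 512-position embedding table to a longer length) and NTK-aware base scaling for a RoPE model (enlarging the rotary base so low-frequency components cover a longer window). This is a minimal sketch of the general techniques, not the paper's released code; the function names and the 512/4096/768 shapes and values are assumptions for illustration.

```python
import torch
import torch.nn.functional as F


def interpolate_ape(pos_emb: torch.Tensor, target_len: int) -> torch.Tensor:
    """Position interpolation for APE models (sketch).

    pos_emb: learned absolute position embedding table, shape (src_len, hidden).
    Returns a (target_len, hidden) table obtained by linear interpolation
    along the position axis, so longer inputs reuse in-between points of the
    trained table rather than unseen, extrapolated positions.
    """
    src = pos_emb.t().unsqueeze(0)                      # (1, hidden, src_len)
    out = F.interpolate(src, size=target_len,
                        mode="linear", align_corners=True)
    return out.squeeze(0).t()                           # (target_len, hidden)


def ntk_scaled_rope_base(base: float, head_dim: int, scale: float) -> float:
    """NTK-aware scaling for RoPE models (sketch).

    Instead of interpolating positions directly, enlarge the rotary base so
    that low-frequency dimensions are stretched to cover a `scale`-times
    longer window while high-frequency dimensions change very little.
    """
    return base * scale ** (head_dim / (head_dim - 2))


# Hypothetical usage (shapes and values are illustrative, not from the paper):
old_table = torch.randn(512, 768)               # e.g. a BERT-style APE table
new_table = interpolate_ape(old_table, 4096)    # extend 512 -> 4096 positions

new_base = ntk_scaled_rope_base(10000.0, head_dim=64, scale=4.0)
print(new_table.shape, round(new_base))
```

In practice these operations would be applied to a real checkpoint's position embedding table or rotary embedding module rather than random tensors, and a method like SelfExtend would additionally remap relative positions inside the attention computation.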