LONGEmbed: Extending Embedding Models for Long Context Retrieval

25 Apr 2024 | Dawei Zhu, Liang Wang, Nan Yang, Yifan Song, Wenhao Wu, Furu Wei, Sujian Li
This paper explores extending the context window of existing embedding models to handle long inputs, a critical requirement for applications such as legal contract retrieval. The authors introduce the LONGEMBED benchmark, which comprises synthetic and real-world retrieval tasks featuring documents of varying length and dispersed target information, and they find that current embedding models leave substantial room for improvement on it. The paper then investigates training-free context window extension strategies, including parallel context windows, position reorganization, and position interpolation; these methods extend the context window of existing models severalfold, regardless of their original context size. For models using absolute position embeddings (APE), further fine-tuning yields additional gains while preserving the original behavior on short inputs. For models using rotary position embeddings (RoPE), methods that exploit RoPE's structure, such as NTK-aware scaling and SelfExtend, bring significantly larger improvements, indicating RoPE's advantage over APE for context window extension. The authors also release E5Base-4k and E5-RoPEBase, along with the LONGEMBED benchmark, to facilitate future research. The results highlight the potential of RoPE-based models for handling longer contexts and advocate their use in future embedding models.
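To make the RoPE-based extension strategies mentioned above concrete, below is a minimal sketch (not the paper's released code) of two training-free ways to stretch a RoPE model's context window: linear position interpolation, which squeezes new positions into the trained range, and NTK-aware scaling, which enlarges the rotary base instead. The head dimension, training length, and target length are illustrative assumptions.

```python
# Minimal sketch of RoPE context extension: position interpolation vs. NTK-aware scaling.
# All dimensions below are toy values chosen for illustration.
import numpy as np

def rope_angles(positions, dim, base=10000.0):
    """Per-position rotation angles for RoPE with head dimension `dim`."""
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))  # (dim/2,)
    return np.outer(positions, inv_freq)                     # (seq, dim/2)

def apply_rope(x, angles):
    """Rotate consecutive (even, odd) channel pairs of x by the given angles."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

dim, train_len, target_len = 64, 512, 2048
scale = target_len / train_len            # 4x extension

positions = np.arange(target_len)

# 1) Linear position interpolation: scale positions down into [0, train_len).
pi_angles = rope_angles(positions / scale, dim)

# 2) NTK-aware scaling: keep positions, enlarge the rotary base instead.
ntk_base = 10000.0 * scale ** (dim / (dim - 2))
ntk_angles = rope_angles(positions, dim, base=ntk_base)

q = np.random.randn(target_len, dim)
q_pi, q_ntk = apply_rope(q, pi_angles), apply_rope(q, ntk_angles)
print(q_pi.shape, q_ntk.shape)            # (2048, 64) (2048, 64)
```

Both variants change only how rotation angles are computed, which is why they can be applied to an existing RoPE-based embedding model without retraining; SelfExtend, also discussed in the paper, similarly reuses trained positions by grouping distant tokens onto shared position IDs.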