16 Apr 2024 | Woomin Song, Seunghyuk Oh, Sangwoo Mo, Jaehyung Kim, Sukmin Yun, Jung-Woo Ha, Jinwoo Shin
The paper introduces *Hierarchical Context MERging (HOMER)*, a novel training-free method to overcome the context limit of large language models (LLMs). HOMER takes a divide-and-conquer approach: it splits a long input into manageable chunks and progressively merges adjacent chunks at successive transformer layers. The chunks are processed collectively, and token reduction is applied before each merge to keep memory use in check. As a result, the memory requirement scales logarithmically with input length, making the method suitable for environments with limited memory. Experiments demonstrate HOMER's superior performance and memory efficiency, enabling LLMs to handle extended contexts in tasks including passkey retrieval, question answering, and language modeling. HOMER can also be combined with conventional positional encoding scaling methods, which further improves performance. The method is orthogonal to existing approaches and can be applied without additional training, making it practical for real-world applications.
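
The divide-and-conquer idea described above can be illustrated with a toy sketch. This is not the authors' implementation: it leaves out the transformer entirely and works on plain token ids. The function names (`split_into_chunks`, `reduce_tokens`, `hierarchical_merge`) and the caller-supplied `score` function are hypothetical stand-ins chosen for illustration; how HOMER actually ranks tokens is not specified in this summary. The sketch only shows the control flow: split, prune each chunk, merge adjacent pairs, and repeat until a single chunk remains.

```python
from typing import Callable, List, Sequence

Token = int  # toy stand-in for a token (or its hidden state)


def split_into_chunks(tokens: Sequence[Token], chunk_size: int) -> List[List[Token]]:
    """Divide a long input into consecutive chunks of at most `chunk_size` tokens."""
    return [list(tokens[i:i + chunk_size]) for i in range(0, len(tokens), chunk_size)]


def reduce_tokens(chunk: Sequence[Token], keep: int,
                  score: Callable[[Token], float]) -> List[Token]:
    """Keep the `keep` highest-scoring tokens, preserving their original order.

    `score` is a hypothetical importance function supplied by the caller.
    """
    if len(chunk) <= keep:
        return list(chunk)
    top = sorted(range(len(chunk)), key=lambda i: score(chunk[i]), reverse=True)[:keep]
    return [chunk[i] for i in sorted(top)]


def hierarchical_merge(chunks: Sequence[Sequence[Token]], chunk_size: int,
                       score: Callable[[Token], float]) -> List[Token]:
    """Merge adjacent chunks pairwise, level by level, until one chunk remains.

    Each chunk is pruned to half of `chunk_size` before merging, so a merged
    chunk never exceeds `chunk_size` tokens.
    """
    if not chunks:
        return []
    level = [list(c) for c in chunks]
    while len(level) > 1:
        merged_level = []
        for i in range(0, len(level), 2):
            if i + 1 < len(level):
                left = reduce_tokens(level[i], chunk_size // 2, score)
                right = reduce_tokens(level[i + 1], chunk_size // 2, score)
                merged_level.append(left + right)
            else:
                # An unpaired trailing chunk is carried to the next level as-is.
                merged_level.append(level[i])
        level = merged_level
    return level[0]


tokens = list(range(32))
chunks = split_into_chunks(tokens, chunk_size=8)
merged = hierarchical_merge(chunks, chunk_size=8, score=float)
print(merged)
```

With 32 tokens and a chunk size of 8, the four initial chunks collapse to a single chunk of 8 tokens after two merge levels. In this toy version the number of levels grows with the logarithm of the number of chunks, and every intermediate chunk stays within the chunk size.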