CEPE (Context Expansion with Parallel Encoding) is a framework that extends the context window of existing decoder-only large language models (LLMs) by adding a small encoder and cross-attention modules. The encoder processes long inputs in parallel chunks, and the frozen decoder attends to the encoded chunks through the newly inserted cross-attention layers. CEPE is efficient, generalizable, and versatile: it extends the context window of LLAMA-2 to 128K tokens while using only 1/6 of the memory and offering 10× the throughput.

On long-context language modeling benchmarks, CEPE outperforms existing methods in perplexity, memory usage, and throughput. It also performs strongly in retrieval-augmented applications, open-domain question answering, and in-context learning, and it can be adapted to new applications without explicit fine-tuning. A variant, CEPED, extends the context window of instruction-tuned models using only unlabeled data, yielding a strong instruction-following model that can leverage long contexts on downstream tasks.

Because the decoder stays frozen and only the small encoder and cross-attention modules are trained, CEPE is lightweight and applicable to any base or instruction-tuned LLM, with a significant reduction in memory and computational cost compared to full fine-tuning. It is effective in both long-context and retrieval-augmented settings and offers a cost-effective foundation for future work on long-context language modeling.
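The core mechanism can be illustrated with a small sketch. The snippet below is a hypothetical PyTorch illustration, not the released CEPE implementation; the module names, dimensions, and the toy encoder and cross-attention configuration are assumptions made for readability. It shows the three steps the description above implies: a long context is split into fixed-size chunks, a small encoder processes each chunk independently (in parallel), and a cross-attention block lets the decoder's hidden states attend to the concatenated chunk representations while the decoder itself stays frozen.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative sizes only; the real CEPE pairs a small pretrained encoder
# with the frozen LLAMA-2 decoder, neither of which is reproduced here.
CHUNK_LEN = 256
ENC_DIM = 512
DEC_DIM = 1024


def chunk_context(context_ids: torch.Tensor, chunk_len: int = CHUNK_LEN) -> torch.Tensor:
    """Split a long context of shape (1, T) into (num_chunks, chunk_len), padding the tail."""
    total = context_ids.size(1)
    pad = (-total) % chunk_len
    padded = F.pad(context_ids, (0, pad), value=0)
    return padded.view(-1, chunk_len)


class ParallelContextEncoder(nn.Module):
    """Small encoder applied to each chunk independently, so chunks are encoded in parallel."""

    def __init__(self, vocab_size: int = 32000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, ENC_DIM)
        layer = nn.TransformerEncoderLayer(ENC_DIM, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, chunks: torch.Tensor) -> torch.Tensor:
        # chunks: (num_chunks, chunk_len) -> (num_chunks, chunk_len, ENC_DIM)
        return self.encoder(self.embed(chunks))


class CrossAttentionBlock(nn.Module):
    """Cross-attention module inserted alongside the frozen decoder.

    Only this block and the encoder would be trained; the decoder parameters stay frozen.
    The sketch assumes a batch size of 1.
    """

    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(ENC_DIM, DEC_DIM)
        self.cross_attn = nn.MultiheadAttention(DEC_DIM, num_heads=8, batch_first=True)
        self.norm = nn.LayerNorm(DEC_DIM)

    def forward(self, decoder_hidden: torch.Tensor, encoded_chunks: torch.Tensor) -> torch.Tensor:
        # Flatten the per-chunk outputs into one long memory: (1, num_chunks * chunk_len, DEC_DIM)
        memory = self.proj(encoded_chunks).reshape(1, -1, DEC_DIM)
        attended, _ = self.cross_attn(decoder_hidden, memory, memory)
        return self.norm(decoder_hidden + attended)


if __name__ == "__main__":
    context_ids = torch.randint(0, 32000, (1, 1000))      # long context tokens
    decoder_hidden = torch.randn(1, 128, DEC_DIM)          # hidden states from the frozen decoder

    chunks = chunk_context(context_ids)                    # (4, 256)
    encoded = ParallelContextEncoder()(chunks)             # (4, 256, 512)
    fused = CrossAttentionBlock()(decoder_hidden, encoded) # (1, 128, 1024)
    print(fused.shape)
```

In this sketch only the encoder and the inserted cross-attention parameters would receive gradients; keeping the decoder frozen is what makes the approach lightweight relative to full fine-tuning, and encoding chunks independently is what allows the context to grow without quadratic self-attention cost over the full input.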