9 May 2024 | Yutao Sun†‡, Li Dong*, Yi Zhu†, Shaohan Huang†, Wenhui Wang†, Shuming Ma†, Quanlu Zhang†, Jianyong Wang‡, Furu Wei†
The paper introduces YOCO, a decoder-decoder architecture for large language models that caches key-value pairs only once. YOCO consists of a self-decoder followed by a cross-decoder: the self-decoder uses efficient self-attention to encode a single global key-value (KV) cache, which every cross-decoder layer then reuses via cross-attention. This design substantially reduces GPU memory demands while retaining global attention capability. Because the cross-decoder's outputs are only needed for the tokens being generated, prefilling can exit early after the self-decoder without changing the final output, significantly speeding up the prefill stage. Experimental results show that YOCO performs favorably against Transformers in various settings, including scaling up model size and the number of training tokens. YOCO also extends to a 1M-token context length with near-perfect needle-retrieval accuracy. Profiling results show that YOCO reduces inference memory and prefill latency and improves throughput by orders of magnitude across context lengths and model sizes.
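To make the decoder-decoder layout concrete, below is a minimal PyTorch sketch of the idea, not the authors' implementation: module names, layer counts, and the choice of sliding-window attention for the self-decoder are illustrative assumptions (the paper also discusses other efficient self-attention variants such as gated retention). The self-decoder runs constant-memory local attention, its output is projected once into a global KV cache, and every cross-decoder layer attends to that same cache.

```python
# Hypothetical sketch of a YOCO-style decoder-decoder stack (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F


class SlidingWindowSelfAttention(nn.Module):
    """Self-decoder attention with a fixed local window, so its KV cache stays
    constant-size regardless of context length (one possible efficient choice)."""

    def __init__(self, d_model: int, n_heads: int, window: int):
        super().__init__()
        self.n_heads, self.d_head, self.window = n_heads, d_model // n_heads, window
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)
        self.out = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (z.view(b, t, self.n_heads, self.d_head).transpose(1, 2) for z in (q, k, v))
        # Causal mask restricted to the local window.
        idx = torch.arange(t, device=x.device)
        mask = (idx[:, None] >= idx[None, :]) & (idx[:, None] - idx[None, :] < self.window)
        attn = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
        return self.out(attn.transpose(1, 2).reshape(b, t, -1))


class CrossDecoderLayer(nn.Module):
    """Cross-decoder layer: queries come from the current hidden states, while
    keys/values are the single global KV cache produced by the self-decoder."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.q = nn.Linear(d_model, d_model, bias=False)
        self.out = nn.Linear(d_model, d_model, bias=False)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.ln1, self.ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x, k_global, v_global, causal_mask):
        b, t, _ = x.shape
        q = self.q(self.ln1(x)).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        attn = F.scaled_dot_product_attention(q, k_global, v_global, attn_mask=causal_mask)
        x = x + self.out(attn.transpose(1, 2).reshape(b, t, -1))
        return x + self.ffn(self.ln2(x))


class YOCOSketch(nn.Module):
    """Self-decoder layers (local attention) followed by cross-decoder layers
    that all read the same global KV cache, so KV is cached only once."""

    def __init__(self, d_model=256, n_heads=4, n_layers=4, window=128):
        super().__init__()
        half = n_layers // 2
        self.self_decoder = nn.ModuleList(
            [SlidingWindowSelfAttention(d_model, n_heads, window) for _ in range(half)])
        # Single projection that produces the one global KV cache.
        self.kv_proj = nn.Linear(d_model, 2 * d_model, bias=False)
        self.cross_decoder = nn.ModuleList(
            [CrossDecoderLayer(d_model, n_heads) for _ in range(half)])
        self.n_heads, self.d_head = n_heads, d_model // n_heads

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        for layer in self.self_decoder:
            x = x + layer(x)
        # Global KV cache, computed once and shared by every cross-decoder layer.
        k, v = self.kv_proj(x).chunk(2, dim=-1)
        k = k.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        v = v.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        causal = torch.tril(torch.ones(t, t, dtype=torch.bool, device=x.device))
        for layer in self.cross_decoder:
            x = layer(x, k, v, causal)
        return x


if __name__ == "__main__":
    model = YOCOSketch()
    hidden = model(torch.randn(2, 64, 256))  # (batch, seq_len, d_model)
    print(hidden.shape)
```

In this sketch the memory saving and the prefill early exit both fall out of the structure: only one global KV cache exists no matter how many cross-decoder layers there are, and during prefill the cross-decoder loop only needs to run for the positions whose outputs are actually consumed, since the cache it reads depends solely on the self-decoder output.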