16 Apr 2024 | Xuezhe Ma, Xiaomeng Yang, Wenhan Xiong, Beidi Chen, Lili Yu, Hao Zhang, Jonathan May, Luke Zettlemoyer, Omer Levy, Chunting Zhou
MEGALODON is a neural architecture designed for efficient sequence modeling with unlimited context length. It builds on the MEGA architecture, which combines gated attention with an exponential moving average (EMA). MEGALODON introduces several novel components to improve its capability and training stability: the complex exponential moving average (CEMA), timestep normalization, normalized attention, and a pre-norm configuration with two-hop residuals. Together, these components address the quadratic complexity and poor length extrapolation of Transformers, making MEGALODON more efficient and scalable for long-sequence modeling.
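To make the CEMA idea concrete, here is a minimal NumPy sketch of a damped exponential moving average whose decay is rotated into the complex plane, which is the core intuition behind CEMA. The function name `cema`, the parameters `alpha`, `delta`, `theta`, and `eta`, and the single-dimension recurrence are simplifying assumptions for illustration, not the paper's exact multi-dimensional parameterization.

```python
import numpy as np

def cema(x, alpha, delta, theta, eta):
    """Damped EMA with a complex (rotated) decay -- a simplified CEMA-style
    recurrence for illustration only.

    x:     (T,) real input sequence (a single expanded dimension for simplicity)
    alpha: damping factor in (0, 1)
    delta: decay factor in (0, 1)
    theta: rotation angle giving the complex phase
    eta:   output projection weight
    """
    phase = np.exp(1j * theta)            # complex rotation e^{i*theta}
    h = 0.0 + 0.0j                        # complex-valued hidden state
    y = np.empty_like(x, dtype=float)
    for t, x_t in enumerate(x):
        # decay the previous state and mix in the rotated current input
        h = alpha * phase * x_t + (1.0 - alpha * delta) * phase * h
        y[t] = np.real(eta * h)           # project back to the reals
    return y

# Toy usage: smooth a noisy ramp with one CEMA-style channel.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 64) + 0.1 * rng.standard_normal(64)
y = cema(x, alpha=0.6, delta=0.9, theta=np.pi / 8, eta=1.5)
print(y[:5])
```

The rotation by `theta` is what distinguishes this from a plain real-valued damped EMA: it lets each channel carry oscillatory, longer-range structure while the damping term keeps the recurrence stable.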
In a head-to-head comparison with Llama2 at the scale of 7 billion parameters and 2 trillion training tokens, MEGALODON achieves better training efficiency, reaching a lower training loss and stronger downstream task performance. It also performs well across benchmarks spanning different tasks and modalities, including long-context modeling, long-context QA, and instruction fine-tuning. The architecture's ability to model sequences of unlimited length is validated through experiments on long sequences and open-book question answering datasets.
The paper also discusses the architectural details, training methods, and experimental results, highlighting MEGALODON's potential for large-scale multi-modality pretraining.