16 Apr 2024 | Xuezhe Ma, Xiaomeng Yang, Wenhan Xiong, Beidi Chen, Lili Yu, Hao Zhang, Jonathan May, Luke Zettlemoyer, Omer Levy, Chunting Zhou
MEGALODON is an efficient large language model (LLM) architecture that enables sequence modeling with unlimited context length. It builds on MEGA (Moving Average Equipped Gated Attention), which combines an exponential moving average (EMA) with gated attention. To improve capability and training stability, MEGALODON introduces several technical components: the complex exponential moving average (CEMA), a timestep normalization layer, a normalized attention mechanism, and a pre-norm with two-hop residual configuration. Together, these components improve efficiency and scalability, allowing the model to handle long sequences more effectively than standard Transformer-based models.
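To make the CEMA component more concrete, here is a minimal PyTorch sketch of a complex-valued damped EMA recurrence in the spirit of MEGALODON's CEMA. This is not the authors' implementation: the function name, parameter shapes, and toy inputs are illustrative assumptions, and in practice the recurrence is computed efficiently (e.g., as a convolution) rather than with an explicit Python loop.

```python
# Minimal sketch (assumed, not the official code) of a complex damped EMA.
# Each of the d model dimensions is expanded to h hidden EMA dimensions,
# following the multi-dimensional EMA convention of MEGA.
import torch


def complex_ema(x, alpha, delta, theta, beta, eta):
    """Sequential complex EMA over a (batch, length, d) input.

    alpha, delta in (0, 1): per-dimension decay parameters, shape (d, h)
    theta: per-dimension rotation angles, shape (d, h)
    beta:  input expansion weights, shape (d, h)
    eta:   output projection weights, shape (d, h)
    """
    bsz, length, d = x.shape
    # Complex factors: magnitude from (alpha, delta), phase from theta.
    phase = torch.polar(torch.ones_like(theta), theta)   # e^{i*theta}
    p = alpha * phase                                     # input factor
    q = (1.0 - alpha * delta) * phase                     # recurrent decay factor
    state = torch.zeros(bsz, d, alpha.shape[-1], dtype=torch.cfloat)
    ys = []
    for t in range(length):
        u = x[:, t, :].unsqueeze(-1) * beta               # expand to (bsz, d, h)
        state = p * u + q * state                         # damped, rotating EMA
        ys.append((state * eta).sum(dim=-1).real)         # project back to (bsz, d)
    return torch.stack(ys, dim=1)                         # (bsz, length, d)


# Toy usage, just to show the shapes.
d, h = 8, 4
x = torch.randn(2, 16, d)
alpha = torch.sigmoid(torch.randn(d, h))
delta = torch.sigmoid(torch.randn(d, h))
theta = torch.randn(d, h)
beta = torch.randn(d, h)
eta = torch.randn(d, h)
print(complex_ema(x, alpha, delta, theta, beta, eta).shape)  # torch.Size([2, 16, 8])
```

The complex phase theta is what distinguishes this recurrence from the real-valued damped EMA used in MEGA; the real part of the projected state is taken as the layer output.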
In a head-to-head comparison with LLAMA2, MEGALODON achieves better efficiency at the scale of 7 billion parameters and 2 trillion training tokens. It reaches a training loss of 1.70, mid-way between LLAMA2-7B (1.75) and LLAMA2-13B (1.67). MEGALODON demonstrates robust improvements across a range of benchmarks, including long-context modeling, long-context QA, and short-context academic tasks. It outperforms LLAMA2-7B on both training perplexity and downstream benchmarks, and is competitive with LLAMA2-13B on some tasks.
MEGALODON also shows strong performance on medium-scale benchmarks, including image classification on ImageNet-1K and auto-regressive language modeling on PG-19. It significantly outperforms previous state-of-the-art models on these tasks. Additionally, MEGALODON exhibits superior performance on instruction following and alignment tasks, outperforming Vicuna on the MT-Bench benchmark.
MEGALODON's architecture also allows for efficient distributed training, opening a path toward large-scale multi-modality pretraining. Its ability to handle long sequences, together with its efficiency in both training and inference, makes it a promising direction for future research in large language modeling.