10 Jul 2024 | M-A-P, University of Waterloo, Wuhan AI Research, 01.AI
MAP-Neo is a highly capable and transparent bilingual large language model (LLM) with 7 billion parameters, trained from scratch on 4.5 trillion high-quality tokens. It is the first fully open-sourced bilingual LLM whose performance is comparable to existing state-of-the-art LLMs. The research team has open-sourced every component of the project, including the cleaned pre-training corpus, the data cleaning pipeline, model checkpoints, and a well-optimized training/evaluation framework. MAP-Neo aims to strengthen the open research community and inspire further innovation in LLMs.
The paper argues that open-source, transparent LLMs are essential for democratizing access and advancing academic research. It highlights the limitations of existing open-source LLMs such as OLMo, which lag behind in coding, reasoning, and knowledge-intensive tasks. MAP-Neo addresses these gaps and also serves as a comprehensive handbook for building LLMs from scratch, covering the entire workflow from data curation to model training.
The paper also details the Matrix data pile, a bilingual pre-training corpus of 4.5 trillion tokens and the largest transparent LLM pre-training corpus to date. The corpus is built from three pipelines: a re-processing pipeline for existing open datasets, a crawl-from-scratch pipeline for Chinese content, and a document conversion pipeline that collects high-quality supplementary data.
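To make the corpus construction more concrete, the following is a minimal Python sketch of the kind of heuristic filtering and exact deduplication such a cleaning pipeline applies. The `Document` structure, function names, and thresholds are illustrative assumptions rather than the exact rules used for Matrix, and real pipelines typically add fuzzy (e.g., MinHash) deduplication and further quality filters.

```python
import hashlib
from dataclasses import dataclass
from typing import Iterable, Iterator, List

@dataclass
class Document:
    url: str
    text: str

def passes_heuristics(doc: Document, min_chars: int = 200, max_symbol_ratio: float = 0.3) -> bool:
    """Cheap quality filters: drop very short pages and pages dominated by non-alphanumeric symbols."""
    if len(doc.text) < min_chars:
        return False
    symbols = sum(1 for c in doc.text if not (c.isalnum() or c.isspace()))
    return symbols / len(doc.text) <= max_symbol_ratio

def deduplicate(docs: Iterable[Document]) -> Iterator[Document]:
    """Exact deduplication by hashing normalized text."""
    seen = set()
    for doc in docs:
        key = hashlib.sha256(doc.text.strip().lower().encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            yield doc

def clean(docs: Iterable[Document]) -> List[Document]:
    """Filter first, then deduplicate, mirroring the usual ordering in web-scale cleaning pipelines."""
    return list(deduplicate(d for d in docs if passes_heuristics(d)))
```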
The MAP-Neo architecture is a transformer decoder augmented with multi-query attention, RoPE positional embeddings, RMSNorm, and the SwiGLU activation function. Pre-training follows a two-stage strategy: a fundamental phase for general text generation, followed by a decay phase that improves reliability using high-quality data and additional code.
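For reference, here is a minimal PyTorch sketch of two of these components, RMSNorm and a SwiGLU feed-forward block. Module names and dimensions are illustrative, and the attention block with RoPE and multi-query attention is omitted for brevity, so this shows the general technique rather than MAP-Neo's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root-mean-square normalization: rescales activations without centering them."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * x * rms

class SwiGLUFeedForward(nn.Module):
    """Gated feed-forward block: SiLU(x W1) * (x W3), projected back down by W2."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)  # gate projection
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)  # value projection
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)  # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.silu(self.w1(x)) * self.w3(x))
```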
The alignment process consists of supervised fine-tuning followed by iterative DPO (Direct Preference Optimization) to align the model with human preferences. The paper also presents a scaling law for MAP-Neo that predicts training configurations from the ratio of training data to model size.
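As an illustration of the DPO objective used in the second alignment stage, the sketch below computes the standard DPO loss from precomputed sequence log-probabilities; the iterative variant reuses the same loss on preference pairs regenerated with the updated policy. Argument names and the beta value are illustrative assumptions, not MAP-Neo's exact training code.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO loss.

    Each argument holds summed token log-probabilities of the chosen (preferred)
    or rejected response under the trainable policy or the frozen reference model.
    """
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between the implicit rewards of chosen and rejected responses.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```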
Overall, MAP-Neo represents a significant advancement in the field of open-source LLMs, offering superior performance and transparency, and providing a valuable resource for future research and development.