MAP-Neo: Highly Capable and Transparent Bilingual Large Language Model Series

10 Jul 2024 | M-A-P, University of Waterloo, Wuhan AI Research, 01.AI
MAP-Neo is a fully open-sourced bilingual large language model (LLM) with 7B parameters, trained on 4.5T high-quality tokens. It is the first fully open-sourced bilingual LLM to achieve performance comparable to existing state-of-the-art closed-source models. The model is transparent: all training details, the pre-training corpus, the data cleaning pipeline, intermediate checkpoints, and the training and evaluation framework are publicly available, which makes the work reproducible and open to follow-up research. MAP-Neo is intended to strengthen the open-source research community and to inspire further innovation in LLM development.

The model is built with a comprehensive pipeline covering data curation, pre-training, alignment, and evaluation. Data curation cleans and processes the training data, including a stable OCR system and data recall mechanisms. The pre-training corpus, Matrix Data Pile, is a bilingual dataset of 4.5T tokens sourced from diverse corpora and processed through multiple stages of filtering, deduplication, and conversion; it mixes Chinese and English text with an emphasis on high-quality content.

The architecture is a transformer decoder with enhancements such as multi-query attention, rotary positional embeddings, and RMSNorm. Training proceeds in two phases: a fundamental phase focused on general text generation, and a decay phase that improves the model's reliability and its performance on code generation. The two phases are driven by a two-stage learning rate scheduler, with the learning rate decaying exponentially during the decay phase.
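To make the schedule concrete, here is a minimal sketch of such a two-stage scheduler, assuming a linear warmup and a constant learning rate during the fundamental phase followed by exponential decay. The function name and every hyperparameter value are illustrative placeholders, not the settings reported for MAP-Neo.

```python
def lr_at_step(step: int,
               peak_lr: float = 2e-4,
               warmup_steps: int = 2_000,
               fundamental_steps: int = 900_000,
               decay_steps: int = 100_000,
               final_lr: float = 2e-5) -> float:
    """Two-stage schedule: warmup + constant LR (fundamental phase),
    then exponential decay toward final_lr (decay phase).
    All values here are illustrative, not MAP-Neo's actual settings."""
    if step < warmup_steps:
        # Linear warmup to the peak learning rate.
        return peak_lr * step / warmup_steps
    if step < fundamental_steps:
        # Fundamental phase: hold the peak learning rate constant.
        return peak_lr
    # Decay phase: geometric interpolation from peak_lr down to final_lr.
    progress = min(1.0, (step - fundamental_steps) / decay_steps)
    return peak_lr * (final_lr / peak_lr) ** progress
```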
After pre-training, the model is aligned through supervised fine-tuning (SFT) and iterative DPO: SFT strengthens instruction-following and chat ability, while DPO aligns the model with human preferences.

MAP-Neo is evaluated on benchmarks covering Chinese and English understanding, mathematical ability, and code ability, where it performs strongly, reflecting the quality of its data and the effectiveness of its training recipe. Its transparency and open-source release make it a valuable resource for the research community, providing a complete framework for building and improving LLMs; by making all training details and data publicly available, MAP-Neo encourages further research and innovation in large language models.
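As a closing illustration of the preference-alignment step, the sketch below shows the standard DPO objective on a batch of preference pairs (chosen vs. rejected responses). The tensor names and the beta value are assumptions for illustration, and the data-regeneration loop that makes the procedure iterative is not shown.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO loss. Each tensor holds the summed log-probability of a
    full response under the trainable policy or the frozen reference model."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Encourage a larger implicit-reward margin for the chosen response.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```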