Nemotron-4 15B Technical Report

27 Feb 2024 | Jupinder Parmar*, Shrimai Prabhumoye*, Joseph Jennings*, Mostofa Patwary*, Sandeep Subramanian†, Dan Su, Chen Zhu, Deepak Narayanan, Aastha Jhunjhunwala, Ayush Dattagupta, Vibhu Jawa, Jiwei Liu, Ameya Mahabaleshwarkar, Osvald Nitski, Annika Brundyn, James Maki, Miguel Martinez, Jiaxuan You, John Kamalu, Patrick LeGresley, Denys Fridman, Jared Casper, Ashwath Aithal, Oleksii Kuchaiev, Mohammad Shoeybi, Jonathan Cohen, Bryan Catanzaro
The paper introduces Nemotron-4 15B, a 15-billion-parameter multilingual language model trained on 8 trillion text tokens. The model demonstrates strong performance across English, multilingual, and coding tasks: it outperforms existing similarly-sized open models in 4 of 7 downstream evaluation areas and is competitive in the remaining ones. Nemotron-4 15B is particularly strong in multilingual capabilities, outperforming models more than four times its size as well as models specialized for multilingual tasks. The model uses a decoder-only Transformer architecture with causal attention masks and was trained on 384 DGX H100 nodes using a combination of tensor parallelism and data parallelism. The pre-training dataset consists of 70% English natural language data, 15% multilingual natural language data, and 15% source-code data. The paper also details the model's architecture, training process, and evaluation results, highlighting its performance across a range of benchmarks and tasks.
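As context for the architecture described above, the sketch below shows causal (masked) self-attention, the core operation of a decoder-only Transformer: each position may attend only to itself and earlier positions. This is an illustrative example only, not Nemotron-4's actual implementation; the dimensions, weight layout, and function names are assumptions for demonstration.

```python
# Minimal sketch of causal self-attention in a decoder-only Transformer.
# Hypothetical shapes and weights; not Nemotron-4's real configuration.
import torch
import torch.nn.functional as F

def causal_self_attention(x, w_qkv, w_out, n_heads):
    """x: (batch, seq_len, d_model). Returns a tensor of the same shape."""
    b, t, d = x.shape
    head_dim = d // n_heads

    # Project to queries, keys, values in one matmul, then split per head.
    q, k, v = (x @ w_qkv).chunk(3, dim=-1)
    q, k, v = (z.view(b, t, n_heads, head_dim).transpose(1, 2) for z in (q, k, v))

    # Causal mask: position i may only attend to positions <= i.
    scores = (q @ k.transpose(-2, -1)) / head_dim ** 0.5
    mask = torch.triu(torch.ones(t, t, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))

    # Attention-weighted sum of values, then merge heads and project out.
    out = F.softmax(scores, dim=-1) @ v
    out = out.transpose(1, 2).reshape(b, t, d)
    return out @ w_out

# Tiny usage example with random weights (toy sizes, not the 15B model's).
d_model, n_heads, seq_len = 64, 4, 8
x = torch.randn(2, seq_len, d_model)
w_qkv = torch.randn(d_model, 3 * d_model) / d_model ** 0.5
w_out = torch.randn(d_model, d_model) / d_model ** 0.5
y = causal_self_attention(x, w_qkv, w_out, n_heads)
print(y.shape)  # torch.Size([2, 8, 64])
```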