Code Llama: Open Foundation Models for Code

31 Jan 2024 | Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Romain Sauvestre, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo Touvron, Louis Martin, Nicolas Usunier, Thomas Scialom, Gabriel Synnaeve
Code Llama is a family of large language models for code based on Llama 2, offering state-of-the-art performance among open models, infilling capabilities, support for large input contexts, and zero-shot instruction-following ability for programming tasks. The family includes foundation models (Code Llama), Python specializations (Code Llama - PYTHON), and instruction-following models (Code Llama - INSTRUCT), each with 7B, 13B, 34B, and 70B parameters. The models are trained on sequences of 16k tokens and show improvements on inputs of up to 100k tokens. The 7B, 13B, and 70B Code Llama and Code Llama - INSTRUCT variants support infilling based on surrounding content.

Code Llama reaches state-of-the-art performance among open models on several code benchmarks, with scores of up to 67% and 65% on HumanEval and MBPP, respectively. Notably, Code Llama - PYTHON 7B outperforms Llama 2 70B on HumanEval and MBPP, and all Code Llama models outperform every other publicly available model on MultiPL-E. Code Llama is released under a permissive license that allows both research and commercial use.

The models are trained on a near-deduplicated dataset of publicly available code, together with natural-language datasets related to code. They are trained with an infilling objective, enabling applications such as real-time completion in source-code editors and docstring generation, and are fine-tuned to handle long contexts, extending the usable context length from the 4,096 tokens of Llama 2 to 100,000 tokens. Instruction fine-tuning is used to improve safety and helpfulness: Code Llama - INSTRUCT variants are further fine-tuned on a mix of proprietary instruction data and a new machine-generated self-instruct dataset, yielding significant improvements on truthfulness, toxicity, and bias benchmarks.

The models are evaluated on the major code-generation benchmarks HumanEval, MBPP, and APPS, as well as on MultiPL-E, a multilingual version of HumanEval, and the results show that Code Llama establishes a new state of the art among open-source LLMs. The report provides the technical details of the training and fine-tuning procedures, in-depth experiments and ablation studies, the safety/helpfulness evaluations, and a discussion of related work. The models are also evaluated on infilling benchmarks and long-context tasks, where they show significant gains. On the safety side, red-teaming exercises indicate that Code Llama - INSTRUCT resists malicious prompts and produces safer responses, and the models are additionally assessed for false refusals of benign prompts.
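As a concrete illustration of the infilling capability described above, here is a minimal sketch using the Hugging Face transformers integration of the publicly released codellama/CodeLlama-7b-hf checkpoint, whose tokenizer expands a <FILL_ME> sentinel into the model's prefix/suffix infill tokens. The checkpoint name and API behavior reflect the public release tooling, not code from the paper itself.

```python
# Minimal infilling sketch via Hugging Face `transformers`. Assumes the public
# "codellama/CodeLlama-7b-hf" checkpoint; its tokenizer rewrites the <FILL_ME>
# sentinel into the model's prefix/suffix infill tokens.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("codellama/CodeLlama-7b-hf")
model = AutoModelForCausalLM.from_pretrained("codellama/CodeLlama-7b-hf")

# The model generates the missing middle (here, a docstring) conditioned on
# both the code before and the code after the gap.
prompt = 'def remove_non_ascii(s: str) -> str:\n    """ <FILL_ME>\n    return result\n'

input_ids = tokenizer(prompt, return_tensors="pt")["input_ids"]
output = model.generate(input_ids, max_new_tokens=128)

# Keep only the newly generated tokens: they are the infilled span.
filling = tokenizer.batch_decode(
    output[:, input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(prompt.replace("<FILL_ME>", filling))
```

This is exactly the editor use case the summary mentions: the surrounding code acts as prefix and suffix, and the completion is spliced into the gap rather than appended at the end.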
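The long-context handling mentioned above is achieved in the paper by long-context fine-tuning that raises the base period of the rotary position embeddings (RoPE) from Llama 2's 10,000 to 1,000,000. The sketch below only illustrates the standard RoPE frequency schedule under that change; the dimensions and sequence lengths are placeholder values, not configuration from the release.

```python
# Illustrative RoPE frequency schedule. The Code Llama paper reports raising
# the RoPE base period from 10,000 (Llama 2) to 1,000,000 for long-context
# fine-tuning; head_dim and max_pos below are placeholders.
import torch

def rope_angles(head_dim: int, max_pos: int, base: float) -> torch.Tensor:
    """Angles theta[p, i] = p * base**(-2i/d) used to rotate query/key pairs."""
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim))
    positions = torch.arange(max_pos, dtype=torch.float32)
    return torch.outer(positions, inv_freq)  # shape: (max_pos, head_dim // 2)

short = rope_angles(128, 16_384, base=10_000.0)       # Llama 2 default base
long = rope_angles(128, 100_000, base=1_000_000.0)    # long-context variant
# A larger base slows the rotation per position, so very distant positions
# remain distinguishable instead of aliasing at long range.
```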
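For the zero-shot instruction following mentioned above, prompts for Code Llama - INSTRUCT can be built with the Llama 2 chat template that the 7B to 34B Instruct checkpoints inherit. This is a hypothetical sketch of that template; the 70B Instruct release uses a different format, so consult the model card before relying on it.

```python
# Hypothetical prompt construction for Code Llama - INSTRUCT (7B-34B),
# assuming the Llama 2 [INST]/<<SYS>> chat template; the system and user
# strings are made-up examples.
system = "Provide answers in Python only."
user = "Write a function that checks whether a string is a palindrome."

prompt = f"<s>[INST] <<SYS>>\n{system}\n<</SYS>>\n\n{user} [/INST]"
print(prompt)
```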
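The HumanEval and MBPP scores quoted above follow the pass@k convention used by these benchmarks. For context, here is the standard unbiased pass@k estimator introduced with HumanEval (Chen et al., 2021); this is the benchmark's convention, not code from the Code Llama release, and the example numbers are arbitrary.

```python
# Unbiased pass@k estimator from the HumanEval paper (Chen et al., 2021):
# the probability that at least one of k sampled completions passes, given
# n total samples of which c passed the unit tests.
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0  # every size-k subset must contain a passing sample
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example with arbitrary numbers: 200 samples per task, 53 correct, pass@10.
print(pass_at_k(200, 53, 10))
```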