CodeGemma is a collection of specialized open code models built on top of Google DeepMind’s Gemma models, designed for a variety of code and natural language generation tasks. The paper introduces three model variants: a pretrained and an instruction-tuned CodeGemma 7B, and CodeGemma 2B, a state-of-the-art code completion model. These models are trained on extensive code and natural language data, performing strongly in both code completion and code generation while retaining the underlying models' natural language understanding and reasoning skills.
The models are trained on a mixture of web documents, mathematics, and code, with a particular focus on improving code infilling and open-ended generation. The 2B model is especially fast at inference, making it well suited to latency-sensitive applications, while the 7B models offer stronger performance on coding tasks.
The paper details the pretraining and instruction-tuning processes, including the use of fill-in-the-middle (FIM) tasks and synthetic data to enhance mathematical reasoning and problem-solving skills. Evaluations across domains such as code completion, multilingual coding, and natural language understanding show strong performance relative to other models in the same size class.
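As a concrete illustration of the FIM setup, CodeGemma's completion prompts wrap the surrounding code in sentinel tokens (<|fim_prefix|>, <|fim_suffix|>, <|fim_middle|>, <|file_separator|>). The following Python sketch assembles a prefix-suffix-middle (PSM) style prompt; the helper name and example snippet are illustrative, not taken from the paper.

    # Minimal sketch of a prefix-suffix-middle (PSM) infilling prompt using
    # CodeGemma's FIM sentinel tokens; the helper and example are illustrative.
    def build_fim_prompt(prefix: str, suffix: str) -> str:
        """Assemble a PSM-style fill-in-the-middle prompt."""
        return f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>"

    # Example: ask the model to fill in the body of a function.
    prefix = "def reverse_words(s: str) -> str:\n    "
    suffix = "\n    return result\n"
    print(build_fim_prompt(prefix, suffix))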
The authors provide recommendations for using the models, including prompt formatting and stopping strategies, and highlight the practical considerations for deployment in different settings. The paper concludes by emphasizing the transferability of the technologies used in Gemma and CodeGemma to downstream applications, and the broader community's interest in these models.
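As an example of a stopping strategy, generation can be cut off on the client side when the model emits a FIM sentinel or the file separator token. A minimal sketch, assuming the completion is available as raw text (the truncation helper and example output are illustrative):

    # Minimal sketch of client-side truncation at the first FIM sentinel, so an
    # infilled completion does not run past the intended region (illustrative).
    FIM_STOP_TOKENS = (
        "<|fim_prefix|>",
        "<|fim_suffix|>",
        "<|fim_middle|>",
        "<|file_separator|>",
    )

    def truncate_at_stop_token(completion: str) -> str:
        """Cut the completion at the earliest occurrence of any sentinel."""
        cut = len(completion)
        for token in FIM_STOP_TOKENS:
            idx = completion.find(token)
            if idx != -1:
                cut = min(cut, idx)
        return completion[:cut]

    # Hypothetical raw model output with a trailing sentinel and spillover text.
    raw = "result = ' '.join(reversed(s.split()))<|file_separator|>spillover"
    print(truncate_at_stop_token(raw))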