The Larger the Better? Improved LLM Code-Generation via Budget Reallocation

2024 | Michael Hassid, Tal Remez, Jonas Gehring, Roy Schwartz, Yossi Adi
The paper "The Larger the Better? Improved LLM Code-Generation via Budget Reallocation" investigates whether larger language models (LLMs) outperform smaller ones in code generation tasks when both are constrained by the same compute budget. The study compares different LLM sizes, including 7B, 13B, 34B, and 70B models, and evaluates their performance on code generation benchmarks such as HumanEval, MBPP, and APPS. The results show that smaller models can achieve comparable or better performance than larger ones under the same compute budget, with gains of up to 15% across five tasks. This is particularly true when unit-tests are available, as they allow for selecting the best output from multiple generations of smaller models. However, when unit-tests are unavailable, ranking-based selection of outputs from smaller models falls short of the performance of a single output from larger models. The study also highlights the importance of developing effective ranking approaches for LLM outputs, especially in scenarios where no unit-tests or other verification methods are available. The authors release over 1 million code generation outputs from the 7B Code Llama model for the HumanEval and MBPP benchmarks to support further research in this area. The findings suggest that smaller models can be more efficient and effective in code generation tasks under fixed compute budgets, and that further research into ranking methods is needed to improve the performance of LLMs in scenarios without unit-tests.The paper "The Larger the Better? Improved LLM Code-Generation via Budget Reallocation" investigates whether larger language models (LLMs) outperform smaller ones in code generation tasks when both are constrained by the same compute budget. The study compares different LLM sizes, including 7B, 13B, 34B, and 70B models, and evaluates their performance on code generation benchmarks such as HumanEval, MBPP, and APPS. The results show that smaller models can achieve comparable or better performance than larger ones under the same compute budget, with gains of up to 15% across five tasks. This is particularly true when unit-tests are available, as they allow for selecting the best output from multiple generations of smaller models. However, when unit-tests are unavailable, ranking-based selection of outputs from smaller models falls short of the performance of a single output from larger models. The study also highlights the importance of developing effective ranking approaches for LLM outputs, especially in scenarios where no unit-tests or other verification methods are available. The authors release over 1 million code generation outputs from the 7B Code Llama model for the HumanEval and MBPP benchmarks to support further research in this area. The findings suggest that smaller models can be more efficient and effective in code generation tasks under fixed compute budgets, and that further research into ranking methods is needed to improve the performance of LLMs in scenarios without unit-tests.