Instruction Tuning for Secure Code Generation

2024 | Jingxuan He, Mark Vero, Gabriela Krasnopolska, Martin Vechev
SafeCoder is an instruction tuning method designed to enhance the security of code generated by large language models (LLMs). Existing instruction tuning methods largely overlook the security of generated code, so tuned models frequently produce unsafe programs. SafeCoder addresses this by performing security-centric fine-tuning on a diverse, high-quality dataset and combining it with standard instruction tuning, jointly optimizing for security and utility.

During security fine-tuning, SafeCoder applies a masked language modeling loss to secure code and an unlikelihood loss to unsafe code; both losses are masked so that training focuses on the security-critical parts of the programs. To obtain training data, SafeCoder uses an automated pipeline that collects high-quality security fixes from GitHub, covering a broad range of vulnerability types and programming languages.

Across a wide range of LLMs and datasets, SafeCoder improves code security by about 30%, reaching a secure code generation rate of approximately 90%, while preserving utility. SafeCoder is open-sourced, and its code and datasets are available for community use. The work demonstrates that code generation can be made significantly more secure without compromising utility, making SafeCoder a valuable tool for improving the safety of LLMs in programming tasks.
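To make the two masked losses concrete, here is a minimal PyTorch-style sketch. The function names, tensor shapes, and the sec_mask argument are illustrative assumptions for exposition, not the authors' released implementation.

import torch
import torch.nn.functional as F

def masked_lm_loss(logits, target_ids, sec_mask):
    """Language modeling loss on the SECURE program, restricted by the mask.

    logits:     (seq_len, vocab_size) model outputs
    target_ids: (seq_len,) token ids of the secure program
    sec_mask:   (seq_len,) 1.0 on security-critical tokens (e.g. tokens
                changed by the vulnerability fix), 0.0 elsewhere -- an
                assumed encoding of the paper's loss masking
    """
    log_probs = F.log_softmax(logits, dim=-1)
    token_logp = log_probs.gather(-1, target_ids.unsqueeze(-1)).squeeze(-1)
    # Maximize log p(token) only at the security-critical positions.
    return -(sec_mask * token_logp).sum() / sec_mask.sum().clamp(min=1.0)

def masked_unlikelihood_loss(logits, target_ids, sec_mask):
    """Unlikelihood loss on the UNSAFE program: penalize -log(1 - p(token))
    at the security-critical positions, pushing probability mass away from
    the vulnerable tokens."""
    log_probs = F.log_softmax(logits, dim=-1)
    token_logp = log_probs.gather(-1, target_ids.unsqueeze(-1)).squeeze(-1)
    # log(1 - p) computed stably as log1p(-exp(log p)).
    log_one_minus_p = torch.log1p(-token_logp.exp().clamp(max=1 - 1e-6))
    return -(sec_mask * log_one_minus_p).sum() / sec_mask.sum().clamp(min=1.0)

In training, these security losses would be mixed with the standard instruction tuning loss on general-purpose examples, which is how the method jointly optimizes security and utility.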