OriGen: Enhancing RTL Code Generation with Code-to-Code Augmentation and Self-Reflection


23 Jul 2024 | Fan Cui, Chenyang Yin, Kexing Zhou, Youwei Xiao, Guangyu Sun, Qiang Xu, Qipeng Guo, Demin Song, Dahua Lin, Xingcheng Zhang, Yun (Eric) Liang
OriGen is an open-source framework designed to enhance RTL (Register-Transfer Level) code generation through dataset augmentation and self-reflection. The framework introduces a novel code-to-code augmentation method that leverages knowledge distillation to improve the quality of open-source RTL code datasets; the resulting large-scale dataset is what OriGen is trained on, and it improves the model's ability to both generate and correct RTL code. A sketch of what such an augmentation pass might look like follows.
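The paper's augmentation pipeline is not reproduced here; the block below is only a minimal sketch of the idea under stated assumptions: a stronger teacher model rewrites a low-quality open-source RTL sample into cleaner code plus a matching natural-language specification. `AUGMENT_PROMPT`, `call_teacher_llm`, `augment_sample`, and the JSON response convention are all hypothetical, not OriGen's actual API.

```python
import json

# Hypothetical prompt for the teacher model; the double braces escape the
# literal JSON braces so that str.format only fills in {code}.
AUGMENT_PROMPT = """\
You are an expert Verilog engineer. Rewrite the RTL module below with
consistent style, meaningful signal names, and explanatory comments, and
write a short natural-language specification of its behavior. Respond with
a JSON object: {{"instruction": "<specification>", "code": "<verilog>"}}.

### Original code:
{code}
"""

def call_teacher_llm(prompt: str) -> str:
    """Placeholder for a request to a stronger teacher model (e.g., a
    commercial LLM API); returns the model's raw text response."""
    raise NotImplementedError  # assumption: an LLM client is wired in here

def augment_sample(raw_rtl: str) -> dict:
    """Distill one open-source RTL snippet into an (instruction, code) pair."""
    response = call_teacher_llm(AUGMENT_PROMPT.format(code=raw_rtl))
    sample = json.loads(response)
    # A real pipeline would also validate the fields and syntax-check the code.
    return {"instruction": sample["instruction"], "code": sample["code"]}
```

Samples produced this way would form the large-scale training corpus described above.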
OriGen also features a self-reflection mechanism that enables it to autonomously correct syntactic errors using feedback from a compiler. This ability is trained on a carefully constructed dataset in which each sample pairs a natural-language instruction, erroneous code, the resulting compiler error message, and the corrected code; a sketch of the corresponding compile-and-repair loop appears below, after the results summary.

Experimental results show that OriGen significantly outperforms other open-source models in RTL code generation, surpassing the previous best-performing open-source LLM by 9.8% on the VerilogEval-Human benchmark. OriGen also exhibits superior self-reflection and error-rectification capabilities, surpassing GPT-4 by 18.1% on VerilogFixEval, a benchmark introduced in this work specifically to evaluate self-reflection. Across the VerilogEval and RTLLM benchmarks, OriGen outperforms other models in both functional and syntactic correctness, and ablation studies confirm that the code-to-code augmentation method contributes significantly to this performance. Overall, OriGen provides a powerful solution for RTL code generation, offering high-quality, large-scale datasets and robust self-reflection capabilities.
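To make the self-reflection loop concrete, here is a minimal sketch, assuming a `model.generate(prompt) -> str` interface and using Icarus Verilog (`iverilog`) as a stand-in compiler; the retry budget (`MAX_ROUNDS`) and the repair-prompt wording are illustrative assumptions, not the paper's specification.

```python
import os
import subprocess
import tempfile

MAX_ROUNDS = 3  # assumption: a small fixed self-reflection budget

def compile_verilog(code: str) -> str | None:
    """Compile with Icarus Verilog; return the error text on failure, None on success."""
    with tempfile.NamedTemporaryFile("w", suffix=".v", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        result = subprocess.run(
            ["iverilog", "-o", os.devnull, path],  # discard the compiled output
            capture_output=True, text=True,
        )
        return None if result.returncode == 0 else result.stderr
    finally:
        os.remove(path)

def generate_with_reflection(instruction: str, model) -> str:
    """Generate RTL, then repeatedly compile and ask the model to repair
    its own code given the compiler's error message."""
    code = model.generate(instruction)  # assumed interface: generate(str) -> str
    for _ in range(MAX_ROUNDS):
        error = compile_verilog(code)
        if error is None:
            return code  # syntactically correct; stop reflecting
        # The repair prompt mirrors the training quadruple: instruction,
        # erroneous code, and compiler error in; corrected code out.
        code = model.generate(
            f"Instruction:\n{instruction}\n\n"
            f"Erroneous code:\n{code}\n\n"
            f"Compiler error:\n{error}\n\n"
            "Provide the corrected Verilog module."
        )
    return code  # best effort once the budget is exhausted
```

Gating the loop on the compiler's return code means the mechanism targets syntactic errors specifically, which matches what the paper claims self-reflection corrects.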