OriGen: Enhancing RTL Code Generation with Code-to-Code Augmentation and Self-Reflection

23 Jul 2024 | Fan Cui, Chenyang Yin, Kexing Zhou, Youwei Xiao, Guangyu Sun, Qiang Xu, Qipeng Guo, Demin Song, Dahua Lin, Xingcheng Zhang, Yun (Eric) Liang
OriGen is an open-source framework designed to enhance RTL (Register Transfer Level) code generation. It introduces a novel code-to-code augmentation methodology and a self-reflection mechanism to improve the quality and accuracy of generated RTL code. The framework addresses the limitations of existing open-source LLMs in RTL code generation: they avoid the privacy and security concerns associated with commercial models, but they perform poorly due to the scarcity of high-quality RTL datasets.

Key contributions of OriGen include:

1. **Code-to-Code Augmentation**: This methodology leverages knowledge distillation from commercial LLMs such as Claude3-Haiku to enhance the quality of open-source RTL code datasets.
2. **Self-Reflection Mechanism**: OriGen corrects syntactic errors by leveraging compiler feedback, improving the model's ability to generate correct and reliable RTL code (see the sketch below).
3. **Dataset Augmentation**: A comprehensive error-correction dataset is constructed to facilitate the self-reflection process, ensuring the model can handle a broad spectrum of errors.

Experimental results demonstrate that OriGen outperforms other open-source alternatives in RTL code generation, achieving a 9.8% improvement over the previous best-performing LLM on the VerilogEval-Human benchmark. OriGen also surpasses GPT-4 by 18.1% in syntactic correctness on the VerilogFixEval benchmark, demonstrating superior self-reflection and error-rectification capabilities.
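To make the compiler-feedback loop concrete, here is a minimal Python sketch of how self-reflection could be wired up. It is an illustration rather than OriGen's actual implementation: `generate_rtl` is a hypothetical wrapper around the fine-tuned model, Icarus Verilog (`iverilog`) stands in as the syntax checker, and `MAX_REFLECTION_ROUNDS` is an assumed bound on correction attempts.

```python
import os
import subprocess
import tempfile

MAX_REFLECTION_ROUNDS = 3  # assumed cap on correction attempts (not specified in the paper)


def compile_verilog(code: str) -> str | None:
    """Syntax-check Verilog with Icarus Verilog; return the compiler log on failure, None on success."""
    with tempfile.TemporaryDirectory() as tmpdir:
        src = os.path.join(tmpdir, "design.v")
        out = os.path.join(tmpdir, "design.out")
        with open(src, "w") as f:
            f.write(code)
        result = subprocess.run(
            ["iverilog", "-o", out, src],
            capture_output=True,
            text=True,
        )
        return None if result.returncode == 0 else result.stderr


def generate_with_self_reflection(spec: str, generate_rtl) -> str:
    """Generate RTL for `spec`, feeding compiler errors back to the model until it compiles.

    `generate_rtl(prompt)` is a hypothetical callable wrapping the fine-tuned LLM.
    """
    code = generate_rtl(spec)
    for _ in range(MAX_REFLECTION_ROUNDS):
        errors = compile_verilog(code)
        if errors is None:
            break  # syntactically correct; no further reflection needed
        # Self-reflection step: show the model its own code plus the compiler log
        # and ask for a corrected version.
        repair_prompt = (
            f"{spec}\n\nThe following Verilog failed to compile:\n{code}\n\n"
            f"Compiler errors:\n{errors}\n\nProvide a corrected version."
        )
        code = generate_rtl(repair_prompt)
    return code
```

Under these assumptions, the loop terminates either when the generated module compiles cleanly or when the round budget is exhausted, returning the last candidate in either case.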