RTL-Repo is a benchmark for evaluating how well Large Language Models (LLMs) generate Verilog code within large-scale RTL design projects. It contains more than 4,000 Verilog code samples extracted from public GitHub repositories, and each sample is paired with the full context of its source repository. The dataset is split into a training set of 2,924 samples and a test set of 1,174 samples. The benchmark is used to evaluate several state-of-the-art models, including GPT-4, GPT-3.5, Starcoder2, VeriGen, and RTLCoder, on generating Verilog code for complex projects. It gives the hardware design community a common resource for assessing and comparing LLMs in realistic RTL design scenarios and for training models specifically for Verilog generation in complex, multi-file RTL projects. RTL-Repo is open source and publicly available on GitHub.
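The samples can be inspected programmatically. The sketch below is a minimal example assuming the dataset is published on the Hugging Face Hub and loadable with the `datasets` library; the repository identifier and the split sizes shown in the comments are assumptions based on the description above, so consult the RTL-Repo GitHub page for the authoritative location and schema.

```python
# Minimal sketch of inspecting RTL-Repo with the Hugging Face `datasets` library.
# The Hub identifier and field layout are illustrative assumptions, not confirmed
# by the RTL-Repo project itself.
from datasets import load_dataset

dataset = load_dataset("ahmedallam/RTL-Repo")  # assumed Hub identifier

# Expected splits per the benchmark description: train (~2,924) and test (~1,174).
print(dataset)

# Each sample should carry the target Verilog snippet plus its repository context.
sample = dataset["test"][0]
print(sample.keys())
```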
Models are evaluated with two metrics: Exact Match (EM) and Edit Similarity (ES). The reported results show that GPT-4 significantly outperforms all other evaluated models on both metrics, and they highlight how much current models struggle with the long-range dependencies and multi-file context involved in Verilog code generation. Unlike existing benchmarks such as RTLLM and VerilogEval, which target single Verilog files that are typically small and do not interact with other components, RTL-Repo measures a model's ability to generate Verilog within multi-file, large-scale codebases, giving a more accurate and realistic quantitative picture of performance in real-world RTL design scenarios. The benchmark is a living dataset that can easily be extended with samples from new repositories, and it also serves as a foundation for future work on fine-tuning open-source models and developing LLMs that can handle large RTL codebases effectively.
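For illustration, the sketch below implements the two metrics under definitions commonly used by repository-level code-completion benchmarks: EM as a string comparison of the normalized predicted and reference lines, and ES as a normalized Levenshtein similarity scaled to 0-100. The exact normalization RTL-Repo applies is not specified here, so treat this as an approximation rather than the benchmark's own scoring code.

```python
# Hedged sketch of Exact Match (EM) and Edit Similarity (ES) as commonly defined
# for next-line code completion; RTL-Repo's exact normalization may differ.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein (edit) distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,         # deletion
                            curr[j - 1] + 1,     # insertion
                            prev[j - 1] + cost)) # substitution
        prev = curr
    return prev[-1]

def exact_match(pred: str, ref: str) -> bool:
    """EM: prediction equals the reference after whitespace stripping."""
    return pred.strip() == ref.strip()

def edit_similarity(pred: str, ref: str) -> float:
    """ES: 100 * (1 - edit_distance / length of the longer string)."""
    pred, ref = pred.strip(), ref.strip()
    if not pred and not ref:
        return 100.0
    return 100.0 * (1 - levenshtein(pred, ref) / max(len(pred), len(ref)))

# Example: a prediction that is close to, but not identical with, the reference.
ref = "assign out = (sel) ? a : b;"
pred = "assign out = sel ? a : b;"
print(exact_match(pred, ref))                 # False
print(round(edit_similarity(pred, ref), 1))   # high, but below 100
```

The example shows why ES is reported alongside EM: a prediction that differs only by minor formatting scores zero on EM yet still receives partial credit under ES.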