RTL-Repo is a benchmark for evaluating how well Large Language Models (LLMs) generate Verilog code within large-scale RTL design projects. It contains more than 4,000 Verilog code samples extracted from public GitHub repositories, and each sample is paired with the full context of its source repository. The dataset is split into a training set of 2,924 samples and a test set of 1,174 samples. The benchmark is used to evaluate several state-of-the-art models, including GPT-4, GPT-3.5, Starcoder2, VeriGen, and RTLCoder, on generating Verilog code for complex projects. It gives the hardware design community a common resource for assessing and comparing LLMs in realistic RTL design scenarios and for training models specifically for Verilog generation in complex, multi-file RTL projects. RTL-Repo is open source and publicly available on GitHub.
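The samples can be inspected programmatically. The sketch below is a minimal example assuming the dataset is published on the Hugging Face Hub and loadable with the `datasets` library; the repository identifier and the split sizes shown in the comments are assumptions based on the description above, so consult the RTL-Repo GitHub page for the authoritative location and schema.

```python
# Minimal sketch of inspecting RTL-Repo with the Hugging Face `datasets` library.
# The Hub identifier and field layout are illustrative assumptions, not confirmed
# by the RTL-Repo project itself.
from datasets import load_dataset

dataset = load_dataset("ahmedallam/RTL-Repo")  # assumed Hub identifier

# Expected splits per the benchmark description: train (~2,924) and test (~1,174).
print(dataset)

# Each sample should carry the target Verilog snippet plus its repository context.
sample = dataset["test"][0]
print(sample.keys())
```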
Models are evaluated with two metrics: Exact Match (EM) and Edit Similarity (ES). The reported results show that GPT-4 significantly outperforms all other evaluated models on both metrics, and they highlight how much current models struggle with the long-range dependencies and multi-file context involved in Verilog code generation. Unlike existing benchmarks such as RTLLM and VerilogEval, which target single Verilog files that are typically small and do not interact with other components, RTL-Repo measures a model's ability to generate Verilog within multi-file, large-scale codebases, giving a more accurate and realistic quantitative picture of performance in real-world RTL design scenarios. The benchmark is a living dataset that can easily be extended with samples from new repositories, and it also serves as a foundation for future work on fine-tuning open-source models and developing LLMs that can handle large RTL codebases effectively.
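For illustration, the sketch below implements the two metrics under definitions commonly used by repository-level code-completion benchmarks: EM as a string comparison of the normalized predicted and reference lines, and ES as a normalized Levenshtein similarity scaled to 0-100. The exact normalization RTL-Repo applies is not specified here, so treat this as an approximation rather than the benchmark's own scoring code.

```python
# Hedged sketch of Exact Match (EM) and Edit Similarity (ES) as commonly defined
# for next-line code completion; RTL-Repo's exact normalization may differ.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein (edit) distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,         # deletion
                            curr[j - 1] + 1,     # insertion
                            prev[j - 1] + cost)) # substitution
        prev = curr
    return prev[-1]

def exact_match(pred: str, ref: str) -> bool:
    """EM: prediction equals the reference after whitespace stripping."""
    return pred.strip() == ref.strip()

def edit_similarity(pred: str, ref: str) -> float:
    """ES: 100 * (1 - edit_distance / length of the longer string)."""
    pred, ref = pred.strip(), ref.strip()
    if not pred and not ref:
        return 100.0
    return 100.0 * (1 - levenshtein(pred, ref) / max(len(pred), len(ref)))

# Example: a prediction that is close to, but not identical with, the reference.
ref = "assign out = (sel) ? a : b;"
pred = "assign out = sel ? a : b;"
print(exact_match(pred, ref))                 # False
print(round(edit_similarity(pred, ref), 1))   # high, but below 100
```

The example shows why ES is reported alongside EM: a prediction that differs only by minor formatting scores zero on EM yet still receives partial credit under ES.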