🤖 AI Summary
This work addresses the limited scale and absence of natural language descriptions in existing open-source Verilog datasets, which hinder the application of large language models (LLMs) to hardware generation. The authors present the largest open-source Verilog dataset to date, comprising over 131,000 modules sourced from GitHub and augmented with samples translated from VHDL and C/C++. Each module is paired with a natural language description generated by the DeepSeek-R1 model. The dataset combines multi-source data, semantic annotations, and fully open licensing, enabling fine-tuning and evaluation of LLM families such as Qwen and Granite at sizes from 7B to 32B parameters. Experimental results demonstrate that purely open-source models trained on this dataset achieve strong performance on hardware design tasks, offering a commercially usable foundational resource for both academia and industry.
📝 Abstract
OpenRTLSet introduces the largest fully open-source dataset for hardware design, offering over 131,000 diverse Verilog code samples to the research community and industry. Our dataset uniquely combines Verilog code from GitHub repositories (102k modules), VHDL translations (5k modules), and synthesizable C/C++ translations (24k modules), all freely accessible without proprietary restrictions. Using the reasoning model DeepSeek-R1, we generated a paired natural language description for each code sample, enabling fine-tuning of various language model families (e.g., Qwen and Granite) for Verilog code generation. We explore several design choices: using Verilator-generated C++ files as additional context during labeling, quantization (INT4 vs. BF16), and model size (7B to 32B parameters). OpenRTLSet demonstrates that open-source approaches can achieve superior performance in hardware design tasks, establishing a new foundation for accessible research and commercial use in this domain.
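To make the pairing of descriptions and code concrete, the following is a minimal sketch of how one description/Verilog pair might be turned into a fine-tuning record. The field names (`source`, `description`, `verilog`, `prompt`, `completion`), the source labels, the prompt wording, and the example module are all assumptions for illustration and are not taken from the OpenRTLSet release; only the per-source counts come from the abstract.

```python
import json
from dataclasses import dataclass

# Approximate per-source module counts quoted in the abstract, in thousands.
# The dictionary keys are hypothetical labels, not names used by the dataset.
SOURCE_COUNTS_K = {
    "github_verilog": 102,
    "vhdl_translation": 5,
    "c_cpp_translation": 24,
}


@dataclass
class RtlSample:
    """One dataset entry: a Verilog module plus its generated description."""
    source: str        # hypothetical source label, one of SOURCE_COUNTS_K
    description: str   # natural language description of the module
    verilog: str       # Verilog source of the module


def to_finetune_record(sample: RtlSample) -> dict:
    """Build a prompt/completion record (an assumed, generic format)."""
    if sample.source not in SOURCE_COUNTS_K:
        raise ValueError(f"unknown source: {sample.source}")
    prompt = (
        "Write a Verilog module that matches this description:\n"
        + sample.description
    )
    return {
        "prompt": prompt,
        "completion": sample.verilog,
        "source": sample.source,
    }


# Illustrative sample written for this sketch, not drawn from the dataset.
example = RtlSample(
    source="github_verilog",
    description="A 2-to-1 multiplexer: y is b when sel is high, otherwise a.",
    verilog=(
        "module mux2(input a, input b, input sel, output y);\n"
        "  assign y = sel ? b : a;\n"
        "endmodule\n"
    ),
)

record = to_finetune_record(example)
total_k = sum(SOURCE_COUNTS_K.values())  # 102 + 5 + 24 = 131 (thousand)

if __name__ == "__main__":
    print(json.dumps(record, indent=2))
    print(f"total modules: ~{total_k}k")
```

Writing one such record per line to a JSONL file is a common input format for fine-tuning toolchains, though the format actually used by the authors is not stated here.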