🤖 AI Summary
The hardware design domain suffers from a scarcity of high-quality instruction-code pairs, reproducible benchmarks, and robust functional correctness verification mechanisms for LLM-assisted RTL design. Method: This paper introduces the first open-source dataset and benchmarking framework tailored for LLM-powered RTL design. It proposes RTLLM 2.0 (for RTL code generation) and AssertEval (for assertion generation) as dual-task benchmarks, and develops a novel RTL-simulation-based data quality filtering method to curate a 7K-sample dataset of high-confidence, functionally verified examples. Contribution/Results: Through systematic instruction-code pair construction, rigorous data cleaning, targeted model fine-tuning, and comprehensive evaluation, the framework significantly improves functional correctness in LLM-generated RTL. Experiments demonstrate that synergistic optimization of data scale, quality, and training strategy systematically enhances model performance—establishing a reproducible, verifiable infrastructure to advance LLMs in hardware design.
📝 Abstract
The automated generation of design RTL from large language models (LLMs) and natural language instructions has demonstrated great potential in agile circuit design. However, the lack of publicly available datasets and benchmarks hinders the development and fair evaluation of LLM solutions. This paper highlights our latest advances in open datasets and benchmarks from three perspectives: (1) RTLLM 2.0, an updated benchmark assessing LLMs' capability in design RTL generation. The benchmark has been expanded to 50 hand-crafted designs, each providing a design description, test cases, and a correct RTL implementation. (2) AssertEval, an open-source benchmark assessing LLMs' assertion-generation capabilities for RTL verification. The benchmark includes 18 designs, each providing a specification, signal definitions, and correct RTL code. (3) RTLCoder-Data, an extended open-source dataset with 80K instruction-code samples. Moreover, we propose a new verification-based method to check the functional correctness of training data samples. Based on this technique, we further release a dataset of 7K verified high-quality samples. These three studies are integrated into one framework, providing off-the-shelf support for developing and evaluating LLMs for RTL code generation and verification. Finally, extensive experiments indicate that LLM performance can be boosted by enlarging the training dataset, improving data quality, and refining the training scheme.
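The verification-based data filtering described above can be sketched as a simple pass over instruction-code pairs, keeping only samples whose RTL passes its testbench in simulation. This is a minimal sketch, not the paper's actual pipeline: the `passes_simulation` hook is a hypothetical callable (e.g. a wrapper that invokes an RTL simulator such as Icarus Verilog and checks the testbench result), and the sample schema is assumed.

```python
def filter_verified(samples, passes_simulation):
    """Keep only instruction-code pairs whose RTL passes simulation.

    samples: iterable of dicts with 'instruction', 'code', and
        'testbench' keys (assumed schema, for illustration only).
    passes_simulation: callable(code, testbench) -> bool, a hypothetical
        hook around an RTL simulator that compiles the design with its
        testbench and reports whether all checks pass.
    """
    verified = []
    for sample in samples:
        try:
            if passes_simulation(sample["code"], sample["testbench"]):
                verified.append(sample)
        except Exception:
            # Treat simulator failures (syntax errors, timeouts,
            # missing modules) as a failed verification: the sample
            # is dropped rather than kept unverified.
            continue
    return verified


# Usage with a stub simulator hook (stands in for a real simulator call):
samples = [
    {"instruction": "4-bit adder", "code": "good_rtl", "testbench": "tb"},
    {"instruction": "broken fifo", "code": "bad_rtl", "testbench": "tb"},
]
kept = filter_verified(samples, lambda code, tb: code == "good_rtl")
```

Injecting the simulator as a callable keeps the filter itself simulator-agnostic, so the same loop works whether correctness is judged by a commercial tool, an open-source simulator, or a cached verdict.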