Comprehensive Verilog Design Problems: A Next-Generation Benchmark Dataset for Evaluating Large Language Models and Agents on RTL Design and Verification

📅 2025-06-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current large language models (LLMs) exhibit limited capability in RTL hardware design and verification—including code generation, formal/functional verification, debugging, specification alignment, and technical Q&A—and the field lacks a systematic benchmark grounded in real-world engineering practice. Method: We introduce CVDP, the first comprehensive RTL-oriented benchmark, comprising 783 expert-authored tasks spanning 13 industrial scenarios. It features a novel dual-mode task design—non-agent (direct code generation) and agent (interactive, stepwise reasoning)—to emphasize RTL reuse and collaborative verification challenges. We build an automated evaluation pipeline grounded in open-source EDA tools (Yosys/Icarus Verilog), integrating BLEU, LLM-based judging, and pass@1 metrics. Results: Experiments show that state-of-the-art models achieve at most 34% pass@1 on code generation, and performance degrades further on agent-mode tasks, exposing fundamental limitations of LLMs in hardware automation.
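The summary does not define its pass@1 metric. A common choice in code-generation benchmarks is the unbiased pass@k estimator (Chen et al., 2021), of which pass@1 is the k=1 special case; whether CVDP uses exactly this estimator is an assumption here. A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples drawn from n generated solutions (c of which pass the
    testbench) is correct. pass@1 is the k=1 case."""
    if n - c < k:
        # Every size-k draw must contain at least one passing sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 generations per problem, 3 pass the simulation checks.
print(pass_at_k(10, 3, 1))  # 0.3
```

With n = k, the estimator reduces to the plain fraction of passing samples, which is how single-sample pass@1 is typically reported.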


📝 Abstract
We present the Comprehensive Verilog Design Problems (CVDP) benchmark, a new dataset and infrastructure to advance LLM and agent research in hardware design and verification. CVDP includes 783 problems across 13 task categories, covering RTL generation, verification, debugging, specification alignment, and technical Q&A, authored by experienced hardware engineers. Problems are offered in both non-agentic and agentic formats. The benchmark introduces more realistic and challenging contexts than prior work, with state-of-the-art models achieving no more than 34% pass@1 on code generation. Agentic tasks, especially those involving RTL reuse and verification, are particularly difficult. Evaluation uses open-source tools and model scoring infrastructure, with comprehension tasks assessed via BLEU and LLM-based judging. CVDP reveals substantial gaps in current model capabilities, underscoring the need for continued research toward robust, real-world hardware design automation.
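The abstract says comprehension tasks are scored with BLEU. The paper's exact BLEU configuration (n-gram order, smoothing, tokenization) is not given here, so the following is a simplified single-reference sentence-level sketch of the standard formulation: geometric mean of modified n-gram precisions times a brevity penalty.

```python
from collections import Counter
from math import exp, log

def bleu(candidate: list[str], reference: list[str], max_n: int = 4) -> float:
    """Simplified sentence-level BLEU against a single reference:
    geometric mean of modified n-gram precisions (n = 1..max_n)
    multiplied by the brevity penalty. No smoothing is applied."""
    precisions = []
    for n in range(1, max_n + 1):
        cand = Counter(tuple(candidate[i:i + n])
                       for i in range(len(candidate) - n + 1))
        ref = Counter(tuple(reference[i:i + n])
                      for i in range(len(reference) - n + 1))
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(cnt, ref[g]) for g, cnt in cand.items())
        precisions.append(overlap / max(sum(cand.values()), 1))
    if min(precisions) == 0.0:
        return 0.0  # any empty n-gram overlap zeroes the geometric mean
    bp = (1.0 if len(candidate) > len(reference)
          else exp(1 - len(reference) / max(len(candidate), 1)))
    return bp * exp(sum(log(p) for p in precisions) / max_n)

print(bleu("the cat sat on the mat".split(),
           "the cat sat on the mat".split()))  # 1.0
```

Production evaluations typically use a tested implementation (e.g. sacreBLEU) rather than a hand-rolled one; this sketch only illustrates what the score measures.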
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs and agents on realistic RTL design and verification tasks
Measuring model performance across industrial hardware-engineering scenarios
Closing the gap left by prior benchmarks that lack real-world engineering coverage
Innovation

Methods, ideas, or system contributions that make the work stand out.

Comprehensive, expert-authored Verilog benchmark for LLM and agent evaluation
Includes 783 RTL design and verification problems across 13 task categories, in both agentic and non-agentic formats
Automated evaluation via open-source EDA tools (Yosys/Icarus Verilog) plus BLEU and LLM-based judging