ArchXBench: A Complex Digital Systems Benchmark Suite for LLM-Driven RTL Synthesis

📅 2025-08-08
🤖 AI Summary
Large language models (LLMs) lack rigorous evaluation in generating RTL for complex digital circuits, particularly deeply pipelined and domain-specific accelerator designs. Method: This paper introduces a six-level hierarchical benchmark suite targeting such architectures, covering representative SoC datapath scenarios, including cryptography, image processing, and machine learning, with formal problem specifications, design constraints, and executable testbenches to enable research on zero-shot, structured RTL synthesis. Contribution/Results: The authors systematically evaluate Claude Sonnet 4, GPT-4.1, o4-mini-high, and DeepSeek R1 under the pass@5 metric. o4-mini-high achieves the best result, solving 16 of 30 benchmarks (53.3%) on Levels 1–3, but all models fail entirely on Levels 4–6. This exposes fundamental LLM limitations in hierarchical module integration and multi-cycle pipeline control, bottlenecks critical to industrial-strength RTL generation that prior benchmarks did not characterize.

📝 Abstract
Modern SoC datapaths include deeply pipelined, domain-specific accelerators, but their RTL implementation and verification are still mostly done by hand. While large language models (LLMs) exhibit advanced code-generation abilities for programming languages like Python, their application to Verilog-like RTL remains in its nascent stage. This is reflected in the simple arithmetic and control circuits currently used to evaluate generative capabilities in existing benchmarks. In this paper, we introduce ArchXBench, a six-level benchmark suite that encompasses complex arithmetic circuits and other advanced digital subsystems drawn from domains such as cryptography, image processing, machine learning, and signal processing. Architecturally, some of these designs are purely combinational, others are multi-cycle or pipelined, and many require hierarchical composition of modules. For each benchmark, we provide a problem description, design specification, and testbench, enabling rapid research in the area of LLM-driven agentic approaches for complex digital systems design. Using zero-shot prompting with Claude Sonnet 4, GPT-4.1, o4-mini-high, and DeepSeek R1 under a pass@5 criterion, we observed that o4-mini-high successfully solves the largest number of benchmarks, 16 out of 30, spanning Levels 1, 2, and 3. From Level 4 onward, however, all models consistently fail, highlighting a clear gap in the capabilities of current state-of-the-art LLMs and prompting/agentic approaches.
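The pass@5 criterion used above follows the standard pass@k methodology: a benchmark counts as solved if at least one of k generated RTL samples passes its testbench. A common way to report this is the unbiased pass@k estimator from code-generation benchmarks such as HumanEval; the sketch below illustrates that estimator, not the paper's exact evaluation harness, and the suite-level aggregation shown in the comment is an assumption about how the 16/30 figure is computed.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn without replacement from n total with c correct, passes."""
    if n - c < k:
        # Fewer failing samples than k draws: success is guaranteed.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With n = k = 5 (one attempt batch per benchmark), a benchmark is
# simply "solved" if any sample is correct:
print(pass_at_k(5, 0, 5))  # 0.0 -- no correct sample
print(pass_at_k(5, 2, 5))  # 1.0 -- at least one correct sample

# Suite-level score (hypothetical aggregation): fraction of
# benchmarks with at least one passing sample, e.g. 16/30.
solved = [pass_at_k(5, c, 5) for c in (1, 0, 3, 0)]
print(sum(solved) / len(solved))  # 0.5
```

With n equal to k, the estimator degenerates to a pass/fail indicator per benchmark; larger n with k < n gives a lower-variance estimate of the same quantity.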
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs' ability to generate complex RTL designs
Addressing the lack of benchmarks for advanced digital subsystems
Assessing LLM performance across multi-level RTL synthesis tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

ArchXBench: a six-level benchmark suite for complex RTL
Zero-shot prompting evaluation across multiple LLMs
Benchmarks requiring hierarchical module composition