A New Benchmark for the Appropriate Evaluation of RTL Code Optimization

📅 2026-01-05
🏛️ arXiv.org
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work proposes RTL-OPT, a new benchmark addressing the limitations of existing evaluations that primarily focus on syntactic correctness of RTL code while inadequately assessing power, performance, and area (PPA) optimization quality. RTL-OPT comprises 36 handcrafted digital circuit tasks, each paired with unoptimized and expert-optimized RTL implementations. The benchmark introduces an end-to-end automated evaluation pipeline that integrates formal functional equivalence checking with quantitative PPA analysis. It enables the first systematic assessment of large language models' capabilities in RTL optimization, incorporating optimization patterns reflective of industrial practice and capturing dimensions often overlooked by conventional synthesis tools. RTL-OPT thus provides a standardized, quantifiable platform for evaluating LLM-driven hardware design optimization.

πŸ“ Abstract
The rapid progress of artificial intelligence increasingly relies on efficient integrated circuit (IC) design. Recent studies have explored the use of large language models (LLMs) for generating Register Transfer Level (RTL) code, but existing benchmarks mainly evaluate syntactic correctness rather than optimization quality in terms of power, performance, and area (PPA). This work introduces RTL-OPT, a benchmark for assessing the capability of LLMs in RTL optimization. RTL-OPT contains 36 handcrafted digital designs spanning diverse implementation categories, including combinational logic, pipelined datapaths, finite state machines, and memory interfaces. Each task provides a pair of RTL implementations: a suboptimal version and a human-optimized reference that reflects industry-proven optimization patterns not captured by conventional synthesis tools. Furthermore, RTL-OPT integrates an automated evaluation framework to verify functional correctness and quantify PPA improvements, enabling standardized and meaningful assessment of generative models for hardware design optimization.
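To make the two-stage evaluation concrete, the sketch below emulates the shape of such a pipeline in Python: verify that an "optimized" design is functionally equivalent to the baseline, then compute relative PPA improvements. All names and numbers here are illustrative assumptions, not the paper's actual API or data; a real flow would invoke a formal equivalence checker and a synthesis tool rather than exhaustive simulation and hard-coded metrics.

```python
# Hypothetical sketch of an RTL-OPT-style evaluation step (illustrative only).
# Task: multiply an 8-bit value by the constant 10, truncated to 8 bits.

def baseline_mult10(x):
    # Suboptimal reference: a full constant multiply.
    return (x * 10) & 0xFF

def optimized_mult10(x):
    # Expert-style strength reduction: 10*x = 8*x + 2*x (two shifts and an add).
    return ((x << 3) + (x << 1)) & 0xFF

def equivalent(f, g, width=8):
    # Exhaustive "equivalence check" over the full input space; feasible only
    # for tiny widths, where a real pipeline would use formal methods.
    return all(f(x) == g(x) for x in range(2 ** width))

def ppa_improvement(baseline, optimized):
    # Relative improvement per metric; positive means the optimized design wins.
    return {k: (baseline[k] - optimized[k]) / baseline[k] for k in baseline}

# Made-up PPA figures, standing in for a synthesis tool's report.
baseline_ppa  = {"area_um2": 412.0, "power_mw": 1.9, "delay_ns": 2.4}
optimized_ppa = {"area_um2": 268.0, "power_mw": 1.2, "delay_ns": 1.8}

if equivalent(baseline_mult10, optimized_mult10):
    print(ppa_improvement(baseline_ppa, optimized_ppa))
```

The key design point the benchmark automates is the ordering: PPA numbers are only meaningful once functional equivalence has been established, since a broken design can trivially "improve" every metric.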
Problem

Research questions and friction points this paper is trying to address.

RTL optimization
large language models
PPA evaluation
hardware design
benchmark
Innovation

Methods, ideas, or system contributions that make the work stand out.

RTL optimization
large language models
hardware design benchmark
PPA evaluation
automated verification