🤖 AI Summary
This work addresses the problem that existing large language models (LLMs) cannot refine RTL code for both functional correctness and physical performance (PPA) after deployment, because their policies are not updated from EDA feedback. To bridge this gap, the authors propose TTT-RTL, which they describe as the first test-time training loop tailored to individual RTL designs. It uses online reinforcement learning to optimize the model policy with rewards that jointly incorporate syntax checking, simulation-based verification, and PPA metrics. A key component is an adaptive KL-budget controller that stabilizes policy updates under sparse or plateaued rewards, complemented by a PUCT-based mechanism that reuses high-reward design states. Experiments show that TTT-RTL reduces the geometric-mean PPA product by 65.1% on RTLLM v2.0, compared with 26.1% for the strongest published frozen-policy agent baseline, and achieves a 59.4% reduction in area-delay-power (ADP) on the XuanTie C910 FPU leading-zero-anticipation unit.
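To make the headline metric concrete, the following minimal sketch computes a geometric-mean PPA-product reduction over a set of designs. It assumes the PPA product is area × delay × power, as the ADP abbreviation suggests; the function names and the sample numbers are hypothetical and are not taken from the paper.

```python
import math

def ppa_product(area, delay, power):
    # Assumption: the PPA product is the plain area * delay * power product.
    return area * delay * power

def geomean_reduction(reference, optimized):
    """Fractional reduction of the geometric-mean PPA product.

    `reference` and `optimized` are equal-length lists of
    (area, delay, power) tuples, one pair per design.
    """
    ratios = [ppa_product(*o) / ppa_product(*r)
              for r, o in zip(reference, optimized)]
    # Geometric mean of the per-design ratios, computed in log space.
    log_mean = sum(math.log(x) for x in ratios) / len(ratios)
    return 1.0 - math.exp(log_mean)

# Illustrative (made-up) numbers: both designs halve their area.
reference = [(100.0, 2.0, 1.0), (200.0, 1.0, 1.0)]
optimized = [(50.0, 2.0, 1.0), (100.0, 1.0, 1.0)]
reduction = geomean_reduction(reference, optimized)
```

Because the geometric mean is taken over per-design ratios, a single very large design cannot dominate the aggregate the way it would under an arithmetic mean.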
📝 Abstract
Large language models (LLMs) have shown increasing promise in generating
functionally correct register-transfer-level (RTL) hardware designs.
Recent systems improve further through EDA-integrated reinforcement
learning with syntax, simulation, and PPA rewards, but they train a
general RTL generator before deployment, whereas existing test-time
approaches search with a frozen policy. We instead perform reinforcement learning at test time,
allowing the LLM policy to adapt to executable EDA feedback for the
specific RTL problem at hand. We propose TTT-RTL, to our knowledge the
first per-design test-time training framework that closes the loop
between an LLM policy and an EDA pipeline for RTL optimization. TTT-RTL
samples candidate implementations, verifies them through syntax checking
and simulation, scores valid designs using a synthesis-derived PPA product,
reuses high-reward variants through a PUCT-indexed design-state pool,
and updates the policy with an entropic policy-gradient objective. To
stabilize policy updates under sparse or plateaued rewards, we introduce
an adaptive KL-budget controller that adjusts the entropy constraint
using reference KL, effective sample size, and reward saturation signals.
On RTLLM v2.0 under Nangate 45 nm, TTT-RTL reduces the geometric-mean
PPA product by 65.1% relative to the reference, compared with 26.1% for
the strongest published frozen-policy agent baseline. On an industrial
XuanTie C910 FPU leading-zero-anticipation unit under Sky130, TTT-RTL
achieves a 59.4% ADP reduction, and ablations confirm that policy
adaptation, state reuse, and KL-budget control each contribute. These
results suggest that test-time training with executable EDA feedback can
move LLM-based RTL generation beyond functional correctness toward
physically optimized hardware.
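The two mechanisms named in the abstract, PUCT-based reuse of design states and entropic reweighting under a KL budget, can be illustrated with a small sketch. This is not the authors' implementation: it uses the standard PUCT score and a textbook exponential-weighting scheme whose temperature is chosen by bisection so that the KL divergence from uniform weights stays within a budget. The dictionary keys, function names, constants, and the fixed budget are all assumptions; the controller described in the paper additionally adapts the budget from reference KL, effective sample size, and reward saturation signals, which is not modeled here.

```python
import math

def puct_select(pool, c_puct=1.0):
    """Return the index of the pool entry with the highest PUCT score.

    Each entry is a dict with hypothetical keys: 'value' (mean reward),
    'prior' (selection prior), and 'visits' (times expanded).
    """
    total = sum(e["visits"] for e in pool)

    def score(e):
        explore = c_puct * e["prior"] * math.sqrt(total) / (1 + e["visits"])
        return e["value"] + explore

    return max(range(len(pool)), key=lambda i: score(pool[i]))

def entropic_weights(rewards, beta):
    """Softmax of rewards at temperature beta (max-shifted for stability)."""
    m = max(rewards)
    exps = [math.exp((r - m) / beta) for r in rewards]
    z = sum(exps)
    return [x / z for x in exps]

def effective_sample_size(weights):
    return 1.0 / sum(w * w for w in weights)

def kl_to_uniform(weights):
    n = len(weights)
    return sum(w * math.log(n * w) for w in weights if w > 0)

def fit_temperature(rewards, kl_budget, lo=1e-3, hi=1e3, iters=60):
    """Bisect in log space for a beta whose weights respect the KL budget."""
    for _ in range(iters):
        mid = math.sqrt(lo * hi)
        if kl_to_uniform(entropic_weights(rewards, mid)) > kl_budget:
            lo = mid  # weights too peaked: raise the temperature
        else:
            hi = mid  # within budget: try a lower temperature
    return hi

# Hypothetical pool: a well-explored state and an unvisited one.
pool = [
    {"value": 0.5, "prior": 0.5, "visits": 10},
    {"value": 0.4, "prior": 0.5, "visits": 0},
]
chosen = puct_select(pool)

# Sparse rewards: two failed candidates and two valid ones.
rewards = [0.0, 0.0, 1.0, 2.0]
beta = fit_temperature(rewards, kl_budget=0.1)
weights = entropic_weights(rewards, beta)
ess = effective_sample_size(weights)
```

In this sketch a tighter KL budget forces a higher temperature, which keeps the weights closer to uniform and the effective sample size higher, so a few high-reward samples cannot dominate a single update.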