Wrong Code, Right Structure: Learning Netlist Representations from Imperfect LLM-Generated RTL

📅 2026-03-09
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the scarcity of real-world circuit netlist data and the high cost of expert annotations by proposing an end-to-end framework that leverages large language models (LLMs) to generate structurally plausible but functionally imperfect RTL code. These synthetic designs are processed through logic synthesis to construct a large-scale netlist dataset for training graph neural network (GNN)-based representation models, augmented with tailored data augmentation strategies. Notably, this approach is the first to systematically treat LLM-generated RTL, with its inherent functional errors, as a valid supervisory signal, thereby circumventing the bottleneck of requiring high-quality labeled data. Evaluated on tasks such as subcircuit boundary detection and component classification, the model achieves performance on real netlists that matches or even surpasses existing methods reliant on meticulously curated datasets, demonstrating successful scalability from operator-level to IP-level circuits.

๐Ÿ“ Abstract
Learning effective netlist representations is fundamentally constrained by the scarcity of labeled datasets, as real designs are protected by Intellectual Property (IP) and costly to annotate. Existing work therefore focuses on small-scale circuits with clean labels, limiting scalability to realistic designs. Meanwhile, Large Language Models (LLMs) can generate Register-Transfer Level (RTL) code at scale, but their functional incorrectness has hindered their use in circuit analysis. In this work, we make a key observation: even when LLM-generated RTL is functionally imperfect, the synthesized netlists still preserve structural patterns that are strongly indicative of the intended functionality. Building on this insight, we propose a cost-effective data augmentation and training framework that systematically exploits imperfect LLM-generated RTL as training data for netlist representation learning, forming an end-to-end pipeline from automated code generation to downstream tasks. We conduct evaluations on circuit functional understanding tasks, including sub-circuit boundary identification and component classification, across benchmarks of increasing scales, extending the task scope from operator-level to IP-level. The evaluations demonstrate that models trained on our noisy synthetic corpus generalize well to real-world netlists, matching or even surpassing methods trained on scarce high-quality data and effectively breaking the data bottleneck in circuit representation learning.
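The abstract's key observation (a functionally wrong design can still synthesize to a netlist whose structure reveals the intended component) can be illustrated with a toy sketch. This is not the paper's implementation: the gate lists, the histogram features, and the nearest-centroid classifier below are all invented for illustration, standing in for real synthesized netlists and a trained GNN.

```python
# Toy illustration of "wrong code, right structure": a netlist is modeled
# as a bag of gate types, class centroids are fit on "noisy synthetic"
# netlists (as if synthesized from buggy LLM-generated RTL), and a clean
# real netlist is classified by nearest centroid. All data is invented.
from collections import Counter
import math

GATE_TYPES = ["AND", "OR", "XOR", "NOT", "MUX", "DFF"]

def features(netlist):
    """Normalized gate-type histogram: a crude structural signature."""
    counts = Counter(netlist)
    total = sum(counts.values())
    return [counts[g] / total for g in GATE_TYPES]

def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# "Synthetic" netlists from imperfect RTL: a buggy adder (one XOR swapped
# for an OR) keeps an XOR/AND-heavy profile; a mux tree stays MUX-heavy.
synthetic = {
    "adder": [["XOR", "XOR", "AND", "AND", "OR"],
              ["XOR", "AND", "AND", "OR", "XOR", "XOR"]],
    "mux":   [["MUX", "MUX", "MUX", "NOT"],
              ["MUX", "MUX", "NOT", "AND"]],
}

def centroid(samples):
    """Mean feature vector over a list of netlists."""
    vecs = [features(s) for s in samples]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

centroids = {label: centroid(samples) for label, samples in synthetic.items()}

def classify(netlist):
    """Assign the label of the nearest class centroid."""
    f = features(netlist)
    return min(centroids, key=lambda label: distance(f, centroids[label]))

# A "real" (functionally correct) adder netlist is still recognized,
# even though the training samples came from functionally buggy designs.
real_adder = ["XOR", "XOR", "AND", "AND", "OR", "XOR", "AND"]
print(classify(real_adder))  # -> adder
```

The point mirrored here is that the supervisory signal survives functional bugs: the buggy adder samples perturb the structural signature only slightly, so a model fit on them still separates adders from mux trees. The paper's actual pipeline replaces the histogram with GNN embeddings learned over synthesized netlist graphs.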
Problem

Research questions and friction points this paper is trying to address.

netlist representation learning
data scarcity
Intellectual Property
circuit analysis
RTL generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

netlist representation learning
LLM-generated RTL
data augmentation
circuit functional understanding
synthetic training data
Authors

Siyang Cai (CICS, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China)
Cangyuan Li (CICS, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China)
Yinhe Han (CICS, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China)
Ying Wang (Institute of Computing Technology, Chinese Academy of Sciences)
Topics: Reliable Computer Architecture, VLSI design, Machine learning, Memory system