Tensor Manipulation Unit (TMU): Reconfigurable, Near-Memory Tensor Manipulation for High-Throughput AI SoC

📅 2025-06-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
While tensor computation acceleration in AI SoCs is well-developed, tensor manipulation operations, characterized by high data movement and low arithmetic intensity, remain under-optimized. Method: This paper proposes a near-memory reconfigurable Tensor Manipulation Unit (TMU), adopting a memory-to-memory dataflow paradigm and a RISC-inspired execution model to support unified coarse- and fine-grained tensor transformations. It introduces the first end-to-end pipelined co-execution between TMU and TPU, incorporating dual-buffering and output-forwarding mechanisms. Contribution/Results: Implemented in SMIC 40nm CMOS, the TMU occupies only 0.019 mm² and supports >10 tensor operation types. Compared to the ARM Cortex-A72 and NVIDIA Jetson TX2, it reduces latency by 1413× and 8.54×, respectively. When integrated with a TPU, the joint architecture achieves a 34.6% reduction in end-to-end inference latency.

📝 Abstract
While recent advances in AI SoC design have focused heavily on accelerating tensor computation, the equally critical task of tensor manipulation, centered on high-volume data movement with minimal computation, remains underexplored. This work addresses that gap by introducing the Tensor Manipulation Unit (TMU), a reconfigurable, near-memory hardware block designed to efficiently execute data-movement-intensive operators. The TMU manipulates long data streams in a memory-to-memory fashion using a RISC-inspired execution model and a unified addressing abstraction, enabling broad support for both coarse- and fine-grained tensor transformations. Integrated alongside a TPU within a high-throughput AI SoC, the TMU leverages double buffering and output forwarding to improve pipeline utilization. Fabricated in SMIC 40nm technology, the TMU occupies only 0.019 mm² while supporting over 10 representative tensor manipulation operators. Benchmarking shows that the TMU alone achieves up to 1413× and 8.54× operator-level latency reduction compared to the ARM Cortex-A72 and NVIDIA Jetson TX2, respectively. When integrated with the in-house TPU, the complete system achieves a 34.6% reduction in end-to-end inference latency, demonstrating the effectiveness and scalability of reconfigurable tensor manipulation in modern AI SoCs.
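To make the abstract's point concrete, here is a minimal sketch (not the paper's implementation) of why tensor manipulation operators are data-movement bound: a transpose reads every element once and writes it once, with arithmetic only on addresses, never on values. The paper's unified addressing abstraction reduces such operators to a source-index-to-destination-index mapping; the flat-buffer layout below is an illustrative assumption.

```python
# Hedged sketch: transpose as pure data movement over a row-major flat buffer.
# All "work" is address arithmetic (r * cols + c -> c * rows + r);
# the element values themselves are only copied, never computed on.

def transpose_2d(src, rows, cols):
    """Transpose a row-major flat buffer by remapping addresses only."""
    dst = [0] * (rows * cols)
    for r in range(rows):
        for c in range(cols):
            # destination address derived from source address; no value math
            dst[c * rows + r] = src[r * cols + c]
    return dst

buf = [1, 2, 3, 4, 5, 6]        # a 2x3 tensor stored row-major
print(transpose_2d(buf, 2, 3))  # -> [1, 4, 2, 5, 3, 6]
```

With zero arithmetic intensity, throughput for such operators is set entirely by memory bandwidth, which is why a near-memory, memory-to-memory unit can outperform a general-purpose core here.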
Problem

Research questions and friction points this paper is trying to address.

Addresses underexplored tensor manipulation in AI SoCs
Introduces reconfigurable near-memory hardware for data movement
Improves efficiency of tensor transformations in AI systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reconfigurable near-memory tensor manipulation hardware
RISC-inspired execution model for data movement
Double buffering and output forwarding optimization
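The double-buffering idea above can be sketched with a simple latency model (illustrative numbers, not from the paper): while the TPU computes on one buffer, the TMU fills the other, so after the first tile each step costs only the slower of the two units.

```python
# Hedged sketch of double-buffered TMU/TPU co-execution.
# "move" models TMU data manipulation time per tile, "compute" models
# TPU time per tile; units and values are illustrative assumptions.

def serial_latency(tiles, move, compute):
    # No overlap: every tile is first moved, then computed.
    return tiles * (move + compute)

def double_buffered_latency(tiles, move, compute):
    # The first tile's move cannot be hidden; thereafter the TMU moves
    # tile i+1 while the TPU computes tile i, so each middle step costs
    # max(move, compute). The last tile's compute drains the pipeline.
    return move + (tiles - 1) * max(move, compute) + compute

print(serial_latency(4, 3, 5))           # -> 32
print(double_buffered_latency(4, 3, 5))  # -> 23
```

When move <= compute, data movement is fully hidden behind compute for all but the first tile, which is the pipeline-utilization gain the paper attributes to dual buffering and output forwarding.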
Weiyu Zhou
Faculty of Science and Technology, University of Macau, Macau, China
Zheng Wang
Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
Chao Chen
Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
Yike Li
Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China; School of Electrical and Electronic Engineering, University College Dublin, Dublin, Ireland
Yongkui Yang
Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences
VLSI, Mixed-signal IC, Computer Architecture
Zhuoyu Wu
Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
Anupam Chattopadhyay
Associate Professor, CCDS, NTU, Singapore
EDA, CPS Security, AI Security, Quantum Computing, Post-Quantum Cryptography