🤖 AI Summary
While tensor computation acceleration in AI SoCs is well developed, tensor manipulation operations, characterized by high data movement and low arithmetic intensity, remain under-optimized.
Method: This paper proposes a near-memory reconfigurable Tensor Manipulation Unit (TMU) that adopts a memory-to-memory dataflow paradigm and a RISC-inspired execution model to support unified coarse- and fine-grained tensor transformations. It introduces the first end-to-end pipelined co-execution between the TMU and a TPU, incorporating double-buffering and output-forwarding mechanisms.
Contribution/Results: Implemented in SMIC 40nm CMOS, the TMU occupies only 0.019 mm² and supports >10 tensor operation types. Compared to ARM Cortex-A72 and NVIDIA Jetson TX2, it reduces latency by 1413× and 8.54×, respectively. When integrated with a TPU, the joint architecture achieves a 34.6% reduction in end-to-end inference latency.
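To make the "RISC-inspired, memory-to-memory" execution model concrete, here is a minimal Python sketch. It is not the paper's implementation, and all field and function names are hypothetical: it only illustrates how a single descriptor with base addresses and per-dimension strides can express many data-movement operators (transpose, slice, pad, and so on) through one unified addressing form, with no arithmetic on the data itself.

```python
# Illustrative sketch of a RISC-style, memory-to-memory tensor-manipulation
# "instruction": one descriptor encodes source/destination addresses plus a
# strided access pattern. Field names are hypothetical, not from the paper.
from dataclasses import dataclass

@dataclass
class TmuInstr:
    src_base: int        # source base address (in elements)
    dst_base: int        # destination base address (in elements)
    shape: tuple         # logical element counts per dimension
    src_strides: tuple   # source strides per dimension
    dst_strides: tuple   # destination strides per dimension

def execute(instr, src_mem, dst_mem):
    """Interpret one descriptor: a pure strided copy loop, no arithmetic."""
    rows, cols = instr.shape
    for i in range(rows):
        for j in range(cols):
            s = instr.src_base + i * instr.src_strides[0] + j * instr.src_strides[1]
            d = instr.dst_base + i * instr.dst_strides[0] + j * instr.dst_strides[1]
            dst_mem[d] = src_mem[s]

# A 2x3 transpose expressed purely through strides: read row-major,
# write with swapped destination strides.
src = [1, 2, 3, 4, 5, 6]
dst = [0] * 6
execute(TmuInstr(0, 0, (2, 3), (3, 1), (1, 2)), src, dst)
print(dst)   # → [1, 4, 2, 5, 3, 6]
```

Changing only the stride fields, not the loop, retargets the same hardware loop to a different operator, which is the sense in which the addressing abstraction is "unified".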
📝 Abstract
While recent advances in AI SoC design have focused heavily on accelerating tensor computation, the equally critical task of tensor manipulation, centered on high-volume data movement with minimal computation, remains underexplored. This work addresses that gap by introducing the Tensor Manipulation Unit (TMU), a reconfigurable, near-memory hardware block designed to efficiently execute data-movement-intensive operators. The TMU manipulates long data streams in a memory-to-memory fashion using a RISC-inspired execution model and a unified addressing abstraction, enabling broad support for both coarse- and fine-grained tensor transformations. Integrated alongside a TPU within a high-throughput AI SoC, the TMU leverages double buffering and output forwarding to improve pipeline utilization. Fabricated in SMIC 40nm technology, the TMU occupies only 0.019 mm² while supporting over 10 representative tensor manipulation operators. Benchmarking shows that the TMU alone achieves up to 1413× and 8.54× operator-level latency reduction compared to the ARM Cortex-A72 and NVIDIA Jetson TX2, respectively. When integrated with the in-house TPU, the complete system achieves a 34.6% reduction in end-to-end inference latency, demonstrating the effectiveness and scalability of reconfigurable tensor manipulation in modern AI SoCs.
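The double-buffering idea behind the TMU/TPU co-execution can be sketched in a few lines of Python. This is an illustrative ping-pong model under assumed operator stand-ins (`tmu_manipulate`, `tpu_compute`, `pipelined_run` are all hypothetical names), not the authors' hardware pipeline: while the compute stage consumes one buffer, the manipulation stage prepares the next tile in the other buffer.

```python
# Minimal ping-pong (double-buffering) sketch: while the "TPU" consumes
# one buffer, the "TMU" fills the other with the next tile.
# All names are illustrative; this is not the paper's implementation.

def tmu_manipulate(tile):
    """Stand-in for a data-movement operator, e.g. transpose of a 2D tile."""
    return [list(row) for row in zip(*tile)]

def tpu_compute(tile):
    """Stand-in for a compute operator, e.g. a per-tile reduction."""
    return sum(sum(row) for row in tile)

def pipelined_run(tiles):
    buffers = [None, None]                   # ping-pong buffer pair
    results = []
    buffers[0] = tmu_manipulate(tiles[0])    # prologue: fill first buffer
    for i in range(len(tiles)):
        cur, nxt = i % 2, (i + 1) % 2
        # In hardware these two steps overlap in time; here the overlap is
        # modeled by filling the *other* buffer before consuming this one.
        if i + 1 < len(tiles):
            buffers[nxt] = tmu_manipulate(tiles[i + 1])
        results.append(tpu_compute(buffers[cur]))
    return results

tiles = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
print(pipelined_run(tiles))   # → [10, 26]
```

Output forwarding would extend this by letting `tpu_compute` begin on partially filled buffers; that detail is omitted here for brevity.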