🤖 AI Summary
While tensor computation acceleration in AI SoCs is well developed, tensor manipulation operations, characterized by high data movement and low arithmetic intensity, remain under-optimized.
Method: This paper proposes a near-memory reconfigurable Tensor Manipulation Unit (TMU) that adopts a memory-to-memory dataflow paradigm and a RISC-inspired execution model to support unified coarse- and fine-grained tensor transformations. It introduces the first end-to-end pipelined co-execution between the TMU and a TPU, incorporating double-buffering and output-forwarding mechanisms.
Contribution/Results: Implemented in SMIC 40nm CMOS, the TMU occupies only 0.019 mm² and supports >10 tensor operation types. Compared to ARM Cortex-A72 and NVIDIA Jetson TX2, it reduces latency by 1413× and 8.54×, respectively. When integrated with a TPU, the joint architecture achieves a 34.6% reduction in end-to-end inference latency.
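To make the "RISC-inspired, memory-to-memory" execution model concrete, here is a minimal Python sketch. It is not the paper's implementation, and all field and function names are hypothetical: it only illustrates how a single descriptor with base addresses and per-dimension strides can express many data-movement operators (transpose, slice, pad, and so on) through one unified addressing form, with no arithmetic on the data itself.

```python
# Illustrative sketch of a RISC-style, memory-to-memory tensor-manipulation
# "instruction": one descriptor encodes source/destination addresses plus a
# strided access pattern. Field names are hypothetical, not from the paper.
from dataclasses import dataclass

@dataclass
class TmuInstr:
    src_base: int        # source base address (in elements)
    dst_base: int        # destination base address (in elements)
    shape: tuple         # logical element counts per dimension
    src_strides: tuple   # source strides per dimension
    dst_strides: tuple   # destination strides per dimension

def execute(instr, src_mem, dst_mem):
    """Interpret one descriptor: a pure strided copy loop, no arithmetic."""
    rows, cols = instr.shape
    for i in range(rows):
        for j in range(cols):
            s = instr.src_base + i * instr.src_strides[0] + j * instr.src_strides[1]
            d = instr.dst_base + i * instr.dst_strides[0] + j * instr.dst_strides[1]
            dst_mem[d] = src_mem[s]

# A 2x3 transpose expressed purely through strides: read row-major,
# write with swapped destination strides.
src = [1, 2, 3, 4, 5, 6]
dst = [0] * 6
execute(TmuInstr(0, 0, (2, 3), (3, 1), (1, 2)), src, dst)
print(dst)   # → [1, 4, 2, 5, 3, 6]
```

Changing only the stride fields, not the loop, retargets the same hardware loop to a different operator, which is the sense in which the addressing abstraction is "unified".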
📝 Abstract
While recent advances in AI SoC design have focused heavily on accelerating tensor computation, the equally critical task of tensor manipulation, centered on high-volume data movement with minimal computation, remains underexplored. This work addresses that gap by introducing the Tensor Manipulation Unit (TMU), a reconfigurable, near-memory hardware block designed to efficiently execute data-movement-intensive operators. The TMU manipulates long data streams in a memory-to-memory fashion using a RISC-inspired execution model and a unified addressing abstraction, enabling broad support for both coarse- and fine-grained tensor transformations. Integrated alongside a TPU within a high-throughput AI SoC, the TMU leverages double buffering and output forwarding to improve pipeline utilization. Fabricated in SMIC 40nm technology, the TMU occupies only 0.019 mm² while supporting over 10 representative tensor manipulation operators. Benchmarking shows that the TMU alone achieves up to 1413× and 8.54× operator-level latency reduction compared to the ARM Cortex-A72 and NVIDIA Jetson TX2, respectively. When integrated with the in-house TPU, the complete system achieves a 34.6% reduction in end-to-end inference latency, demonstrating the effectiveness and scalability of reconfigurable tensor manipulation in modern AI SoCs.
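The double-buffering idea behind the TMU/TPU co-execution can be sketched in a few lines of Python. This is an illustrative ping-pong model under assumed operator stand-ins (`tmu_manipulate`, `tpu_compute`, `pipelined_run` are all hypothetical names), not the authors' hardware pipeline: while the compute stage consumes one buffer, the manipulation stage prepares the next tile in the other buffer.

```python
# Minimal ping-pong (double-buffering) sketch: while the "TPU" consumes
# one buffer, the "TMU" fills the other with the next tile.
# All names are illustrative; this is not the paper's implementation.

def tmu_manipulate(tile):
    """Stand-in for a data-movement operator, e.g. transpose of a 2D tile."""
    return [list(row) for row in zip(*tile)]

def tpu_compute(tile):
    """Stand-in for a compute operator, e.g. a per-tile reduction."""
    return sum(sum(row) for row in tile)

def pipelined_run(tiles):
    buffers = [None, None]                   # ping-pong buffer pair
    results = []
    buffers[0] = tmu_manipulate(tiles[0])    # prologue: fill first buffer
    for i in range(len(tiles)):
        cur, nxt = i % 2, (i + 1) % 2
        # In hardware these two steps overlap in time; here the overlap is
        # modeled by filling the *other* buffer before consuming this one.
        if i + 1 < len(tiles):
            buffers[nxt] = tmu_manipulate(tiles[i + 1])
        results.append(tpu_compute(buffers[cur]))
    return results

tiles = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
print(pipelined_run(tiles))   # → [10, 26]
```

Output forwarding would extend this by letting `tpu_compute` begin on partially filled buffers; that detail is omitted here for brevity.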