Balanced segmentation of CNNs for multi-TPU inference

📅 2024-10-22
🏛️ Journal of Supercomputing
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address load imbalance, on-chip memory bottlenecks, and excessive inter-TPU communication overhead caused by uneven CNN model partitioning in multi-Edge TPU collaborative inference, this paper proposes a computation–communication co-modeling dynamic segmentation algorithm. For the first time, it achieves load-balanced partitioning at the CNN layer granularity, integrating graph partitioning, Edge TPU hardware-aware modeling, and topology-aware communication optimization, complemented by a lightweight runtime scheduler. Evaluated on ResNet-50 and ViT-L, the method achieves >92% TPU utilization, reduces end-to-end latency by 37%, and attains 2.1× higher throughput than baseline approaches. Key contributions are: (1) a fine-grained, dynamic segmentation framework supporting heterogeneous Edge TPU clusters; and (2) a joint optimization paradigm that simultaneously maximizes computational efficiency and communication locality.
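The paper's layer-granularity balanced partitioning is not detailed here; as an illustrative sketch only, one standard way to split a chain of CNN layers into contiguous per-TPU segments that minimize the heaviest segment's profiled cost is the classic linear-partitioning dynamic program (the function name and cost inputs below are hypothetical, not from the paper):

```python
def balanced_segments(costs, k):
    """Split per-layer profiled costs into k contiguous segments,
    minimizing the cost of the heaviest segment (linear partitioning DP).
    Returns (min_max_segment_cost, list of (start, end) layer ranges)."""
    n = len(costs)
    prefix = [0.0] * (n + 1)
    for i, c in enumerate(costs):
        prefix[i + 1] = prefix[i] + c

    INF = float("inf")
    # dp[j][i]: best achievable max-segment cost covering the first
    # i layers with j segments; cut[j][i] records the split point.
    dp = [[INF] * (n + 1) for _ in range(k + 1)]
    cut = [[0] * (n + 1) for _ in range(k + 1)]
    dp[0][0] = 0.0
    for j in range(1, k + 1):
        for i in range(1, n + 1):
            for m in range(j - 1, i):
                cand = max(dp[j - 1][m], prefix[i] - prefix[m])
                if cand < dp[j][i]:
                    dp[j][i] = cand
                    cut[j][i] = m

    # Walk the cut table backwards to recover segment boundaries.
    segments, i = [], n
    for j in range(k, 0, -1):
        m = cut[j][i]
        segments.append((m, i))
        i = m
    segments.reverse()
    return dp[k][n], segments

# Example: 5 layers split across 2 TPUs.
best, segs = balanced_segments([4, 2, 6, 3, 5], 2)
# best == 12.0; segs == [(0, 3), (3, 5)] (layers 0-2 vs. 3-4)
```

The actual algorithm additionally folds inter-TPU communication cost and heterogeneous hardware models into the objective; this sketch captures only the load-balancing core.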

Problem

Research questions and friction points this paper is trying to address.

Optimize CNN segmentation for multi-TPU inference
Balance workload across TPUs to reduce imbalance
Address on-chip memory bottlenecks in Edge TPU systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

Balanced CNN segmentation for multi-TPU inference
Compiler-based pipelined inference implementation
Profiling-based workload balancing across TPUs
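Why balancing matters for pipelined inference: in a linear multi-TPU pipeline, steady-state throughput is bounded by the slowest stage, so uneven segments idle the other TPUs. A minimal back-of-the-envelope model (not from the paper; stage costs are hypothetical):

```python
def pipeline_stats(stage_costs_ms):
    """For a linear pipeline of TPU stages, steady-state throughput is
    limited by the bottleneck stage; single-image latency is the sum of
    stage costs (inter-TPU transfer time ignored for simplicity)."""
    bottleneck = max(stage_costs_ms)
    return {
        "latency_ms": sum(stage_costs_ms),
        "throughput_fps": 1000.0 / bottleneck,
        # Fraction of time each TPU is busy in steady state.
        "utilization": [c / bottleneck for c in stage_costs_ms],
    }

# Example: three well-balanced stages of 10, 12, and 11 ms.
stats = pipeline_stats([10.0, 12.0, 11.0])
# latency_ms == 33.0, throughput_fps ≈ 83.3,
# utilization ≈ [0.83, 1.0, 0.92]
```

Under this model, the paper's reported >92% TPU utilization corresponds to segment costs kept within roughly 8% of the bottleneck stage.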