Balanced segmentation of CNNs for multi-TPU inference

📅 2024-10-22
🏛️ Journal of Supercomputing
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address load imbalance, on-chip memory bottlenecks, and excessive inter-TPU communication overhead caused by uneven CNN model partitioning in multi-Edge TPU collaborative inference, this paper proposes a computation–communication co-modeling dynamic segmentation algorithm. For the first time, it achieves load-balanced partitioning at the CNN layer granularity, integrating graph partitioning, Edge TPU hardware-aware modeling, and topology-aware communication optimization, complemented by a lightweight runtime scheduler. Evaluated on ResNet-50 and ViT-L, the method achieves >92% TPU utilization, reduces end-to-end latency by 37%, and attains 2.1× higher throughput than baseline approaches. Key contributions are: (1) a fine-grained, dynamic segmentation framework supporting heterogeneous Edge TPU clusters; and (2) a joint optimization paradigm that simultaneously maximizes computational efficiency and communication locality.
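The paper's layer-granularity balanced partitioning is not detailed here; as an illustrative sketch only, one standard way to split a chain of CNN layers into contiguous per-TPU segments that minimize the heaviest segment's profiled cost is the classic linear-partitioning dynamic program (the function name and cost inputs below are hypothetical, not from the paper):

```python
def balanced_segments(costs, k):
    """Split per-layer profiled costs into k contiguous segments,
    minimizing the cost of the heaviest segment (linear partitioning DP).
    Returns (min_max_segment_cost, list of (start, end) layer ranges)."""
    n = len(costs)
    prefix = [0.0] * (n + 1)
    for i, c in enumerate(costs):
        prefix[i + 1] = prefix[i] + c

    INF = float("inf")
    # dp[j][i]: best achievable max-segment cost covering the first
    # i layers with j segments; cut[j][i] records the split point.
    dp = [[INF] * (n + 1) for _ in range(k + 1)]
    cut = [[0] * (n + 1) for _ in range(k + 1)]
    dp[0][0] = 0.0
    for j in range(1, k + 1):
        for i in range(1, n + 1):
            for m in range(j - 1, i):
                cand = max(dp[j - 1][m], prefix[i] - prefix[m])
                if cand < dp[j][i]:
                    dp[j][i] = cand
                    cut[j][i] = m

    # Walk the cut table backwards to recover segment boundaries.
    segments, i = [], n
    for j in range(k, 0, -1):
        m = cut[j][i]
        segments.append((m, i))
        i = m
    segments.reverse()
    return dp[k][n], segments

# Example: 5 layers split across 2 TPUs.
best, segs = balanced_segments([4, 2, 6, 3, 5], 2)
# best == 12.0; segs == [(0, 3), (3, 5)] (layers 0-2 vs. 3-4)
```

The actual algorithm additionally folds inter-TPU communication cost and heterogeneous hardware models into the objective; this sketch captures only the load-balancing core.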

Problem

Research questions and friction points this paper is trying to address.

Optimize CNN segmentation for multi-TPU inference
Balance workload across TPUs to reduce imbalance
Address on-chip memory bottlenecks in Edge TPU systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

Balanced CNN segmentation for multi-TPU inference
Compiler-based pipelined inference implementation
Profiling-based workload balancing across TPUs
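Why balancing matters for pipelined inference: in a linear multi-TPU pipeline, steady-state throughput is bounded by the slowest stage, so uneven segments idle the other TPUs. A minimal back-of-the-envelope model (not from the paper; stage costs are hypothetical):

```python
def pipeline_stats(stage_costs_ms):
    """For a linear pipeline of TPU stages, steady-state throughput is
    limited by the bottleneck stage; single-image latency is the sum of
    stage costs (inter-TPU transfer time ignored for simplicity)."""
    bottleneck = max(stage_costs_ms)
    return {
        "latency_ms": sum(stage_costs_ms),
        "throughput_fps": 1000.0 / bottleneck,
        # Fraction of time each TPU is busy in steady state.
        "utilization": [c / bottleneck for c in stage_costs_ms],
    }

# Example: three well-balanced stages of 10, 12, and 11 ms.
stats = pipeline_stats([10.0, 12.0, 11.0])
# latency_ms == 33.0, throughput_fps ≈ 83.3,
# utilization ≈ [0.83, 1.0, 0.92]
```

Under this model, the paper's reported >92% TPU utilization corresponds to segment costs kept within roughly 8% of the bottleneck stage.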