🤖 AI Summary
This work addresses the challenge that existing neural network pruning methods often fail to strictly adhere to a specified multiply-accumulate (MAC) operations budget during deployment, leading to unpredictable inference latency. To resolve this, we propose the first pruning framework to integrate large language models with multi-agent collaboration. Our approach employs a Profiling Agent, a Master Agent, and an Analysis Agent powered by Claude 3.5 Sonnet, which jointly leverage context-aware policy learning and isomorphic pruning graph grouping to iteratively converge to the target MAC budget within a user-defined tolerance. Evaluated on ImageNet-1K, our method improves accuracy by 0.91% and 1.56% for ResNet-50 and ResNet-101, respectively, achieves a 1.41× GPU speedup with 45% parameter compression for ConvNeXt-Small, and enables Vision Transformers to precisely meet prescribed MAC constraints.
📝 Abstract
Neural network pruning remains essential for deploying deep learning models on resource-constrained devices, yet existing approaches primarily target parameter reduction without directly controlling computational cost. This yields unpredictable inference latency in deployment scenarios where strict Multiply-Accumulate (MAC) operation budgets must be met. We propose AgenticPruner, a framework utilizing large language models to achieve MAC-constrained optimization through iterative strategy learning. Our approach coordinates three specialized agents: a Profiling Agent that analyzes model architecture and MAC distributions, a Master Agent that orchestrates the workflow with divergence monitoring, and an Analysis Agent powered by Claude 3.5 Sonnet that learns optimal strategies from historical attempts. Through in-context learning, the Analysis Agent improves the convergence success rate from 48% (grid search) to 71%. Building upon isomorphic pruning's graph-based structural grouping, our method adds context-aware adaptation by analyzing patterns across pruning iterations, enabling automatic convergence to target MAC budgets within user-defined tolerance bands. We validate our framework on ImageNet-1K across ResNet, ConvNeXt, and DeiT architectures. On CNNs, our approach achieves MAC targeting while maintaining or improving accuracy: ResNet-50 reaches 1.77G MACs with 77.04% accuracy (+0.91% vs. baseline); ResNet-101 achieves 4.22G MACs with 78.94% accuracy (+1.56% vs. baseline). For ConvNeXt-Small, pruning to 8.17G MACs yields 1.41× GPU and 1.07× CPU speedups with 45% parameter reduction. On Vision Transformers, we demonstrate MAC-budget compliance within user-defined tolerance bands (typically +1% to +5% overshoot, -5% to -15% undershoot), establishing feasibility for deployment scenarios requiring strict computational guarantees.
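To make the tolerance-band mechanic concrete, here is a minimal sketch of iterating a pruning ratio until the estimated MAC count lands inside a user-defined band around the target. This is a simplified stand-in, not the paper's actual implementation: `estimate_macs` is a toy quadratic MAC model, and the bisection search replaces the LLM-driven strategy learning described above; the function names and the ResNet-50-like numbers used in the example are illustrative assumptions.

```python
def estimate_macs(ratio, base_macs):
    """Toy MAC model (assumption): MACs shrink roughly quadratically with
    the channel pruning ratio, since both input and output channels shrink."""
    keep = 1.0 - ratio
    return base_macs * keep * keep

def search_ratio(base_macs, target_macs, overshoot=0.05, undershoot=0.15,
                 max_iters=50):
    """Bisect on a global pruning ratio until the estimated MACs fall within
    [-undershoot, +overshoot] of the target, mirroring the tolerance bands
    described in the abstract (here: +5% overshoot, -15% undershoot)."""
    lo, hi = 0.0, 1.0
    ratio, macs = 0.5, base_macs
    for _ in range(max_iters):
        ratio = (lo + hi) / 2
        macs = estimate_macs(ratio, base_macs)
        rel = (macs - target_macs) / target_macs
        if -undershoot <= rel <= overshoot:
            break                 # inside the tolerance band: accept
        if macs > target_macs:
            lo = ratio            # still too expensive: prune more
        else:
            hi = ratio            # pruned too far: prune less
    return ratio, macs

# Illustrative numbers only: an unpruned ~4.1G-MAC model, 1.77G-MAC target.
ratio, macs = search_ratio(base_macs=4.1e9, target_macs=1.77e9)
```

In the full framework, this inner loop would be replaced by agent-proposed per-group ratios rather than a single scalar, but the stopping criterion, being inside the tolerance band, is the same.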