ProfilingAgent: Profiling-Guided Agentic Reasoning for Adaptive Model Optimization

πŸ“… 2025-09-06
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Foundation models face computational and memory bottlenecks that hinder their deployment on resource-constrained platforms, and existing compression methods rely on oversimplified, uniform heuristics that ignore architectural and runtime heterogeneity. This paper proposes a profiling-driven multi-agent inference framework that jointly leverages static structural metrics and dynamic runtime signals, employing large language models to make hierarchical, architecture-aware pruning and quantization decisions. The method integrates structured pruning with post-training dynamic quantization, optimizing simultaneously for MACs, latency, memory footprint, and accuracy. Evaluated across multiple models and datasets, it achieves up to 74% memory reduction, a 1.74× inference speedup, and only about 1% top-1 accuracy degradation on ImageNet-1K, while in certain scenarios yielding up to a 2% accuracy improvement.
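The core idea, ranking layers by their profiled cost and tailoring per-layer compression accordingly, can be sketched in a few lines. The metric names, scoring weights, and the linear pruning-ratio schedule below are illustrative assumptions, not the paper's actual agent policy:

```python
# Hypothetical sketch: turn per-layer profiling metrics into pruning ratios.
# The scoring weights and ratio schedule are illustrative assumptions,
# not the LLM-agent logic described in the paper.

def score_layer(metrics, w_latency=0.5, w_macs=0.3, w_memory=0.2):
    """Combine dynamic (latency, memory) and static (MACs) signals into one score."""
    return (w_latency * metrics["latency"]
            + w_macs * metrics["macs"]
            + w_memory * metrics["memory"])

def assign_pruning_ratios(profile, max_ratio=0.5):
    """Give the highest-scoring (most expensive) layers the largest pruning ratio."""
    ranked = sorted(profile, key=lambda name: score_layer(profile[name]), reverse=True)
    n = len(ranked)
    # Linearly decay the ratio from max_ratio down to 0 across the ranking.
    return {name: max_ratio * (n - 1 - i) / (n - 1) for i, name in enumerate(ranked)}

# Toy profile with metrics already normalized to [0, 1] across layers.
profile = {
    "layer1": {"latency": 0.9, "macs": 0.8, "memory": 0.7},
    "layer2": {"latency": 0.4, "macs": 0.5, "memory": 0.3},
    "layer3": {"latency": 0.1, "macs": 0.2, "memory": 0.1},
}
ratios = assign_pruning_ratios(profile)
# The costliest layer gets the full max_ratio; the cheapest is left unpruned.
```

In the paper the analogous decision is made iteratively by LLM agents reasoning over the profile, rather than by a fixed scoring formula.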

πŸ“ Abstract
Foundation models face growing compute and memory bottlenecks, hindering deployment on resource-limited platforms. While compression techniques such as pruning and quantization are widely used, most rely on uniform heuristics that ignore architectural and runtime heterogeneity. Profiling tools expose per-layer latency, memory, and compute cost, yet are rarely integrated into automated pipelines. We propose ProfilingAgent, a profiling-guided, agentic approach that uses large language models (LLMs) to automate compression via structured pruning and post-training dynamic quantization. Our modular multi-agent system reasons over static metrics (MACs, parameter counts) and dynamic signals (latency, memory) to design architecture-specific strategies. Unlike heuristic baselines, ProfilingAgent tailors layer-wise decisions to bottlenecks. Experiments on ImageNet-1K, CIFAR-10, and CIFAR-100 with ResNet-101, ViT-B/16, Swin-B, and DeiT-B/16 show pruning maintains competitive or improved accuracy (about 1% drop on ImageNet-1K, +2% gains for ViT-B/16 on smaller datasets), while quantization achieves up to 74% memory savings with <0.5% accuracy loss. Our quantization also yields consistent inference speedups of up to 1.74 times faster. Comparative studies with GPT-4o and GPT-4-Turbo highlight the importance of LLM reasoning quality for iterative pruning. These results establish agentic systems as scalable solutions for profiling-guided model optimization.
Problem

Research questions and friction points this paper is trying to address.

Optimizing foundation models for resource-limited platforms
Addressing compute and memory bottlenecks via adaptive compression
Automating pruning and quantization with profiling-guided LLM agents
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-driven automated compression via pruning and quantization
Multi-agent system using static and dynamic profiling metrics
Layer-wise tailored strategies for architecture-specific optimization
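The quantization half of the recipe, post-training dynamic quantization, stores weights as int8 and rescales them at inference time. A minimal per-tensor, symmetric sketch of the underlying arithmetic (plain Python; real pipelines would use a framework quantizer such as PyTorch's `quantize_dynamic`):

```python
# Minimal sketch of symmetric per-tensor int8 weight quantization, the
# arithmetic behind post-training dynamic quantization. This is an
# illustration, not the paper's implementation.

def quantize_int8(weights):
    """Map float weights to int8 values with a single shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs else 1.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from int8 values and the scale."""
    return [x * scale for x in q]

w = [0.5, -1.27, 0.02, 1.0]
q, scale = quantize_int8(w)      # scale ≈ 0.01
restored = dequantize(q, scale)  # close to the originals, ~4x smaller storage
```

Storing 8-bit integers plus one float scale per tensor is what produces the memory savings reported above, at the cost of a small, bounded rounding error per weight.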
πŸ”Ž Similar Papers
No similar papers found.