🤖 AI Summary
To address the challenge of deploying large language models (LLMs) under resource constraints, this retrospective examines a hardware-aware, end-to-end approach that jointly optimizes model compression and efficient fine-tuning. Methodologically, it covers the integration of Low-Rank Adaptation (LoRA) into weight-sharing Neural Architecture Search (NAS), enabling automatic discovery of lightweight adapter configurations within a super-network. By unifying low-rank matrix decomposition with parameter-efficient fine-tuning (PEFT), the approach reduces inference latency and memory footprint in tandem. Experiments reported across multiple benchmark tasks show the compressed models retaining ≥98% of the original performance while substantially decreasing GPU memory consumption and inference latency, bridging architectural efficiency and hardware constraints without compromising accuracy. Code and trained models are publicly released.
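The core idea — elastic LoRA adapters inside a weight-sharing super-network — can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the class name `ElasticLoRA`, the rank choices, and the slicing scheme are assumptions made for clarity. The sketch shows the weight-sharing trick: a single pair of low-rank matrices is trained at the maximum rank, and candidate sub-adapters of smaller rank are extracted by slicing, so NAS can evaluate many adapter configurations without training each one separately.

```python
import numpy as np

class ElasticLoRA:
    """Hypothetical sketch of an elastic LoRA adapter in a weight-sharing
    supernet: one low-rank pair (B, A) is kept at the maximum rank, and
    smaller-rank sub-adapters reuse its leading rows/columns."""

    def __init__(self, d_in, d_out, max_rank=16, rank_choices=(4, 8, 16)):
        self.rank_choices = rank_choices
        rng = np.random.default_rng(0)
        # Standard LoRA initialization: A random, B zero, so the
        # initial update B @ A is zero and the base model is unchanged.
        self.A = rng.normal(0.0, 0.02, size=(max_rank, d_in))
        self.B = np.zeros((d_out, max_rank))

    def delta(self, rank):
        # Weight sharing: a rank-r sub-adapter is the leading r columns
        # of B times the leading r rows of A.
        assert rank in self.rank_choices
        return self.B[:, :rank] @ self.A[:rank, :]

    def forward(self, x, W0, rank):
        # Frozen base weight W0 plus the low-rank update of the chosen rank.
        return x @ (W0 + self.delta(rank)).T
```

During the search, a controller would sample `rank` per layer and score the resulting sub-network; at initialization every sub-adapter contributes a zero update, so all candidates start from the pre-trained model's behavior.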
📝 Abstract
The rapid expansion of Large Language Models (LLMs) has posed significant challenges regarding the computational resources required for fine-tuning and deployment. Recent advancements in low-rank adapters have demonstrated their efficacy in parameter-efficient fine-tuning (PEFT) of these models. This retrospective paper comprehensively discusses innovative approaches that synergize low-rank representations with Neural Architecture Search (NAS) techniques, particularly weight-sharing super-networks. Robust solutions for compressing and fine-tuning large pre-trained models are developed by integrating these methodologies. Our analysis highlights the potential of these combined strategies to democratize the use of LLMs, making them more accessible for deployment in resource-constrained environments. The resulting models exhibit reduced memory footprints and faster inference times, paving the way for more practical and scalable applications of LLMs. Models and code are available at https://github.com/IntelLabs/Hardware-Aware-Automated-Machine-Learning.