🤖 AI Summary
To address the challenge of deploying large language models (LLMs) under resource constraints, this retrospective examines a hardware-aware, end-to-end approach that jointly optimizes model compression and efficient fine-tuning. Methodologically, it covers the integration of Low-Rank Adaptation (LoRA) into weight-sharing Neural Architecture Search (NAS), enabling automatic discovery of lightweight adapter configurations within a super-network. By unifying low-rank matrix decomposition with parameter-efficient fine-tuning (PEFT), the approach reduces inference latency and memory footprint in tandem. Experiments reported across multiple benchmark tasks show the compressed models retaining ≥98% of the original performance while substantially decreasing GPU memory consumption and inference latency, bridging architectural efficiency and hardware constraints without compromising accuracy. Code and trained models are publicly released.
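The core idea — elastic LoRA adapters inside a weight-sharing super-network — can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the class name `ElasticLoRA`, the rank choices, and the slicing scheme are assumptions made for clarity. The sketch shows the weight-sharing trick: a single pair of low-rank matrices is trained at the maximum rank, and candidate sub-adapters of smaller rank are extracted by slicing, so NAS can evaluate many adapter configurations without training each one separately.

```python
import numpy as np

class ElasticLoRA:
    """Hypothetical sketch of an elastic LoRA adapter in a weight-sharing
    supernet: one low-rank pair (B, A) is kept at the maximum rank, and
    smaller-rank sub-adapters reuse its leading rows/columns."""

    def __init__(self, d_in, d_out, max_rank=16, rank_choices=(4, 8, 16)):
        self.rank_choices = rank_choices
        rng = np.random.default_rng(0)
        # Standard LoRA initialization: A random, B zero, so the
        # initial update B @ A is zero and the base model is unchanged.
        self.A = rng.normal(0.0, 0.02, size=(max_rank, d_in))
        self.B = np.zeros((d_out, max_rank))

    def delta(self, rank):
        # Weight sharing: a rank-r sub-adapter is the leading r columns
        # of B times the leading r rows of A.
        assert rank in self.rank_choices
        return self.B[:, :rank] @ self.A[:rank, :]

    def forward(self, x, W0, rank):
        # Frozen base weight W0 plus the low-rank update of the chosen rank.
        return x @ (W0 + self.delta(rank)).T
```

During the search, a controller would sample `rank` per layer and score the resulting sub-network; at initialization every sub-adapter contributes a zero update, so all candidates start from the pre-trained model's behavior.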
📝 Abstract
The rapid expansion of Large Language Models (LLMs) has posed significant challenges regarding the computational resources required for fine-tuning and deployment. Recent advancements in low-rank adapters have demonstrated their efficacy in parameter-efficient fine-tuning (PEFT) of these models. This retrospective paper comprehensively discusses innovative approaches that synergize low-rank representations with Neural Architecture Search (NAS) techniques, particularly weight-sharing super-networks. Robust solutions for compressing and fine-tuning large pre-trained models are developed by integrating these methodologies. Our analysis highlights the potential of these combined strategies to democratize the use of LLMs, making them more accessible for deployment in resource-constrained environments. The resulting models exhibit reduced memory footprints and faster inference times, paving the way for more practical and scalable applications of LLMs. Models and code are available at https://github.com/IntelLabs/Hardware-Aware-Automated-Machine-Learning.