HaLoRA: Hardware-aware Low-Rank Adaptation for Large Language Models Based on Hybrid Compute-in-Memory Architecture

📅 2025-02-27
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
To address the significant performance degradation that LoRA-finetuned large language models suffer on RRAM-based compute-in-memory (CIM) architectures due to device-level non-idealities, this paper proposes HaLoRA, a hardware-software co-design framework. HaLoRA deploys the model on an RRAM/SRAM hybrid CIM architecture, mapping the large pretrained weights onto RRAM arrays and the lightweight LoRA parameters onto SRAM, and introduces a hardware-aware training paradigm that aligns the optimization objectives under ideal and noisy conditions so the LoRA branch is both accurate and robust. This work presents the first efficient and robust LoRA deployment on hybrid CIM hardware. Experiments finetuning LLaMA 3.2 1B and 3B show that HaLoRA improves the average score across multiple reasoning benchmarks by up to 22.7 points over standard LoRA while maintaining stable performance across diverse RRAM noise levels.
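As a rough illustration (not the paper's implementation), the hybrid mapping and dual-objective idea can be sketched in NumPy: `W0` plays the role of the RRAM-resident pretrained weight perturbed by multiplicative conductance noise, while the low-rank factors `A` and `B` stay noise-free as if held in SRAM. The dimensions, the Gaussian noise model, and the MSE alignment term between ideal and noisy outputs are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

d, r = 16, 4                               # hidden size and LoRA rank (illustrative)
W0 = rng.standard_normal((d, d))           # pretrained weight, mapped to RRAM (noisy)
A = rng.standard_normal((r, d)) * 0.01     # LoRA down-projection, held in SRAM (clean)
B = np.zeros((d, r))                       # LoRA up-projection, zero-initialized as in LoRA

def forward(x, noise_std=0.0):
    """y = (W0 * noise) x + B A x; multiplicative noise models RRAM conductance variation."""
    W_noisy = W0 * (1.0 + noise_std * rng.standard_normal(W0.shape))
    return W_noisy @ x + B @ (A @ x)

x = rng.standard_normal(d)
y_ideal = forward(x, noise_std=0.0)        # ideal-condition objective sees this output
y_noisy = forward(x, noise_std=0.05)       # robustness objective sees the perturbed output
# dual-objective alignment (illustrative): penalize divergence between the two branches
align_loss = float(np.mean((y_ideal - y_noisy) ** 2))
```

In an actual training loop, a task loss on `y_ideal` and an alignment penalty like `align_loss` would be minimized jointly with respect to `A` and `B` only, keeping `W0` frozen.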

πŸ“ Abstract
Low-rank adaptation (LoRA) is a predominant parameter-efficient finetuning method to adapt large language models (LLMs) for downstream tasks. In this paper, we first propose to deploy the LoRA-finetuned LLMs on the hybrid compute-in-memory (CIM) architecture (i.e., pretrained weights onto RRAM and LoRA onto SRAM). To address performance degradation from RRAM's inherent noise, we design a novel Hardware-aware Low-rank Adaption (HaLoRA) method, aiming to train a LoRA branch that is both robust and accurate by aligning the training objectives under both ideal and noisy conditions. Experiments finetuning LLaMA 3.2 1B and 3B demonstrate HaLoRA's effectiveness across multiple reasoning tasks, achieving up to 22.7 improvement in average score while maintaining robustness at various noise levels.
Problem

Research questions and friction points this paper is trying to address.

Deploying LoRA-finetuned LLMs on compute-in-memory hardware
RRAM's inherent device noise degrades inference accuracy
Training LoRA branches that remain accurate under noisy conditions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hardware-aware Low-Rank Adaptation (HaLoRA)
RRAM/SRAM hybrid compute-in-memory architecture
Dual-objective alignment training for robustness under noise