Signs Beat Floats: Low-Rank Double-Binary Adaptation for On-Device Fine-Tuning

📅 2026-05-21

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the storage, communication, and compute overheads that the floating-point adapter in conventional LoRA incurs during on-device fine-tuning of large language models, even though the adapter itself has few parameters. The authors propose LoRDBA, presented as the first LoRA-compatible method to integrate double binarization: both low-rank factors are replaced with sign matrices, and magnitude information is recovered via lightweight channel-wise scaling. LoRDBA outperforms existing low-bit baselines at matched model sizes and matches fp16 LoRA quality in selected regimes, while compressing the adapter by over 10×. At matched rank r=16, the unmerged adapter adds at most 8% to prefill latency, and training requires approximately 1.6× the memory of fp16 LoRA.

📝 Abstract

On-device adaptation of large language models commonly keeps a quantized base model frozen while training and deploying a small, task-specific LoRA adapter. In the unmerged adapter-mode setting, however, the adapter is more than a compact storage module; it introduces an additional dense floating-point branch, maintains a trainable state for local updates, and acts as a unit of communication and hot-swapping. We introduce LoRDBA, a LoRA-compatible adapter that replaces both low-rank factors with binary sign carriers while representing magnitudes through lightweight, channel-wise scales, converting the dense adapter branch into two sign-accumulation matrix multiplications interleaved with channel-wise scaling. A finite-sample analysis shows that reconstruction quality is governed by the residual-to-magnitude ratio of the original LoRA factors. In adapter-mode experiments, LoRDBA outperforms low-bit baselines at matched model sizes while matching fp16 LoRA quality in selected regimes. The unmerged adapter incurs at most 8% prefill latency overhead at matched rank r=16 despite an over 10x reduction in adapter footprint, with moderate training memory overhead of approximately 1.6x that of fp16 LoRA.
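
As a rough illustration of the idea in the abstract (not the authors' implementation), the NumPy sketch below binarizes both LoRA factors and runs the adapter branch as two sign matrix multiplications interleaved with channel-wise scaling. Several details are assumptions made for the example: one scale per row of each factor, the mean absolute value as the scale (the least-squares optimal scale for a fixed sign pattern), fp16 storage for the scales, and the hypothetical names `binarize_rows`, `adapter_branch`, and `footprint_ratio`. The paper's actual scale placement, training procedure, and size accounting may differ.

```python
import numpy as np

def binarize_rows(W):
    """Approximate W by diag(scale) @ sign, with one scale per row (assumed layout)."""
    sign = np.where(W >= 0, 1.0, -1.0)   # map zeros to +1 so every entry is a sign
    scale = np.abs(W).mean(axis=1)       # least-squares optimal scale for each row
    return sign, scale

def adapter_branch(x, sign_A, scale_A, sign_B, scale_B):
    """Compute x @ A_hat.T @ B_hat.T using only sign matmuls and per-channel scaling.

    x: (batch, d_in), sign_A: (r, d_in), scale_A: (r,),
    sign_B: (d_out, r), scale_B: (d_out,)
    """
    h = scale_A * (x @ sign_A.T)         # first sign accumulation, then scale r channels
    return scale_B * (h @ sign_B.T)      # second sign accumulation, then scale d_out channels

def footprint_ratio(d_in, d_out, r, scale_bits=16):
    """fp16 adapter bits divided by (1-bit signs + per-row scales) bits, under the assumed layout."""
    fp16_bits = 16 * r * (d_in + d_out)
    binary_bits = r * (d_in + d_out) + scale_bits * (r + d_out)
    return fp16_bits / binary_bits

# Toy usage with random factors standing in for trained LoRA weights.
rng = np.random.default_rng(0)
d_in, d_out, r = 64, 48, 16
A = rng.normal(size=(r, d_in))
B = rng.normal(size=(d_out, r))
x = rng.normal(size=(4, d_in))

sign_A, scale_A = binarize_rows(A)
sign_B, scale_B = binarize_rows(B)
delta = adapter_branch(x, sign_A, scale_A, sign_B, scale_B)

# Relative reconstruction residual of one factor, as a simple quality proxy.
A_hat = scale_A[:, None] * sign_A
rel_residual_A = np.linalg.norm(A - A_hat) / np.linalg.norm(A)
```

Under these assumptions the dense floating-point products are replaced by additions and subtractions of activations, with only `r + d_out` multiplications per token for the scales. For a 4096 × 4096 layer at r=16, `footprint_ratio(4096, 4096, 16)` comes out a little above 10, which is in line with the reported reduction but should not be read as the paper's own accounting.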

Problem

Research questions and friction points this paper is trying to address.

on-device fine-tuning

low-rank adaptation

binary quantization

adapter efficiency

large language models

Innovation

Methods, ideas, or system contributions that make the work stand out.

binary adaptation

low-rank approximation

on-device fine-tuning