PHLoRA: data-free Post-hoc Low-Rank Adapter extraction from full-rank checkpoint

📅 2025-09-13

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Full-parameter fine-tuning produces full-rank checkpoints that are costly to deploy and hard to adapt and serve at scale. Method: PHLoRA is a training-free, gradient-free, post-hoc method for extracting low-rank adapters. It decomposes the weight delta between the base model and its fine-tuned counterpart into low-rank updates, which can then be merged into the base weights or routed dynamically at inference time via S-LoRA. Contribution/Results: The approach decouples fine-tuning from adapter generation, so adapters can be extracted from existing fully fine-tuned checkpoints, including third-party models. Evaluated on text, image, and video benchmarks, the extracted adapters preserve over 99% of the energy of the original weight delta, with negligible performance degradation (average drop <0.3%). The adapters can be pruned safely and served on industrial-grade inference platforms (e.g., NVIDIA NIM), reducing GPU memory overhead and serving costs.

📝 Abstract

We introduce PHLoRA (Post-hoc LoRA, pronounced "flora"), a simple yet powerful method to extract low-rank adaptation (LoRA) adapters from full-rank fine-tuned models without requiring access to training data or gradients. By computing the low-rank decomposition of the weight differences between a base model and its fine-tuned counterpart, our method reconstructs adapter modules that can be merged or dynamically routed at inference time via S-LoRA, or served in scalable industry settings using platforms like NVIDIA NIM. This approach amortizes latency overhead across requests and yields substantial cost savings. Unlike prior work that trains each adapter explicitly, our approach decouples fine-tuning from adapter generation, allowing adapter extraction from existing full-rank models or third-party checkpoints. Experiments on text, image, and video benchmarks using the Amazon Nova model family demonstrate that the extracted adapters preserve most of the energy of the full weight delta, can be pruned safely, and yield negligible degradation in downstream task performance when re-merged. Overall, PHLoRA provides a practical path for making all existing full-rank checkpoints adapter-ready, democratizing scalable inference for all models.
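The core step described in the abstract, a low-rank decomposition of the weight difference for one layer, can be illustrated with a truncated SVD. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `extract_adapter`, the use of plain NumPy, the fixed `rank`, and the definition of "energy" as the fraction of squared singular values retained are all assumptions made for illustration.

```python
import numpy as np

def extract_adapter(w_base, w_ft, rank):
    """Return (B, A, energy) with B @ A approximating w_ft - w_base.

    Hypothetical helper: a truncated SVD of the weight delta.
    """
    delta = w_ft - w_base                      # full-rank weight delta
    u, s, vt = np.linalg.svd(delta, full_matrices=False)
    b = u[:, :rank] * s[:rank]                 # shape (out_features, rank)
    a = vt[:rank, :]                           # shape (rank, in_features)
    total = np.sum(s ** 2)
    # Assumed definition of "energy": share of squared singular values kept.
    energy = float(np.sum(s[:rank] ** 2) / total) if total > 0 else 1.0
    return b, a, energy

# Synthetic example: a delta that is rank 4 plus small noise.
rng = np.random.default_rng(0)
w_base = rng.standard_normal((64, 48))
true_delta = rng.standard_normal((64, 4)) @ rng.standard_normal((4, 48))
w_ft = w_base + true_delta + 1e-3 * rng.standard_normal((64, 48))

b, a, energy = extract_adapter(w_base, w_ft, rank=4)
w_remerged = w_base + b @ a                    # re-merged approximation of w_ft
```

Only the two weight matrices are needed here, which is what makes the extraction data-free and gradient-free. In practice the rank would be chosen per layer, for example by picking the smallest rank that reaches a target energy.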

Problem

Research questions and friction points this paper is trying to address.

Extract LoRA adapters without training data

Decouple fine-tuning from adapter generation

Enable scalable inference for existing models

Innovation

Methods, ideas, or system contributions that make the work stand out.

Extracts adapters without training data

Uses low-rank decomposition of weight differences

Enables scalable inference via dynamic routing
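The difference between merging an extracted adapter and routing it dynamically can be sketched as follows. This is an illustration of the arithmetic only, not S-LoRA's API: the function names `forward_merged` and `forward_routed`, the dictionary of adapters, and the per-request `adapter_ids` list are hypothetical, and a real serving system would batch the adapter computation rather than loop over requests in Python.

```python
import numpy as np

def forward_merged(x, w_base, b, a):
    """Fold the adapter into the base weight once, then run one matmul."""
    return x @ (w_base + b @ a).T

def forward_routed(x, w_base, adapters, adapter_ids):
    """Share the base matmul; add each request's low-rank update separately."""
    y = x @ w_base.T                           # shared across all requests
    for i, adapter_id in enumerate(adapter_ids):
        if adapter_id is not None:
            b, a = adapters[adapter_id]
            y[i] += b @ (a @ x[i])             # low-rank path: two small matmuls
    return y

rng = np.random.default_rng(1)
w_base = rng.standard_normal((32, 16))
adapters = {
    "task_a": (rng.standard_normal((32, 4)), rng.standard_normal((4, 16))),
}
x = rng.standard_normal((3, 16))               # three requests in one batch
y_routed = forward_routed(x, w_base, adapters, ["task_a", None, "task_a"])
y_merged = forward_merged(x, w_base, *adapters["task_a"])
y_base = x @ w_base.T
```

Requests routed to `task_a` get the same output as the merged model, while the request with no adapter gets the base model's output. Routing keeps a single copy of the base weights in memory and stores only the small `(B, A)` factors per fine-tuned variant.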
