🤖 AI Summary
Efficiently leveraging pre-trained language model experts for knowledge-intensive tasks remains challenging, particularly when no training data is available at inference time. Method: This paper proposes LoRA-Augmented Generation (LAG), a parameter-efficient multi-expert inference framework that requires no additional training or data access. LAG builds a large library of knowledge- and task-specific LoRA adapters and uses retrieval-based routing to filter, select, and fuse LoRA experts on a per-token and per-layer basis during generation, enabling fine-grained knowledge scheduling. When external data is available, LAG also integrates with retrieval-augmented generation (RAG) setups. Contribution/Results: Experiments on diverse knowledge-intensive tasks show that LAG outperforms existing data-free methods, and that it remains compatible with data-dependent alternatives such as RAG.
📝 Abstract
The proliferation of fine-tuned language model experts for specific tasks and domains signals the need for efficient selection and combination methods. We propose LoRA-Augmented Generation (LAG) for leveraging large libraries of knowledge and task-specific LoRA adapters. LAG requires no additional training or access to data, and efficiently filters, retrieves, and applies experts on a per-token and layer basis. We evaluate LAG on various knowledge-intensive tasks, achieving superior performance over existing data-free methods. We explore scenarios where additional data is available, demonstrating LAG's compatibility with alternative solutions such as retrieval-augmented generation (RAG).
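The abstract describes retrieving and applying LoRA experts per token and per layer, but gives no implementation details. The following is a minimal, hypothetical sketch of that idea, not the paper's actual method: all names, dimensions, the cosine-similarity scoring, and the softmax fusion are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
D, R, N_EXPERTS, TOP_K = 16, 4, 8, 2  # hidden dim, LoRA rank, library size, experts per token (illustrative)

# Hypothetical LoRA library: each expert holds low-rank factors (A, B)
# and a retrieval key embedding (how keys are built is not specified here).
library = [
    {
        "A": rng.normal(scale=0.1, size=(R, D)),  # down-projection
        "B": rng.normal(scale=0.1, size=(D, R)),  # up-projection
        "key": rng.normal(size=D),                # embedding used for retrieval
    }
    for _ in range(N_EXPERTS)
]

def route_and_apply(h):
    """Retrieve the top-k LoRA experts for one token's hidden state at one
    layer (cosine similarity against expert keys, an assumed scoring rule)
    and add their weighted low-rank updates to the base activation."""
    keys = np.stack([e["key"] for e in library])
    sims = keys @ h / (np.linalg.norm(keys, axis=1) * np.linalg.norm(h) + 1e-9)
    top = np.argsort(sims)[-TOP_K:]
    weights = np.exp(sims[top]) / np.exp(sims[top]).sum()  # softmax over retrieved experts
    delta = sum(w * (library[i]["B"] @ (library[i]["A"] @ h)) for w, i in zip(weights, top))
    return h + delta

h = rng.normal(size=D)       # hidden state of one token at one layer
out = route_and_apply(h)     # routing would repeat per token and per layer
print(out.shape)
```

In this toy version, routing is just nearest-key lookup followed by a weighted sum of low-rank updates; the point is only that expert selection can happen independently at each token and layer, with no extra training.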