AuRA: Internalizing Audio Understanding into LLMs as LoRA

📅 2026-06-09
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing speech-language models face challenges such as transcription latency, high training costs, or loose modality coupling. This work proposes a tightly coupled end-to-end approach that simultaneously feeds speech into both a pretrained ASR encoder (teacher) and a LoRA-finetuned large language model (LLM, student) via a lightweight audio embedding layer. Layer-wise knowledge distillation is employed to align hidden states, thereby internalizing speech representations directly into the LLM’s LoRA adaptation modules for the first time. The method significantly outperforms cascaded systems, speech-to-LLM adaptation baselines, and large multimodal models across multiple speech-language benchmarks, achieving state-of-the-art results in both inference efficiency and task performance.
📝 Abstract
Recent efforts to extend large language models (LLMs) to speech inputs typically rely on cascaded ASR-LLM pipelines, end-to-end speech-language models, or bridge/distillation-based adaptation. While these routes respectively reuse strong pretrained components, enable native speech-language interaction, or offer lightweight adaptation, they often suffer from transcript-interface latency, costly multimodal training, or sequential speech-language coupling. To address these limitations, we present AuRA, a method that distills audio encoding capability into the LLM. Specifically, AuRA feeds the same speech input to an ASR encoder (as a teacher) and a LoRA-adapted LLM (as a student) through a lightweight audio embedding layer, and uses layer-wise distillation to align the student's hidden states with corresponding teacher representations, thereby internalizing speech representations into lightweight LLM-side adaptations. Compared with cascaded and serial bridge methods, AuRA enables tighter speech-language joint modeling and efficient parallel end-to-end inference, while also reusing pretrained speech and language models rather than requiring large-scale multimodal training. On multiple speech-language benchmarks, AuRA consistently outperforms cascaded systems, speech-to-LLM adaptation baselines, and large-scale speech-language and multimodal models in both effectiveness and efficiency.
Problem

Research questions and friction points this paper is trying to address.

speech-language integration
ASR-LLM pipeline
multimodal training
speech representation
model latency
Innovation

Methods, ideas, or system contributions that make the work stand out.

AuRA
LoRA
knowledge distillation
speech-language modeling
end-to-end inference
🔎 Similar Papers
No similar papers found.