AuRA: Internalizing Audio Understanding into LLMs as LoRA

📅 2026-06-09

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing speech-language models suffer from transcript-interface latency, costly multimodal training, or loose modality coupling. This work proposes AuRA, a tightly coupled end-to-end approach that feeds the same speech input to a pretrained ASR encoder (the teacher) and, through a lightweight audio embedding layer, to a LoRA-adapted large language model (LLM, the student). Layer-wise knowledge distillation aligns the student's hidden states with the corresponding teacher representations, internalizing speech representations directly into the LLM’s LoRA adaptation modules. Across multiple speech-language benchmarks, the method consistently outperforms cascaded systems, speech-to-LLM adaptation baselines, and large-scale speech-language and multimodal models in both effectiveness and efficiency.

📝 Abstract

Recent efforts to extend large language models (LLMs) to speech inputs typically rely on cascaded ASR-LLM pipelines, end-to-end speech-language models, or bridge/distillation-based adaptation. While these routes respectively reuse strong pretrained components, enable native speech-language interaction, or offer lightweight adaptation, they often suffer from transcript-interface latency, costly multimodal training, or sequential speech-language coupling. To address these limitations, we present AuRA, a method that distills audio encoding capability into the LLM. Specifically, AuRA feeds the same speech input to an ASR encoder (as a teacher) and a LoRA-adapted LLM (as a student) through a lightweight audio embedding layer, and uses layer-wise distillation to align the student's hidden states with corresponding teacher representations, thereby internalizing speech representations into lightweight LLM-side adaptations. Compared with cascaded and serial bridge methods, AuRA enables tighter speech-language joint modeling and efficient parallel end-to-end inference, while also reusing pretrained speech and language models rather than requiring large-scale multimodal training. On multiple speech-language benchmarks, AuRA consistently outperforms cascaded systems, speech-to-LLM adaptation baselines, and large-scale speech-language and multimodal models in both effectiveness and efficiency.
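The layer-wise alignment described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the loss function, the layer pairing, or how dimensions are matched, so everything below is an assumption made for illustration: mean squared error as the alignment loss, a hypothetical `layer_map` from student layers to teacher layers, and hidden states that already share the same shape (no projection). The names `mse` and `layerwise_distill_loss` are hypothetical and not taken from the paper.

```python
from typing import Dict, List, Sequence

# One layer's hidden states: [time_steps][hidden_dim].
Matrix = List[List[float]]


def mse(a: Matrix, b: Matrix) -> float:
    """Mean squared error between two equally shaped matrices."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("student and teacher hidden states must share a shape")
    total = sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    count = sum(len(row) for row in a)
    return total / count


def layerwise_distill_loss(
    student_states: Sequence[Matrix],
    teacher_states: Sequence[Matrix],
    layer_map: Dict[int, int],
) -> float:
    """Average the per-layer alignment loss over mapped layer pairs.

    layer_map is a hypothetical pairing {student_layer: teacher_layer};
    MSE is an assumed choice of alignment loss, not the paper's stated one.
    """
    if not layer_map:
        raise ValueError("layer_map must pair at least one student layer")
    losses = [
        mse(student_states[s], teacher_states[t]) for s, t in layer_map.items()
    ]
    return sum(losses) / len(losses)


# Toy example: two layers, one time step, hidden size 2.
student = [[[1.0, 2.0]], [[0.0, 0.0]]]
teacher = [[[1.0, 2.0]], [[2.0, 0.0]]]
loss = layerwise_distill_loss(student, teacher, {0: 0, 1: 1})
```

In a setup like the one the abstract describes, a loss of this kind would update only the student-side LoRA parameters and the audio embedding layer while the teacher encoder stays fixed. That training detail is also an assumption here, since the abstract does not spell it out.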

Problem

Research questions and friction points this paper is trying to address.

speech-language integration

ASR-LLM pipeline

multimodal training

speech representation

model latency

Innovation

Methods, ideas, or system contributions that make the work stand out.

AuRA

LoRA

knowledge distillation