LaSR: Context-Aware Speech Recognition via Latent Reasoning

📅 2026-05-29

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current speech large language models (Speech LLMs) often misrecognize domain-specific terminology and context-sensitive vocabulary because they model the speaker's intent and topical context poorly. This work proposes LaSR (Latent Speech Reasoning), a training paradigm that aligns chain-of-thought (CoT) supervision with the acoustic feature region of the targeted word and adds latent reasoning periods for grounding contextual information, so the model does not generate explicit intermediate tokens. The paper also introduces Spoken Darwin-Science, a large-scale spoken corpus focused on academic terminology, for benchmarking contextual recognition of specialized vocabulary. In preliminary experiments on Fun-Audio-Chat, LaSR improves terminology recognition over standard supervised fine-tuning baselines without introducing additional latency.

📝 Abstract

Recent advances in Speech Large Language Models (Speech LLMs) have significantly enhanced spoken language understanding and reasoning. However, their contextual awareness is limited, and they struggle to perform speech recognition that effectively reflects the speaker's intent and topical context. In this paper, we propose LaSR (Latent Speech Reasoning), a novel training paradigm featuring a context-aware reasoning trajectory that leverages the latent reasoning process. Instead of generating explicit intermediate tokens, LaSR aligns chain-of-thought (CoT) supervision around the acoustic feature region of the targeted word, and introduces latent reasoning periods for context information grounding and transcriptional transition. Furthermore, to effectively benchmark contextual recognition on specialized vocabulary, we propose Spoken Darwin-Science, a large-scale corpus focusing on academic terminology. Preliminary experiments on Fun-Audio-Chat demonstrate that LaSR significantly improves terminology recognition without introducing additional latency and consistently outperforms standard supervised fine-tuning baselines. Our findings highlight the potential of latent reasoning in building efficient, context-aware speech assistants.
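
The idea of anchoring supervision around the acoustic region of a targeted word can be pictured with a small sketch. This is only an illustration under stated assumptions, not the paper's procedure: the abstract does not specify the alignment method, the form of the latent steps, or the loss. The names `AlignedWord`, `build_targets`, `LATENT`, and `n_latent` are hypothetical, and placing the latent slots immediately before the term is an assumption made for the example.

```python
from dataclasses import dataclass

# Hypothetical placeholder for one latent reasoning step (not a token from the paper).
LATENT = "<latent>"


@dataclass
class AlignedWord:
    text: str
    start_frame: int  # first acoustic frame of the word (inclusive)
    end_frame: int    # one past the last acoustic frame (exclusive)


def build_targets(words, terms, n_latent=2):
    """Build a target sequence of (token, anchor_frame) pairs.

    For every word that matches a term (case-insensitive), n_latent
    placeholder slots are emitted first, anchored at the frame where
    the term's acoustic region begins. All other words pass through
    unchanged, anchored at their own start frame.
    """
    terms_lower = {t.lower() for t in terms}
    targets = []
    for word in words:
        if word.text.lower() in terms_lower:
            # Assumption: latent slots sit right before the targeted word.
            targets.extend((LATENT, word.start_frame) for _ in range(n_latent))
        targets.append((word.text, word.start_frame))
    return targets


if __name__ == "__main__":
    utterance = [
        AlignedWord("the", 0, 5),
        AlignedWord("mitochondria", 5, 20),
        AlignedWord("divide", 20, 30),
    ]
    for token, frame in build_targets(utterance, {"mitochondria"}):
        print(frame, token)
```

In this toy layout the transcript length is unchanged outside the flagged terms, which is one way to read the claim that the reasoning happens in latent slots rather than in explicit intermediate tokens.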

Problem

Research questions and friction points this paper is trying to address.

context-aware speech recognition

speech large language models

spoken language understanding

terminology recognition

contextual awareness

Innovation

Methods, ideas, or system contributions that make the work stand out.

Latent Reasoning

Context-Aware Speech Recognition

Speech LLMs