Which Speech Representation Better Matches Text-Native Reasoning? A Study of Speech-Text Alignment on Frame Rate and Representation

📅 2026-06-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the performance degradation in large language model (LLM)-driven spoken dialogue systems caused by the mismatch between the temporal granularities of speech and text. Framing speech representation design as a representation selection problem, the study systematically investigates the impact of frame rate and alignment depth on spoken question answering under the constraints of a frozen LLM backbone and fixed information rate. To enable high-bitrate semantic representations at low frame rates, the authors propose factorized finite scalar quantization (FSQ) combined with a lightweight non-autoregressive audio-language model head. Experimental results demonstrate that a frame rate of 4.17 Hz aligned with intermediate LLM layers yields optimal performance, significantly narrowing the gap between speech and text inputs and validating the efficacy of low-frame-rate, high-semantic-density representations.
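The trade-off behind the fixed information rate constraint can be made concrete with a little arithmetic. The sketch below is illustrative only. It assumes the swept rates are integer downsamplings of the 50 Hz base rate (50/12 ≈ 4.17 Hz and 50/24 ≈ 2.08 Hz match the reported figures), and it uses a hypothetical information rate of 600 bits/s, which the source does not state. The function names are made up for this example.

```python
import math

BASE_RATE_HZ = 50.0            # highest frame rate in the reported sweep
ASSUMED_INFO_RATE_BPS = 600.0  # hypothetical value, NOT taken from the paper


def frame_rate(downsample_factor: int) -> float:
    """Frame rate obtained by downsampling the 50 Hz base rate."""
    return BASE_RATE_HZ / downsample_factor


def bits_per_frame(rate_hz: float,
                   info_rate_bps: float = ASSUMED_INFO_RATE_BPS) -> float:
    """Bits each frame must carry to keep the information rate fixed."""
    return info_rate_bps / rate_hz


def num_frames(duration_s: float, rate_hz: float) -> int:
    """Number of speech tokens the LLM sees for an utterance."""
    return math.ceil(duration_s * rate_hz)


if __name__ == "__main__":
    for factor in (1, 12, 24):
        rate = frame_rate(factor)
        print(f"{rate:6.2f} Hz: {num_frames(10.0, rate):4d} frames per 10 s, "
              f"{bits_per_frame(rate):6.1f} bits/frame")
```

Under these assumptions, lowering the frame rate shortens the sequence the LLM must process while raising the number of bits each frame has to encode, which is why a higher-capacity quantizer becomes necessary at low frame rates.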

📝 Abstract

Spoken dialogue models typically start from text LLM backbones, yet reasoning often degrades when conditioning on speech instead of text. We attribute part of this modality gap to a temporal-granularity mismatch: speech tokens are temporally redundant and far longer than text under matched semantics, diluting per-token semantic density and weakening text-native reasoning dynamics. We study speech token design as a representation selection problem and sweep frame rates under a frozen LLM backbone with a fixed information rate. To make low frame rates feasible, we introduce factorized FSQ and a lightweight non-autoregressive audio LM head, scaling capacity to nearly 300 bits/frame without sacrificing efficient prediction. With the bottleneck removed, we sweep frame rates (50 → 2.08 Hz) and alignment depth, and observe a consistent best regime for speech QA at 4.17 Hz with intermediate-layer representation alignment.
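To illustrate how a factorized scalar quantizer can reach capacities on the order of 300 bits/frame, here is a minimal sketch. It is not the authors' implementation. It uses a simplified form of finite scalar quantization (bound each scalar with tanh, then round to one of L levels) and assumes, hypothetically, that "factorized" means splitting a frame into independent groups with one code per group. The layout of 24 groups with levels [8, 8, 8, 8] is invented for the example, as are all function names.

```python
import math
from typing import List, Sequence


def fsq_quantize(z: Sequence[float], levels: Sequence[int]) -> List[int]:
    """Simplified FSQ: map each scalar to an integer code in [0, L_i - 1]."""
    codes = []
    for value, n_levels in zip(z, levels):
        bounded = math.tanh(value)  # squash into [-1, 1]
        codes.append(round((bounded + 1.0) / 2.0 * (n_levels - 1)))
    return codes


def group_bits(levels: Sequence[int]) -> float:
    """Bits carried by one group: log2 of its implicit codebook size."""
    return sum(math.log2(n_levels) for n_levels in levels)


def total_bits_per_frame(groups: Sequence[Sequence[int]]) -> float:
    """Capacity of a frame made of independently quantized groups."""
    return sum(group_bits(levels) for levels in groups)


def parallel_head_outputs(groups: Sequence[Sequence[int]]) -> int:
    """Logits needed if each group gets its own classifier (sum, not product)."""
    return sum(math.prod(levels) for levels in groups)


# Hypothetical layout: 24 groups, each with 4 dimensions of 8 levels (12 bits).
HYPOTHETICAL_GROUPS = [[8, 8, 8, 8]] * 24

if __name__ == "__main__":
    print(total_bits_per_frame(HYPOTHETICAL_GROUPS), "bits/frame")
    print(parallel_head_outputs(HYPOTHETICAL_GROUPS), "logits in total")
```

The point of the sketch is the scaling behavior: capacity adds across groups, while the number of outputs a per-group predictor needs also only adds. A single joint classifier over the same frame would instead need one class per combination of codes, which is infeasible at this capacity. Whether the paper's non-autoregressive head is structured this way is an assumption here.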

Problem

Research questions and friction points this paper is trying to address.

speech-text alignment

temporal granularity

speech representation

frame rate

text-native reasoning

Innovation

Methods, ideas, or system contributions that make the work stand out.

speech-text alignment

frame rate optimization

factorized FSQ

non-autoregressive audio LM

semantic density

🔎 Similar Papers

Prosody Analysis of Audiobooks

2023-10-10 · arXiv.org · Citations: 0

SSR: Alignment-Aware Modality Connector for Speech Language Models

2024-09-30 · arXiv.org · Citations: 3