Test-Time Compute Scaling for ASR with Depth-Conditioned Looped Transformers

📅 2026-06-03

🤖 AI Summary
This work addresses a limitation of end-to-end automatic speech recognition (ASR) systems: their fixed-depth acoustic encoders cannot use additional computation at inference time to improve recognition. The authors propose LARM, a depth-conditioned looped Transformer that brings test-time compute scaling to non-autoregressive speech recognition. A shared Transformer block is applied recurrently, so the number of loop iterations can be chosen at inference time. Four mechanisms structure the loop: sparse CTC checkpoints, supervision-clock embeddings, FiLM-based depth conditioning, and delayed soft-posterior feedback. On LibriSpeech, word error rate (WER) decreases as the number of inference loops increases, and LARM is competitive with deeper baselines that do not share parameters.
📝 Abstract
End-to-end ASR systems typically use fixed-depth acoustic encoders at inference, making it difficult to trade additional test-time computation for improved recognition without training a larger model. A natural approach is to reuse a shared Transformer block recurrently, but we find that naive looping does not fully exploit additional recurrent compute. We introduce LARM, a depth-conditioned looped Transformer that turns recurrent encoder depth into a controllable test-time compute axis. LARM combines sparse CTC checkpoints, supervision-clock embeddings, FiLM depth conditioning, and delayed soft-posterior feedback. These components structure the loop into recognition checkpoints separated by latent refinement phases and allow shared weights to specialize across recurrent steps. On LibriSpeech, LARM improves WER as the number of inference loops increases and achieves performance competitive with deeper unshared-parameter baselines. Our results show that test-time compute scaling can extend beyond autoregressive language-model reasoning to continuous non-autoregressive speech recognition.
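The looping idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes a toy residual layer in place of the shared Transformer block, a lookup table of per-step FiLM parameters (`gamma`, `beta`), and a fixed checkpoint interval. The names `make_film_table`, `shared_block`, `film`, `looped_encoder`, and `checkpoint_every` are all hypothetical, and supervision-clock embeddings and soft-posterior feedback are omitted.

```python
import math
import random


def make_film_table(max_loops, dim, seed=0):
    """Hypothetical FiLM table: one (gamma, beta) pair per loop index."""
    rng = random.Random(seed)
    return [
        ([1.0 + 0.1 * rng.uniform(-1, 1) for _ in range(dim)],
         [0.1 * rng.uniform(-1, 1) for _ in range(dim)])
        for _ in range(max_loops)
    ]


def shared_block(h, weights):
    """Toy stand-in for the shared Transformer block (residual tanh layer)."""
    dim = len(h)
    return [
        h[i] + math.tanh(sum(weights[i][j] * h[j] for j in range(dim)))
        for i in range(dim)
    ]


def film(h, gamma, beta):
    """Feature-wise linear modulation: scale and shift each feature."""
    return [g * x + b for g, x, b in zip(gamma, h, beta)]


def looped_encoder(frames, weights, film_table, num_loops, checkpoint_every=2):
    """Apply the same block num_loops times, modulated by the loop index.

    Returns the final frame states and the loop indices at which a
    (hypothetical) CTC head would read the states.
    """
    if num_loops > len(film_table):
        raise ValueError("num_loops exceeds the FiLM table size")
    states = [list(f) for f in frames]
    checkpoints = []
    for step in range(num_loops):
        gamma, beta = film_table[step]
        # Same weights every pass; only the FiLM parameters depend on depth.
        states = [shared_block(film(h, gamma, beta), weights) for h in states]
        if (step + 1) % checkpoint_every == 0:
            checkpoints.append(step + 1)
    return states, checkpoints
```

Calling `looped_encoder` with a larger `num_loops` reuses the same `weights` for more passes, which is the sense in which depth becomes an inference-time choice. In this sketch the loops between checkpoints play the role of the latent refinement phases described in the abstract.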
Problem

Research questions and friction points this paper is trying to address.

test-time compute scaling
automatic speech recognition
looped Transformers
inference efficiency
depth conditioning
Innovation

Methods, ideas, or system contributions that make the work stand out.

test-time compute scaling
looped Transformer
depth conditioning
non-autoregressive ASR
LARM