MURMUR: An Efficient Inference System for Long-Form ASR

📅 2026-05-31

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the trade-off between accuracy and latency in long-form automatic speech recognition (ASR). The authors propose Murmur, an inference system that operates at two levels. At the inter-chunk level, it revisits the chunk-based pipeline for modern long-context ASR models and treats chunk size as a tunable hyperparameter, finding that intermediate chunk sizes strike a good balance between accuracy and latency. At the intra-chunk level, it exploits attention sparsity through a sliding-window key-value (KV) cache eviction policy applied to both output and speech tokens. On the AMI-IHM dataset, Murmur matches single-pass accuracy while reducing latency by 4.2×, and token eviction yields further gains at less than 1% relative tcpWER degradation.
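To make the inter-chunk idea concrete, here is a minimal sketch of splitting a recording into chunks whose length is a tunable parameter. The function name `split_into_chunks`, the non-overlapping layout, and the 16 kHz sample rate in the usage line are illustrative assumptions, not details taken from the paper or its code.

```python
def split_into_chunks(num_samples, sample_rate, chunk_seconds):
    """Return (start, end) sample indices for consecutive chunks.

    Assumption: chunks are non-overlapping and the last chunk may be shorter.
    The actual Murmur pipeline may segment audio differently.
    """
    if chunk_seconds <= 0 or sample_rate <= 0:
        raise ValueError("chunk_seconds and sample_rate must be positive")
    # Chunk length in samples; at least one sample per chunk.
    chunk_len = max(1, int(round(chunk_seconds * sample_rate)))
    return [
        (start, min(start + chunk_len, num_samples))
        for start in range(0, num_samples, chunk_len)
    ]


# Hypothetical usage: a 10-second recording at 16 kHz, cut into 4-second chunks.
chunks = split_into_chunks(num_samples=10 * 16000, sample_rate=16000, chunk_seconds=4)
```

Sweeping `chunk_seconds` from small values up to the full recording length moves between a classic short-window pipeline and single-pass inference, which is the range over which the summary says intermediate sizes balance accuracy and latency.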

📝 Abstract

Long-form automatic speech recognition (ASR) requires both high accuracy and low latency, but existing systems force a trade-off between the two. Chunk-based pipelines process audio in parallel windows for low latency, but lose cross-chunk context and need brittle heuristics to align speakers and timestamps at boundaries. Long-context ASR models resolve everything in a single pass for better accuracy, but are an order of magnitude slower. We propose Murmur, an inference system that overcomes this trade-off by operating at two levels. At the inter-chunk level, we revisit the chunk-based pipeline for modern long-context ASR, treating chunk size as a tunable hyperparameter, and show that intermediate chunk sizes strike a good balance of accuracy and latency. At the intra-chunk level, we exploit attention sparsity through a sliding window KV cache eviction policy applied to both output and speech tokens. On AMI-IHM, Murmur matches single-pass accuracy while reducing latency by 4.2x, with further gains from token eviction at less than 1% relative tcpWER degradation. The code of Murmur is available at https://github.com/uw-syfi/Murmur.
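The intra-chunk mechanism can be illustrated with a toy sliding-window cache that retains only the most recent entries and evicts older ones. This is a sketch under stated assumptions: the class `SlidingWindowKVCache`, its methods, and the use of plain Python objects in place of key/value tensors are hypothetical, and the abstract does not specify how Murmur sizes the window or handles speech tokens versus output tokens.

```python
from collections import deque


class SlidingWindowKVCache:
    """Toy KV cache that keeps only the `window` most recent positions.

    Assumption: eviction is purely recency-based. The real policy in Murmur
    may differ (for example, separate windows for speech and output tokens).
    """

    def __init__(self, window):
        if window <= 0:
            raise ValueError("window must be positive")
        self.window = window
        # deque with maxlen drops the oldest entry automatically on overflow.
        self._entries = deque(maxlen=window)

    def append(self, position, key, value):
        """Store the key/value pair for one token position."""
        self._entries.append((position, key, value))

    def positions(self):
        """Return the token positions currently retained in the cache."""
        return [position for position, _, _ in self._entries]

    def __len__(self):
        return len(self._entries)


# Hypothetical usage: five tokens pass through a window of size three.
cache = SlidingWindowKVCache(window=3)
for i in range(5):
    cache.append(i, f"k{i}", f"v{i}")
```

After the loop, only the three most recent positions remain, so attention over the cache costs time proportional to the window size rather than to the full sequence length.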

Problem

Research questions and friction points this paper is trying to address.

long-form ASR

latency-accuracy trade-off

chunk-based pipeline

long-context modeling

automatic speech recognition

Innovation

Methods, ideas, or system contributions that make the work stand out.

long-form ASR

chunk-based pipeline

attention sparsity