Splitformer: An improved early-exit architecture for automatic speech recognition on edge devices

📅 2025-06-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenge of balancing efficiency and accuracy in automatic speech recognition (ASR) models deployed on edge devices, where computational resources are both severely constrained and time-varying, this paper proposes a dynamically adjustable early-exit architecture. The method introduces two key innovations: (1) a parallel downsampling branch that enhances the representational capacity of early-exit layers, and (2) integration of Zipformer-style variable-frame-rate modeling, using lightweight downsampling/upsampling modules and multi-granularity exit paths to jointly optimize inference efficiency. Evaluated on standard benchmarks, the approach achieves significant improvements in recognition accuracy at the cost of a small increase in parameter count, without affecting inference time. These results demonstrate its effectiveness for deployment in resource-constrained edge scenarios.
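The parallel downsampled branch described above can be illustrated with a minimal sketch. The average-pooling downsample, the tanh "layers", the pooling factor of 2, and the weight shapes are illustrative assumptions, not the paper's actual Zipformer-style modules:

```python
import numpy as np

def downsample(x, factor=2):
    """Average-pool along the time axis of a (frames, features) array."""
    frames = (x.shape[0] // factor) * factor
    return x[:frames].reshape(-1, factor, x.shape[1]).mean(axis=1)

def upsample(x, factor=2):
    """Repeat frames to restore the original frame rate."""
    return np.repeat(x, factor, axis=0)

def parallel_branch_layer(x, w_main, w_ds, factor=2):
    """Toy encoder layer with a parallel downsampled branch: the main
    path processes x at full frame rate, the branch processes a pooled
    copy at a lower frame rate, and the branch output is upsampled and
    summed back into the main path."""
    main = np.tanh(x @ w_main)                      # full-rate path
    branch = np.tanh(downsample(x, factor) @ w_ds)  # low-rate path
    branch = upsample(branch, factor)[: main.shape[0]]
    return main + branch

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16))        # 8 frames, 16-dim features
w_main = rng.standard_normal((16, 16))
w_ds = rng.standard_normal((16, 16))
y = parallel_branch_layer(x, w_main, w_ds)
print(y.shape)  # (8, 16): frame rate and feature dimension preserved
```

Because the branch operates on half the frames, its extra compute is small, which matches the paper's claim of improved accuracy without added inference time.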

📝 Abstract
The ability to dynamically adjust the computational load of neural models during inference in a resource-aware manner is crucial for on-device processing scenarios, which are characterised by limited and time-varying computational resources. Early-exit architectures represent an elegant and effective solution, since they can process the input with a subset of their layers, exiting at intermediate branches (the topmost layers are then removed from the model). From a different perspective, for automatic speech recognition applications there are memory-efficient neural architectures that apply variable frame rate analysis through downsampling/upsampling operations in the middle layers, reducing the overall number of operations and significantly improving performance on well-established benchmarks. One example is the Zipformer. However, these architectures lack the modularity necessary to inject early-exit branches. With the aim of improving the performance of early-exit models, we propose introducing parallel layers into the architecture that process downsampled versions of their inputs in conjunction with the standard processing layers. We show that in this way the speech recognition performance on standard benchmarks improves significantly, at the cost of a small increase in the overall number of model parameters but without affecting the inference time.
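The early-exit behaviour the abstract describes can be sketched as a simple inference loop: run encoder layers one at a time, score the intermediate output with that depth's exit head, and stop as soon as a confidence criterion is met. The confidence measure (mean best-class probability), the toy tanh layers, and the threshold value are assumptions for illustration, not the paper's actual exit criterion:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def early_exit_decode(x, layers, exit_heads, threshold=0.9):
    """Run encoder layers in order; after each one, score the
    intermediate output with its exit head and return as soon as the
    average frame-level confidence clears the threshold."""
    for depth, (layer, head) in enumerate(zip(layers, exit_heads), start=1):
        x = np.tanh(x @ layer)                  # toy encoder layer
        probs = softmax(x @ head)               # exit-branch logits -> probs
        confidence = probs.max(axis=-1).mean()  # mean best-class probability
        if confidence >= threshold:
            return probs, depth                 # confident: exit early
    return probs, depth                         # fell through: full model

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16))                           # 8 frames, 16-dim
layers = [rng.standard_normal((16, 16)) for _ in range(4)]
heads = [rng.standard_normal((16, 32)) for _ in range(4)]  # 32 output tokens
probs, depth = early_exit_decode(x, layers, heads, threshold=0.5)
print(depth)  # number of layers actually executed (at most 4)
```

Lowering the threshold trades accuracy for compute, which is how such a model adapts to time-varying resources at inference time.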
Problem

Research questions and friction points this paper is trying to address.

Dynamic computational load adjustment for edge device ASR
Enhancing early-exit models with parallel downsampling layers
Improving speech recognition performance without increasing inference time
Innovation

Methods, ideas, or system contributions that make the work stand out.

Early-exit architecture for dynamic computation adjustment
Parallel layers processing downsampled inputs
Improved speech recognition with minimal parameter increase