Input Conditioned Layer Dropping in Speech Foundation Models

📅 2025-07-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenge of adapting pre-trained speech foundation models to dynamic computational resources in edge and IoT environments, this paper proposes an input-driven lightweight layer-skipping mechanism. Without altering the pre-trained model architecture, the approach employs a lightweight selection network that, conditioned on the input features, dynamically determines which layers to execute during inference — enabling input-conditioned adaptive computation. Unlike existing layer-dropping methods, it avoids architectural redesign and eliminates coarse-grained, stochastic skipping. Experiments on four public speech benchmarks demonstrate that the method significantly reduces computational load while matching or surpassing the accuracy of baselines such as early exiting. These results validate both its efficiency and its seamless compatibility with existing pre-trained models.
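The mechanism above can be sketched in a few lines. This is a minimal NumPy illustration, not the paper's implementation: the shapes, the sigmoid gating, the mean-pooling of input features, and the threshold are all assumptions made for the sketch; only the overall idea (a lightweight network scores each backbone layer from the input, and low-scoring layers are skipped) comes from the summary.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_LAYERS, DIM = 6, 8

# Stand-ins for frozen pre-trained backbone layers (left unmodified).
layers = [rng.standard_normal((DIM, DIM)) * 0.1 for _ in range(NUM_LAYERS)]

# Hypothetical lightweight selection network: one linear map from
# mean-pooled input features to a per-layer gate score.
W_sel = rng.standard_normal((DIM, NUM_LAYERS)) * 0.1

def forward(x, keep_threshold=0.5):
    """Run only the layers whose input-conditioned gate exceeds the threshold."""
    pooled = x.mean(axis=0)                    # summarize the input utterance
    gates = 1.0 / (1.0 + np.exp(-pooled @ W_sel))  # per-layer keep score
    keep = gates > keep_threshold              # hard, input-dependent mask
    h = x
    for i, layer in enumerate(layers):
        if keep[i]:                            # dropped layers cost nothing
            h = h + h @ layer                  # residual layer application
    return h, keep

x = rng.standard_normal((20, DIM))             # e.g. 20 speech feature frames
out, keep = forward(x)
print("layers executed:", int(keep.sum()), "of", NUM_LAYERS)
```

In a real model the gate would be trained jointly with a compute budget, and the skipped blocks would be full transformer layers; the sketch only shows how per-input hard selection differs from the random, input-agnostic dropping it replaces.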

📝 Abstract
Curating foundation speech models for edge and IoT settings, where computational resources vary over time, requires dynamic architectures featuring adaptable reduction strategies. One emerging approach is layer dropping ($\mathcal{LD}$), which skips a fraction of the layers of a backbone network during inference to reduce the computational load. This allows transforming static models into dynamic ones. However, existing approaches exhibit limitations either in how they select layers or by significantly modifying the neural architecture. To this end, we propose input-driven $\mathcal{LD}$, which employs the network's input features and a lightweight layer-selecting network to determine the optimal combination of processing layers. Extensive experimentation on four public speech and audio benchmarks, using two different pre-trained foundation models, demonstrates the effectiveness of our approach, thoroughly outperforming random dropping and producing results on par with (or better than) early exiting.
Problem

Research questions and friction points this paper is trying to address.

Dynamic layer dropping for edge/IoT speech models
Input-driven layer selection to optimize computation
Improving efficiency without sacrificing model performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Input-driven layer dropping for dynamic computation
Lightweight network selects optimal processing layers
Outperforms random dropping and early exit
Abdul Hannan
University of Trento, Italy
Daniele Falavigna
Fondazione Bruno Kessler, Italy
Alessio Brutti
Fondazione Bruno Kessler (FBK), Italy
audio/speech processing