Speech Meets ELF: Audio Conditional Continuous-Target Diffusion for Speech Recognition and Translation

📅 2026-06-08

🤖 AI Summary

This work proposes ELF-S2T, a framework that handles automatic speech recognition (ASR) and speech-to-text translation (S2TT) with a single audio-conditioned generative model operating in a continuous latent space. Unlike conventional systems that generate discrete text tokens, ELF-S2T performs continuous-target language modeling with a pretrained Embedded Language Flows (ELF) backbone, conditioned on speech through a frozen Whisper encoder and a single linear projector. Audio forcing during training and classifier-free guidance at inference strengthen the audio condition so that the model does not over-rely on its pretrained text context. Experiments on LibriSpeech and CoVoST2 show competitive ASR and S2TT performance, and error analysis indicates that errors in both tasks stem from the same cause, close-distance confusion in the continuous latent space, which points to a common error mechanism behind recognition and translation.

📝 Abstract

Speech-to-text (S2T) systems for recognition (ASR) and translation (S2TT) typically generate discrete text tokens. In contrast, continuous-target language modelling performs generation in a continuous space, yet its potential for S2T remains unexplored. To bridge this gap, we propose ELF-S2T, an audio-conditioned continuous-target generative model for S2T. Built upon the pretrained Embedded Language Flows (ELF) backbone, ELF-S2T processes speech via a frozen Whisper encoder and a single linear projector, prepending the resulting audio condition to the noisy text latent for in-context, flow-matching denoising. To prevent the model from over-relying on its pretrained text context, we introduce audio forcing during training, and further amplify the audio condition via classifier-free guidance at inference. Experiments on LibriSpeech and CoVoST2 show that ELF-S2T achieves competitive ASR and S2TT performance. Crucially, our error analysis reveals that, although ASR and S2TT errors look very different on the surface, both stem from the same underlying cause: close-distance confusion in the continuous latent space. This finding aligns naturally with the continuous-representation generation paradigm and indicates a common semantic mapping process beneath recognition and translation. Our code and pretrained models are publicly available at https://github.com/Sslnon/ELF-S2T.
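
The inference path described in the abstract (project the audio features, prepend them to the noisy text latent, denoise with a flow-matching velocity field, and amplify the audio condition with classifier-free guidance) can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not taken from the released code: `project_audio`, `toy_velocity`, and `sample` are hypothetical names, the velocity field is a hand-written stand-in for the ELF backbone, the time axis runs from noise at t = 0 to data at t = 1, the guided velocity uses the common form v_uncond + w * (v_cond - v_uncond), and the unconditional branch simply drops the audio prefix. Audio forcing, which applies during training, is not modelled here.

```python
import random


def project_audio(frames, weight):
    """Single linear projector: map each encoder frame to the latent width.

    `weight` has one row per output dimension (hypothetical layout).
    """
    return [[sum(w * x for w, x in zip(row, frame)) for row in weight]
            for frame in frames]


def toy_velocity(audio_cond, noisy_text, t):
    """Stand-in for the denoiser (NOT the real ELF backbone).

    The audio condition is prepended to the noisy text latent, and a velocity
    is returned for the text positions only. The toy target is the mean audio
    vector when audio is present, and the zero vector otherwise.
    """
    sequence = audio_cond + noisy_text  # in-context conditioning by prepending
    n_audio = len(audio_cond)
    dim = len(noisy_text[0])
    if n_audio:
        target = [sum(f[d] for f in sequence[:n_audio]) / n_audio
                  for d in range(dim)]
    else:
        target = [0.0] * dim
    # Velocity of a straight path that reaches `target` at t = 1.
    return [[(target[d] - tok[d]) / (1.0 - t) for d in range(dim)]
            for tok in sequence[n_audio:]]


def sample(audio_cond, n_tokens, dim, guidance, steps=8, seed=0):
    """Euler integration from noise (t = 0) towards data (t = 1) with CFG."""
    rng = random.Random(seed)
    x = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_tokens)]
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v_cond = toy_velocity(audio_cond, x, t)  # with the audio prefix
        v_uncond = toy_velocity([], x, t)        # audio prefix dropped
        x = [[xi + dt * (vu + guidance * (vc - vu))
              for xi, vc, vu in zip(x_row, vc_row, vu_row)]
             for x_row, vc_row, vu_row in zip(x, v_cond, v_uncond)]
    return x


if __name__ == "__main__":
    frames = [[1.0, 2.0], [3.0, 4.0]]    # placeholder for encoder outputs
    identity = [[1.0, 0.0], [0.0, 1.0]]  # placeholder projector weights
    cond = project_audio(frames, identity)
    print(sample(cond, n_tokens=3, dim=2, guidance=2.0))
```

In this toy setting a guidance weight of 1 reproduces the audio-conditioned target, 0 falls back to the unconditional target, and values above 1 extrapolate past the conditional target in the direction away from the unconditional one, which is the sense in which guidance amplifies the audio condition.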

Problem

Research questions and friction points this paper is trying to address.

speech recognition

speech translation

continuous representation

latent space

discrete vs continuous generation

Innovation

Methods, ideas, or system contributions that make the work stand out.

continuous-target generation

audio-conditioned diffusion

flow matching