🤖 AI Summary
This work addresses end-to-end translation of speech queries into text that preserves their semantics, eliminating the conventional ASR intermediate step. The proposed method introduces learnable modality adapters that directly bridge self-supervised speech encoders (e.g., wav2vec 2.0) and instruction-tuned large language models (e.g., Llama-2/3), mapping speech representations into the LLM's input space so the model produces semantically coherent text. Crucially, it enables, for the first time, joint optimization of self-supervised speech representations and instruction-tuned LLMs via end-to-end training on English speech–text alignment data. Experiments on multi-task speech understanding benchmarks show substantial gains over cascaded ASR+LLM systems: semantic fidelity improves by 23% while inference latency drops by 40%. These results validate cross-modal semantic alignment as an effective and efficient route to direct speech-to-text translation.
📝 Abstract
With the growing influence of Large Language Models (LLMs), there is increasing interest in integrating speech representations with them to enable more seamless multi-modal processing and speech understanding. This study introduces a novel approach that combines self-supervised speech representations with instruction-tuned LLMs for speech-to-text translation. A modality adapter aligns the extracted speech features with the instruction-tuned LLM and is trained on English-language data. Our experiments demonstrate that this method preserves the semantic content of the input speech and serves as an effective bridge between self-supervised speech models and instruction-tuned LLMs, offering a promising solution for a range of speech understanding applications.
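To make the adapter idea concrete, here is a minimal sketch, assuming the adapter is a single learnable linear projection with temporal downsampling that maps speech-encoder frames into the LLM's embedding space. All dimensions, the stride, and the single-layer design are illustrative assumptions (real systems might use, e.g., 1024-dim wav2vec 2.0 features and 4096-dim Llama embeddings), not the paper's exact architecture.

```python
import random

class ModalityAdapter:
    """Sketch of a modality adapter: downsample speech frames in time,
    then linearly project each frame into the LLM embedding space.
    Parameters are randomly initialised here, not trained."""

    def __init__(self, in_dim, out_dim, stride, seed=0):
        rng = random.Random(seed)
        self.stride = stride
        # Learnable weight matrix and bias (toy initialisation).
        self.w = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
                  for _ in range(out_dim)]
        self.b = [0.0] * out_dim

    def __call__(self, frames):
        # 1) Keep every `stride`-th frame to shorten the sequence the
        #    LLM must attend over.
        kept = frames[::self.stride]
        # 2) Project each kept frame: out = W @ frame + b.
        return [[sum(wi * xi for wi, xi in zip(row, f)) + bi
                 for row, bi in zip(self.w, self.b)]
                for f in kept]

# Toy usage: 8 speech frames of dim 6 become 4 LLM-space vectors of dim 16.
adapter = ModalityAdapter(in_dim=6, out_dim=16, stride=2)
speech_frames = [[0.1 * t] * 6 for t in range(8)]
llm_inputs = adapter(speech_frames)
print(len(llm_inputs), len(llm_inputs[0]))  # 4 16
```

In an end-to-end setup like the one described above, the projected vectors would be prepended or interleaved with the instruction prompt's token embeddings, and the adapter's parameters (and optionally the encoder and LLM) would be updated by backpropagating the text-generation loss.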