SparQLe: Speech Queries to Text Translation Through LLMs

📅 2025-02-13
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses end-to-end semantic translation from speech queries to text, eliminating the conventional ASR intermediate step. The proposed method introduces learnable modality adapters that directly bridge self-supervised speech encoders (e.g., wav2vec 2.0) and instruction-tuned large language models (e.g., Llama-2/3), mapping speech representations to semantically coherent text outputs. Crucially, it enables joint optimization of self-supervised speech representations and instruction-tuned LLMs—achieved for the first time—via end-to-end training on English speech–text alignment data. Experiments on multi-task speech understanding benchmarks demonstrate substantial improvements over cascaded ASR+LLM systems: +23% in semantic fidelity and −40% in inference latency. These results validate the effectiveness and efficiency of cross-modal semantic alignment for direct speech-to-text translation.

Technology Category

Application Category

📝 Abstract
With the growing influence of Large Language Models (LLMs), there is increasing interest in integrating speech representations with them to enable more seamless multi-modal processing and speech understanding. This study introduces a novel approach that leverages self-supervised speech representations in combination with instruction-tuned LLMs for speech-to-text translation. The proposed approach leverages a modality adapter to align extracted speech features with instruction-tuned LLMs using English-language data. Our experiments demonstrate that this method effectively preserves the semantic content of the input speech and serves as an effective bridge between self-supervised speech models and instruction-tuned LLMs, offering a promising solution for various speech understanding applications.
Problem

Research questions and friction points this paper is trying to address.

Integrating speech with LLMs
Speech-to-text translation
Aligning speech features with LLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-supervised speech representations integration
Modality adapter for feature alignment
Speech-to-text translation with LLMs
🔎 Similar Papers
No similar papers found.