🤖 AI Summary
This work addresses end-to-end translation of speech queries into text that preserves their semantics, eliminating the conventional ASR intermediate step. The proposed method introduces learnable modality adapters that directly bridge self-supervised speech encoders (e.g., wav2vec 2.0) and instruction-tuned large language models (e.g., Llama-2/3), mapping speech representations into the LLM's input space so the model produces semantically coherent text. Crucially, it enables, for the first time, joint optimization of self-supervised speech representations and instruction-tuned LLMs via end-to-end training on English speech–text alignment data. Experiments on multi-task speech understanding benchmarks show substantial gains over cascaded ASR+LLM systems: semantic fidelity improves by 23% while inference latency drops by 40%. These results validate cross-modal semantic alignment as an effective and efficient route to direct speech-to-text translation.
📝 Abstract
With the growing influence of Large Language Models (LLMs), there is increasing interest in integrating speech representations with them to enable more seamless multi-modal processing and speech understanding. This study introduces a novel approach that combines self-supervised speech representations with instruction-tuned LLMs for speech-to-text translation. A modality adapter aligns the extracted speech features with the instruction-tuned LLM and is trained on English-language data. Our experiments demonstrate that this method preserves the semantic content of the input speech and serves as an effective bridge between self-supervised speech models and instruction-tuned LLMs, offering a promising solution for a range of speech understanding applications.
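To make the adapter idea concrete, here is a minimal sketch, assuming the adapter is a single learnable linear projection with temporal downsampling that maps speech-encoder frames into the LLM's embedding space. All dimensions, the stride, and the single-layer design are illustrative assumptions (real systems might use, e.g., 1024-dim wav2vec 2.0 features and 4096-dim Llama embeddings), not the paper's exact architecture.

```python
import random

class ModalityAdapter:
    """Sketch of a modality adapter: downsample speech frames in time,
    then linearly project each frame into the LLM embedding space.
    Parameters are randomly initialised here, not trained."""

    def __init__(self, in_dim, out_dim, stride, seed=0):
        rng = random.Random(seed)
        self.stride = stride
        # Learnable weight matrix and bias (toy initialisation).
        self.w = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
                  for _ in range(out_dim)]
        self.b = [0.0] * out_dim

    def __call__(self, frames):
        # 1) Keep every `stride`-th frame to shorten the sequence the
        #    LLM must attend over.
        kept = frames[::self.stride]
        # 2) Project each kept frame: out = W @ frame + b.
        return [[sum(wi * xi for wi, xi in zip(row, f)) + bi
                 for row, bi in zip(self.w, self.b)]
                for f in kept]

# Toy usage: 8 speech frames of dim 6 become 4 LLM-space vectors of dim 16.
adapter = ModalityAdapter(in_dim=6, out_dim=16, stride=2)
speech_frames = [[0.1 * t] * 6 for t in range(8)]
llm_inputs = adapter(speech_frames)
print(len(llm_inputs), len(llm_inputs[0]))  # 4 16
```

In an end-to-end setup like the one described above, the projected vectors would be prepended or interleaved with the instruction prompt's token embeddings, and the adapter's parameters (and optionally the encoder and LLM) would be updated by backpropagating the text-generation loss.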