Ichigo: Mixed-Modal Early-Fusion Realtime Voice Assistant

📅 2024-10-20
🏛️ arXiv.org
📈 Citations: 3
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses key challenges in joint speech-text understanding and generation (ineffective modality fusion, high latency, and architectural complexity) by proposing an end-to-end mixed-modal large language model for voice-assistant scenarios. The core innovation is an early-fusion architecture built on discrete speech tokenization: a quantized speech tokenizer maps speech into the same token space as text, removing the need for multimodal adapters, while multilingual ASR pretraining, instruction tuning, and a single transformer shared across both modalities enable seamless cross-modal reasoning and generation. On speech question-answering benchmarks, the model achieves state-of-the-art performance among open-source speech language models, with a first-token latency of only 111 ms, matching the accuracy of cascaded systems while substantially improving real-time responsiveness and modeling consistency.

📝 Abstract
Large Language Models (LLMs) have revolutionized natural language processing, but their application to speech-based tasks remains challenging due to the complexities of integrating audio and text modalities. This paper introduces Ichigo, a mixed-modal model that seamlessly processes interleaved sequences of speech and text. Utilizing a tokenized early-fusion approach, Ichigo quantizes speech into discrete tokens and employs a uniform transformer-based architecture for both speech and text modalities. This method enables joint reasoning and generation across modalities without the need for separate adapters. We present a comprehensive training methodology, including pre-training on multilingual speech recognition datasets and fine-tuning on a curated instruction dataset. Ichigo demonstrates state-of-the-art performance on speech question-answering benchmarks, outperforming existing open-source speech language models and achieving comparable results to cascaded systems. Notably, Ichigo exhibits a latency of just 111 ms to first token generation, significantly lower than current models. Our approach not only advances the field of multimodal AI but also provides a framework for smaller research teams to contribute effectively to open-source speech-language models.
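The tokenized early-fusion idea in the abstract can be illustrated with a minimal sketch. All names and sizes below are assumptions for illustration, not Ichigo's actual vocabulary layout: discrete codec indices from a quantized speech tokenizer are offset past the text vocabulary (with sentinel tokens marking the speech span) so that speech and text share one token stream fed to a single transformer.

```python
# Hypothetical sketch of tokenized early fusion: discrete speech tokens
# from a quantized tokenizer are mapped into an extended LLM vocabulary
# and interleaved with text tokens, so one transformer sees a single stream.

TEXT_VOCAB_SIZE = 32000      # assumed base LLM vocabulary size
SPEECH_CODEBOOK_SIZE = 512   # assumed speech codebook size

def speech_to_llm_ids(codec_ids):
    """Offset codec indices past the text vocabulary, wrapped in sentinels."""
    sot = TEXT_VOCAB_SIZE      # hypothetical <|sound_start|> token ID
    eot = TEXT_VOCAB_SIZE + 1  # hypothetical <|sound_end|> token ID
    offset = TEXT_VOCAB_SIZE + 2
    return [sot] + [offset + i for i in codec_ids] + [eot]

def build_mixed_sequence(text_ids, codec_ids):
    """Interleave a speech segment with text in one token stream."""
    return speech_to_llm_ids(codec_ids) + text_ids

# Example: three codec indices followed by a short text prompt.
mixed = build_mixed_sequence(text_ids=[10, 11, 12], codec_ids=[5, 7, 5])
print(mixed)  # speech tokens occupy IDs >= 32000; text IDs stay in their range
```

Because the combined sequence is ordinary token IDs, no per-modality adapter is needed; the model's embedding table is simply extended to cover the speech codebook and sentinel tokens.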
Problem

Research questions and friction points this paper is trying to address.

Integrating audio and text modalities for speech tasks
Enabling joint reasoning across speech and text
Reducing latency in speech-language model responses
Innovation

Methods, ideas, or system contributions that make the work stand out.

Early-fusion tokenized speech and text processing
Uniform transformer architecture for both modalities
Low-latency real-time speech question answering
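The latency figure above refers to time-to-first-token (TTFT): the delay between submitting a request and receiving the first generated token. A minimal sketch of how such a metric is measured, using a dummy generator in place of a real streaming model (the sleep durations are placeholders, not Ichigo's actual timings):

```python
# Hypothetical sketch of measuring time-to-first-token (TTFT), the latency
# metric reported for Ichigo (111 ms). A dummy generator stands in for the model.
import time

def generate_tokens():
    """Stand-in for a streaming model; yields tokens one at a time."""
    time.sleep(0.01)  # simulated prefill delay before the first token
    yield "Hello"
    for tok in [",", " world"]:
        time.sleep(0.005)  # simulated per-token decode delay
        yield tok

start = time.perf_counter()
stream = generate_tokens()
first = next(stream)  # blocks until the first token arrives
ttft_ms = (time.perf_counter() - start) * 1000
rest = "".join(stream)
print(f"TTFT: {ttft_ms:.1f} ms, output: {first + rest}")
```

TTFT matters for voice assistants because audio playback can begin as soon as the first tokens arrive, so perceived responsiveness is governed by this delay rather than total generation time.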