🤖 AI Summary
This work addresses key challenges in joint speech-text understanding and generation, namely ineffective modality fusion, high latency, and architectural complexity, by proposing an end-to-end mixed-modal large language model tailored to voice-assistant scenarios. The core innovation is an early-fusion architecture built on discrete speech tokenization: a quantized speech tokenizer unifies speech and text representations, eliminating the need for multimodal adapters, and, combined with multilingual ASR pretraining, instruction tuning, and a Transformer fully shared across modalities, it enables seamless cross-modal joint inference. On speech question-answering benchmarks, the model achieves state-of-the-art performance among open-source models, with a first-token latency of only 111 ms, matching the accuracy of cascaded systems while significantly improving real-time responsiveness and modeling consistency.
📝 Abstract
Large Language Models (LLMs) have revolutionized natural language processing, but their application to speech-based tasks remains challenging due to the complexities of integrating audio and text modalities. This paper introduces Ichigo, a mixed-modal model that seamlessly processes interleaved sequences of speech and text. Utilizing a tokenized early-fusion approach, Ichigo quantizes speech into discrete tokens and employs a uniform transformer-based architecture for both speech and text modalities. This method enables joint reasoning and generation across modalities without the need for separate adapters. We present a comprehensive training methodology, including pre-training on multilingual speech recognition datasets and fine-tuning on a curated instruction dataset. Ichigo demonstrates state-of-the-art performance on speech question-answering benchmarks, outperforming existing open-source speech language models and achieving comparable results to cascaded systems. Notably, Ichigo exhibits a latency of just 111 ms to first token generation, significantly lower than current models. Our approach not only advances the field of multimodal AI but also provides a framework for smaller research teams to contribute effectively to open-source speech-language models.
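The tokenized early-fusion idea described in the abstract (quantize speech into discrete tokens, then feed speech and text through one shared transformer) can be sketched as follows. This is a minimal illustration, not Ichigo's actual implementation: the vocabulary size, codebook size, frame dimension, and all function names here are assumptions chosen for clarity.

```python
import numpy as np

# Illustrative sizes only -- not Ichigo's real configuration.
TEXT_VOCAB_SIZE = 32000   # assumed base text vocabulary
NUM_SPEECH_CODES = 512    # assumed speech codebook size

rng = np.random.default_rng(0)
codebook = rng.normal(size=(NUM_SPEECH_CODES, 64))  # toy VQ codebook

def quantize_speech(frames: np.ndarray) -> np.ndarray:
    """Map each 64-dim speech frame to its nearest codebook entry
    (vector quantization), yielding discrete speech token ids."""
    # Squared distance from every frame to every code: shape (T, K).
    d = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d.argmin(axis=1)

def to_shared_ids(speech_codes: np.ndarray) -> np.ndarray:
    """Offset speech codes past the text vocabulary so a single
    embedding table (and one transformer) covers both modalities,
    with no separate multimodal adapter."""
    return speech_codes + TEXT_VOCAB_SIZE

# Interleave a (pretend-tokenized) text prompt with quantized speech
# into one mixed-modal token sequence for the shared model.
text_ids = np.array([12, 345, 678])
speech = rng.normal(size=(5, 64))   # 5 toy speech frames
mixed = np.concatenate([text_ids, to_shared_ids(quantize_speech(speech))])

# Every id fits in the unified text+speech vocabulary.
assert mixed.max() < TEXT_VOCAB_SIZE + NUM_SPEECH_CODES
```

The key design point this sketch captures is that after quantization, speech is "just more tokens": generation and reasoning over interleaved speech and text reduce to ordinary next-token prediction over the extended vocabulary.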