OpenOmni: Large Language Models Pivot Zero-shot Omnimodal Alignment across Languages with Real-time Self-Aware Emotional Speech Synthesis

📅 2025-01-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing open-source multimodal systems face limitations in real-time expressive speech synthesis and zero-shot cross-modal (vision-text-speech) alignment due to data scarcity and modeling complexity. This paper introduces OpenOmni, an open-source large multimodal model featuring a novel two-stage training paradigm: (i) zero-shot cross-modal alignment via vision-to-speech transfer learning, and (ii) emotion-controllable speech generation using a lightweight autoregressive decoder fine-tuned with human preference-based reinforcement learning. Crucially, OpenOmni requires no tri-modal supervised annotations. It achieves state-of-the-art performance across vision-language understanding, speech-language modeling, and end-to-end expressive speech synthesis, outperforming fully supervised baselines in comprehensive multimodal evaluations. The system enables natural, high-fidelity real-time dialogue with end-to-end latency under 300 ms, thereby overcoming both the data bottleneck and the dominance of closed-source solutions.
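Below is a minimal sketch of what the stage-1 alignment step might look like in PyTorch. It is an illustration under assumptions, not the authors' implementation: `OmniModel`, the adapter dimensions, and the dummy tensors are all invented for this example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OmniModel(nn.Module):
    """Toy stand-in for an LLM backbone with per-modality input adapters."""
    def __init__(self, d_model=512, vocab_size=32000):
        super().__init__()
        self.vision_adapter = nn.Linear(1024, d_model)  # image features -> LLM space
        self.speech_adapter = nn.Linear(768, d_model)   # speech features -> LLM space
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, feats, adapter):
        return self.lm_head(self.backbone(adapter(feats)))

model = OmniModel()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Stage 1: omnimodal alignment trained on text-image data only. Because
# speech was already aligned with text during pre-training, the language
# model acts as a pivot and vision-to-speech understanding transfers
# (near) zero-shot, which is why no tri-modal annotations are required.
image_feats = torch.randn(2, 16, 1024)            # dummy vision-encoder output
text_targets = torch.randint(0, 32000, (2, 16))   # dummy caption tokens
logits = model(image_feats, model.vision_adapter)
loss = F.cross_entropy(logits.flatten(0, 1), text_targets.flatten())
loss.backward()
opt.step()
```

The design choice this mirrors is using the language model as a pivot: only image-text supervision is needed at this stage, since speech-text alignment is inherited from pre-training.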

📝 Abstract
Recent advances in omnimodal learning have enabled understanding and generation across images, text, and speech, though mainly within proprietary models. Limited omnimodal datasets and the inherent challenges of real-time emotional speech generation have hindered open-source progress. To address these issues, we propose OpenOmni, a two-stage training method combining omnimodal alignment and speech generation to develop a state-of-the-art omnimodal large language model. In the alignment phase, a pre-trained speech model is further trained on text-image tasks, generalizing from vision to speech in a (near) zero-shot manner and outperforming models trained on tri-modal datasets. In the speech generation phase, a lightweight decoder enables real-time emotional speech through training on speech tasks and preference learning. Experiments demonstrate that OpenOmni consistently improves across omnimodal, vision-language, and speech-language evaluations, enabling natural, emotion-rich dialogue and real-time emotional speech generation.
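As an illustration of the abstract's preference-learning step, here is a DPO-style pairwise objective, one common instantiation of learning from human preferences over chosen versus rejected emotional speech outputs. The abstract does not pin down the exact algorithm, so treat this as a hedged sketch; all tensors below are dummy values.

```python
import torch
import torch.nn.functional as F

def preference_loss(logp_chosen, logp_rejected,
                    ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO-style pairwise loss: nudge the speech decoder toward the
    emotionally preferred output relative to a frozen reference decoder."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -F.logsigmoid(margin).mean()

# Dummy sequence log-probabilities for a batch of 4 preference pairs.
lp_c, lp_r = torch.randn(4), torch.randn(4)
ref_c, ref_r = torch.randn(4), torch.randn(4)
print(preference_loss(lp_c, lp_r, ref_c, ref_r))
```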
Problem

Research questions and friction points this paper is trying to address.

Real-time Emotional Voice Generation
Cross-modal Understanding
Data Scarcity
Innovation

Methods, ideas, or system contributions that make the work stand out.

OpenOmni
Multimodal Learning
Real-time Emotional Speech Generation