AI Summary
This work addresses the challenge of efficiently scaling cross-modal perception and generation across vision, speech, and language toward Artificial General Intelligence (AGI). We propose a sparse unified multimodal architecture integrating sparse Mixture-of-Experts (MoE), context-aware automatic speech recognition (ASR), high-resolution controllable image generation, and generative segmentation to enable joint training and inference over all three modalities. Our key contributions are: (i) the first sparse architecture simultaneously supporting high-fidelity text rendering, cross-modal consistent editing, and dialect-robust ASR; and (ii) significantly improved spatial consistency in image editing via generative segmentation. Experiments demonstrate state-of-the-art performance on 12 contextual ASR benchmarks, as well as new records on text-to-image generation and segmentation tasks, all while maintaining computational efficiency and scalable model capacity.
Abstract
We propose Ming-Flash-Omni, an upgraded version of Ming-Omni, built upon a sparser Mixture-of-Experts (MoE) variant of Ling-Flash-2.0 with 100 billion total parameters, of which only 6.1 billion are active per token. This architecture enables highly efficient scaling (dramatically improving computational efficiency while significantly expanding model capacity) and empowers stronger unified multimodal intelligence across vision, speech, and language, representing a key step toward Artificial General Intelligence (AGI). Compared to its predecessor, the upgraded version exhibits substantial improvements across multimodal understanding and generation. We significantly advance speech recognition, achieving state-of-the-art performance in contextual ASR and highly competitive results in dialect-aware ASR. In image generation, Ming-Flash-Omni introduces high-fidelity text rendering and demonstrates marked gains in scene consistency and identity preservation during image editing. Furthermore, it introduces generative segmentation, a capability that not only achieves strong standalone segmentation performance but also enhances spatial control in image generation and improves editing consistency. Notably, Ming-Flash-Omni achieves state-of-the-art results in text-to-image generation and generative segmentation, and sets new records on all 12 contextual ASR benchmarks, all within a single unified architecture.