Lumos-Nexus: Efficient Frequency Bridging with Homogeneous Latent Space for Video Unified Models

📅 2026-05-29
📈 Citations: 0
Influential: 0

🤖 AI Summary
Existing connector-based unified video models cannot afford to include a large high-fidelity generator in the unified training loop, which limits generation quality. This work proposes Lumos-Nexus, a framework that aligns only a lightweight generator with the understanding module during training. At inference time, Unified Progressive Frequency Bridging (UPFB) progressively hands generation off, within a shared latent space, from the lightweight generator to a pretrained high-capacity generator. This yields semantically controllable, high-fidelity video synthesis without compromising reasoning quality. The approach improves visual realism and temporal coherence on VBench, and the work also introduces VR-Bench, a benchmark for reasoning-driven video generation.
📝 Abstract
Connector-based video unified models have demonstrated strong capability in instruction-grounded video synthesis, but integrating a large high-fidelity generator into the unified training loop is computationally prohibitive, limiting achievable visual quality. We therefore propose Lumos-Nexus, a training-efficient unified video generation framework that facilitates the development of strong reasoning-driven generation capabilities while significantly enhancing visual fidelity. Lumos-Nexus adopts a two-stage design: 1) During training, only a lightweight generator is aligned with the understanding block to learn to take in reasoning-driven semantic control. 2) During inference, we introduce Unified Progressive Frequency Bridging (UPFB) to progressively hand off generation to a high-capacity pretrained generator in the shared latent space, enabling coarse-to-fine refinement and producing high-fidelity videos without compromising reasoning quality. To fill the gap in reasoning-driven video generation benchmarks, we introduce VR-Bench, which assesses a model's capability to translate inferred intent into coherent and semantically aligned video content. Extensive experiments demonstrate that Lumos-Nexus achieves substantial gains in visual realism and temporal coherence on VBench, while exhibiting strong reasoning-based generative performance on VR-Bench. Code and models are available at https://jiazheng-xing.github.io/nexus-lumos-home/.
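The abstract does not spell out how UPFB is implemented, so the following is only an illustrative sketch of one way a coarse-to-fine handoff in a shared latent space could look. It assumes that both generators produce latents of the same shape, that low spatial frequencies are taken from the lightweight generator and high frequencies from the high-capacity one, and that the low-pass cutoff shrinks linearly over the sampling steps. The function names (`low_pass_mask`, `bridge_latents`, `cutoff_schedule`), the linear schedule, and the default cutoff values are all hypothetical and are not taken from the paper; plain NumPy arrays stand in for real generator outputs.

```python
import numpy as np


def low_pass_mask(shape, cutoff):
    """Return a 2D mask that is 1 where the normalized frequency radius < cutoff.

    Frequencies come from np.fft.fftfreq, so the radius lies in [0, ~0.707].
    """
    h, w = shape
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    radius = np.sqrt(fx ** 2 + fy ** 2)
    return (radius < cutoff).astype(np.float64)


def bridge_latents(light, heavy, cutoff):
    """Mix two same-shaped latents in the spatial frequency domain.

    Assumption (not from the paper): frequencies below `cutoff` are taken
    from `light`, the rest from `heavy`. The last two axes are treated as
    the spatial axes (height, width).
    """
    if light.shape != heavy.shape:
        raise ValueError("latents must share one shape (homogeneous latent space)")
    mask = low_pass_mask(light.shape[-2:], cutoff)
    light_f = np.fft.fft2(light, axes=(-2, -1))
    heavy_f = np.fft.fft2(heavy, axes=(-2, -1))
    mixed = mask * light_f + (1.0 - mask) * heavy_f
    # The mask is symmetric in frequency, so the inverse transform is real
    # up to floating-point error; keep only the real part.
    return np.fft.ifft2(mixed, axes=(-2, -1)).real


def cutoff_schedule(step, num_steps, start=0.75, end=0.0):
    """Hypothetical linear schedule: the cutoff moves from `start` to `end`.

    With the defaults, step 0 uses only the lightweight latent and the
    final step uses only the high-capacity latent.
    """
    if num_steps <= 1:
        return end
    t = step / (num_steps - 1)
    return start + (end - start) * t
```

In a sampling loop, one would call `bridge_latents(light_t, heavy_t, cutoff_schedule(t, num_steps))` at each step, so that the lightweight generator's contribution shrinks toward the coarsest structure and then vanishes. Whether the actual method mixes predictions per step, operates on temporal frequencies as well, or uses a different schedule is not stated in the abstract.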
Problem

Research questions and friction points this paper is trying to address.

video unified models
high-fidelity generation
computational cost
reasoning-driven video synthesis
visual quality
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified Progressive Frequency Bridging
homogeneous latent space
reasoning-driven video generation
video unified models
training-efficient framework