StreamChar: Long-Horizon Streaming Character Audio-Video Generation with Decoupled Orchestration

📅 2026-05-25

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work targets real-time streaming audio-video character generation, which must keep speech aligned with the requested transcript, preserve visual identity across chunks, and stay within a strict playback latency budget. The authors propose a decoupled architecture: an LLM-based orchestrator produces frame-aligned audio conditions from the transcript and historical context, while a joint audio-video DiT performs local bidirectional denoising within short temporal windows. A progress-aware pointer aligns partial transcripts with generated audio during rollout training, and a sink-chunk memory serves as a persistent visual anchor to reduce long-horizon drift. A two-stage distillation pipeline, which first compresses the sampler and then fine-tunes the student under online chunk rollouts, enables efficient deployment. On a single H100 GPU, the method runs in real time and offers a favorable trade-off among transcript fidelity, audio-visual synchronization, visual quality, and streaming stability compared with recent joint and audio-driven baselines.

📝 Abstract

Real-time streaming joint audio-video generation for character animation requires a generator to speak the requested transcript, maintain visual identity across chunks, and run within a strict playback budget. These requirements are difficult to satisfy simultaneously: chunk-wise autoregressive generation can accumulate transcript-audio misalignment and visual drift, while the few-step distillation needed for low latency often degrades spatial diversity and temporal quality. We present StreamChar, a streaming framework that separates long-horizon orchestration from short-window audio-video denoising. An LLM-based orchestrator uses the transcript and historical context to produce frame-aligned audio conditions, and a joint audio-video DiT performs local bidirectional denoising with reference and motion-frame conditioning. For efficient deployment, we use a two-stage distillation pipeline that first compresses the sampler and then fine-tunes the student under online chunk rollouts. A progress-aware pointer aligns partial transcripts with generated audio during rollout training, and a sink-chunk memory provides a persistent visual anchor for reducing long-horizon drift. Experiments on short-clip and long-horizon protocols show that StreamChar runs in real time on a single H100 GPU and provides a favorable system-level trade-off among transcript fidelity, audio-visual synchronization, visual quality, and streaming stability compared with recent joint and audio-driven baselines.
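The control flow described in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: every name here (`StreamState`, `orchestrate`, `denoise_chunk`, `stream`, and the chunk-size parameters) is hypothetical, the orchestrator and DiT are replaced by trivial stand-ins, and frames are represented by integer ids. The sketch only shows how a transcript pointer, a persistent sink chunk, and rolling motion frames might be threaded through a chunk-wise rollout.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class StreamState:
    """Hypothetical rollout state; these fields are assumptions, not from the paper."""
    pointer: int = 0                          # index of the next unspoken transcript word
    sink_chunk: Optional[List[int]] = None    # first chunk, kept as a persistent visual anchor
    motion_frames: List[int] = field(default_factory=list)  # tail of the previous chunk


def orchestrate(words: List[str], state: StreamState, words_per_chunk: int) -> List[str]:
    """Stand-in for the LLM orchestrator: select the words to speak in the next chunk."""
    return words[state.pointer:state.pointer + words_per_chunk]


def denoise_chunk(chunk_words: List[str], state: StreamState,
                  frames_per_chunk: int, chunk_index: int) -> List[int]:
    """Stand-in for the joint audio-video DiT: return frame ids for one chunk.

    A real model would condition on chunk_words, state.sink_chunk, and
    state.motion_frames; this toy version only numbers the frames.
    """
    start = chunk_index * frames_per_chunk
    return list(range(start, start + frames_per_chunk))


def stream(transcript: str, words_per_chunk: int = 3, frames_per_chunk: int = 4,
           motion_len: int = 2) -> Tuple[List[Tuple[List[str], List[int]]], StreamState]:
    """Generate chunks until the pointer has consumed the whole transcript."""
    words = transcript.split()
    state = StreamState()
    chunks = []
    while state.pointer < len(words):
        chunk_words = orchestrate(words, state, words_per_chunk)
        frames = denoise_chunk(chunk_words, state, frames_per_chunk, len(chunks))
        if state.sink_chunk is None:
            state.sink_chunk = frames             # anchor is set once and never overwritten
        state.motion_frames = frames[-motion_len:]  # rolling short-term context
        state.pointer += len(chunk_words)         # advance progress through the transcript
        chunks.append((chunk_words, frames))
    return chunks, state


if __name__ == "__main__":
    chunks, state = stream("a b c d e f g")
    for chunk_words, frames in chunks:
        print(chunk_words, frames)
```

In this sketch the sink chunk is simply the first generated chunk and the pointer advances by a fixed word count; how StreamChar actually selects the anchor and estimates transcript progress is not specified in the abstract.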

Problem

Research questions and friction points this paper is trying to address.

streaming character animation

audio-video generation

long-horizon consistency

real-time generation

visual identity preservation

Innovation

Methods, ideas, or system contributions that make the work stand out.

streaming character generation

decoupled orchestration

audio-video DiT