EchoGen: Generating Visual Echoes in Any Scene via Feed-Forward Subject-Driven Auto-Regressive Model

📅 2025-09-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing subject-driven generation methods face a fundamental trade-off between efficiency and zero-shot capability: fine-tuning paradigms are computationally expensive, while diffusion-based feed-forward approaches suffer from slow inference. This paper introduces the first feed-forward subject-driven generation framework built on Visual Auto-Regressive (VAR) modeling. Its core innovation is a dual-path injection mechanism that decouples high-level semantic identity from low-level detailed features, implemented via a semantic/content dual-encoder architecture. This design combines decoupled cross-attention and multi-modal attention to enable precise, disentangled control over identity and appearance. The method supports zero-shot transfer without retraining, achieving image quality and subject fidelity comparable to state-of-the-art diffusion models while drastically reducing sampling latency. Extensive quantitative and qualitative experiments demonstrate its superior efficiency, fine-grained controllability, and strong generalization across diverse subjects and poses.

📝 Abstract
Subject-driven generation is a critical task in creative AI, yet current state-of-the-art methods present a stark trade-off. They either rely on computationally expensive, per-subject fine-tuning, sacrificing efficiency and zero-shot capability, or employ feed-forward architectures built on diffusion models, which are inherently plagued by slow inference speeds. Visual Auto-Regressive (VAR) models are renowned for their rapid sampling speeds and strong generative quality, making them an ideal yet underexplored foundation for resolving this tension. To bridge this gap, we introduce EchoGen, a pioneering framework that empowers VAR models with subject-driven generation capabilities. The core design of EchoGen is an effective dual-path injection strategy that disentangles a subject's high-level semantic identity from its low-level fine-grained details, enabling enhanced controllability and fidelity. We employ a semantic encoder to extract the subject's abstract identity, which is injected through decoupled cross-attention to guide the overall composition. Concurrently, a content encoder captures intricate visual details, which are integrated via a multi-modal attention mechanism to ensure high-fidelity texture and structural preservation. To the best of our knowledge, EchoGen is the first feed-forward subject-driven framework built upon VAR models. Both quantitative and qualitative results substantiate our design, demonstrating that EchoGen achieves subject fidelity and image quality comparable to state-of-the-art diffusion-based methods with significantly lower sampling latency. Code and models will be released soon.
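The semantic path described in the abstract can be sketched as a toy decoupled cross-attention layer: one query projection over the VAR token states, with separate key/value projections for the text prompt and for the subject's semantic-encoder tokens. This is a minimal illustration, not the paper's released code; the function name, projection matrices, and gating `scale` are all assumed for exposition.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention: (Nq, d) x (Nk, d) -> (Nq, d)
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

def decoupled_cross_attention(x, text_tok, subj_tok,
                              Wq, Wk_t, Wv_t, Wk_s, Wv_s, scale=1.0):
    """One shared query projection, two separate key/value paths.

    x        : (N, d) image/VAR token states
    text_tok : (T, d) text-prompt embeddings (original conditioning path)
    subj_tok : (S, d) semantic-encoder tokens carrying subject identity
    """
    q = x @ Wq
    out_text = attention(q, text_tok @ Wk_t, text_tok @ Wv_t)  # prompt path
    out_subj = attention(q, subj_tok @ Wk_s, subj_tok @ Wv_s)  # subject path
    return out_text + scale * out_subj  # scale gates subject influence

# Toy usage with random weights
rng = np.random.default_rng(0)
d = 16
x = rng.normal(size=(8, d))
text_tok = rng.normal(size=(5, d))
subj_tok = rng.normal(size=(4, d))
Ws = [rng.normal(size=(d, d)) * 0.1 for _ in range(5)]
out = decoupled_cross_attention(x, text_tok, subj_tok, *Ws)
print(out.shape)  # (8, 16)
```

Because the two paths share queries but keep independent key/value projections, the subject branch can be added to a pretrained conditioning layer without disturbing the original text path.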
Problem

Research questions and friction points this paper is trying to address.

Enables subject-driven generation without fine-tuning
Resolves trade-off between generation quality and speed
Disentangles semantic identity from visual details for control
Innovation

Methods, ideas, or system contributions that make the work stand out.

Feed-forward subject-driven auto-regressive model for visual echoes
Dual-path injection disentangles semantic identity and fine details
Multi-modal attention mechanism ensures high-fidelity texture preservation
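For the content path, the bullets above describe integrating fine-grained detail tokens via multi-modal attention. A common realization (assumed here for illustration; the paper's exact formulation may differ) is to concatenate the content-encoder tokens into the attention sequence so that image queries attend jointly over both modalities:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multimodal_attention(img_tok, content_tok, Wq, Wk, Wv):
    """Image tokens attend over themselves and the subject's content tokens.

    Hypothetical sketch of a multi-modal attention layer; shapes and
    names are illustrative, not the authors' implementation.

    img_tok     : (N, d) VAR image-token states
    content_tok : (C, d) detail tokens from the content encoder
    """
    seq = np.concatenate([content_tok, img_tok], axis=0)  # joint sequence
    q = img_tok @ Wq            # only image positions produce outputs
    k, v = seq @ Wk, seq @ Wv   # keys/values cover both modalities
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return attn @ v

# Toy usage with random weights
rng = np.random.default_rng(1)
d = 16
img_tok = rng.normal(size=(8, d))
content_tok = rng.normal(size=(6, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
out = multimodal_attention(img_tok, content_tok, Wq, Wk, Wv)
print(out.shape)  # (8, 16)
```

Letting image queries attend directly to content-encoder keys/values is what would allow low-level texture and structure to flow into the generated tokens without being compressed through a single pooled identity vector.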
Ruixiao Dong
University of Science and Technology of China
Computer Vision · Image and Video Generation
Zhendong Wang
Institute of Artificial Intelligence, Hefei Comprehensive National Science Center
Keli Liu
Institute of Artificial Intelligence, Hefei Comprehensive National Science Center
Li Li
Institute of Artificial Intelligence, Hefei Comprehensive National Science Center
Ying Chen
Alibaba Group - Taobao & Tmall Group
Kai Li
Alibaba Group - Taobao & Tmall Group
Daowen Li
Alibaba Group - Taobao & Tmall Group
Houqiang Li
Professor, Department of Electronic Engineering and Information Science, University of Science and Technology of China
Multimedia Search · Image/Video Analysis · Image/Video Coding