LASER: Tuning-Free LLM-Driven Attention Control for Efficient Text-conditioned Image-to-Animation

πŸ“… 2024-04-21
πŸ›οΈ arXiv.org
πŸ“ˆ Citations: 1
✨ Influential: 0
πŸ€– AI Summary
Existing text-to-image models struggle to generate semantically aligned, motion-coherent, fine-grained animations from static images under zero-shot, no-fine-tuning conditions. To address this, we propose the first LLM-driven, training-free image-to-animation framework: a large language model parses textual instructions to guide controllable injection into pre-trained diffusion models at both the attention and feature levels. Our method integrates initial noise inversion, interpolated text embeddings, and cross-frame noise consistency constraints to ensure semantic stability and morphological continuity. Crucially, it requires no fine-tuning or additional training. Evaluated on our newly constructed Text-conditioned Image-to-Animation Benchmark, our approach significantly improves inter-frame consistency and text fidelity over prior methods. It enables diverse, high-fidelity, zero-shot animation synthesis while preserving input image semantics and adhering strictly to user-provided motion directives.
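Two of the ingredients named above, reusing the same initial (inverted) noise for every frame and interpolating the text embeddings between a source and a target prompt, can be illustrated with a minimal NumPy sketch. The function names, latent shape, and the use of spherical interpolation are assumptions for illustration, not details taken from the paper.

```python
import numpy as np

def slerp(e0, e1, t, eps=1e-7):
    """Spherical interpolation between two text-embedding vectors.

    Slerp is a common choice for interpolating diffusion conditioning;
    whether LASER uses slerp or plain lerp is not specified here."""
    e0n = e0 / (np.linalg.norm(e0) + eps)
    e1n = e1 / (np.linalg.norm(e1) + eps)
    dot = np.clip(np.dot(e0n, e1n), -1.0, 1.0)
    omega = np.arccos(dot)
    if omega < eps:  # nearly parallel vectors: fall back to linear interpolation
        return (1 - t) * e0 + t * e1
    return (np.sin((1 - t) * omega) * e0 + np.sin(t * omega) * e1) / np.sin(omega)

def make_frame_conditions(src_emb, tgt_emb, n_frames, seed=0):
    """One shared initial noise latent plus per-frame interpolated embeddings."""
    rng = np.random.default_rng(seed)
    # Stand-in for the inverted initial noise of the input image; every
    # frame starts denoising from this same latent for cross-frame coherence.
    init_noise = rng.standard_normal((4, 64, 64))
    ts = np.linspace(0.0, 1.0, n_frames)
    embs = [slerp(src_emb, tgt_emb, t) for t in ts]
    return init_noise, embs
```

Sharing `init_noise` across frames is what keeps the animation anchored to the input image; only the conditioning embedding moves from frame to frame.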

πŸ“ Abstract
Revolutionary advancements in text-to-image models have unlocked new dimensions for sophisticated content creation, such as text-conditioned image editing, enabling the modification of existing images based on textual guidance. This capability allows for the generation of diverse images that convey highly complex visual concepts. However, existing methods primarily focus on generating new images from text-image pairs and struggle to produce fine-grained animations from existing images and textual guidance without fine-tuning. In this paper, we introduce LASER, a tuning-free LLM-driven attention control framework that follows a progressive process: LLM planning, feature-attention injection, and stable animation generation. LASER leverages a large language model (LLM) to refine general descriptions into fine-grained prompts, guiding pre-trained text-to-image models to generate aligned keyframes with subtle variations. The LLM also generates control signals for feature and attention injections, enabling seamless text-guided image morphing for various transformations without additional fine-tuning. By using the same initial noise inversion from the input image, LASER receives LLM-controlled injections during denoising and leverages interpolated text embeddings to produce a series of coherent animation frames. We propose a Text-conditioned Image-to-Animation Benchmark to validate the effectiveness of LASER. Extensive experiments demonstrate that LASER achieves impressive results in consistent and efficient animation generation, establishing it as a powerful tool for producing detailed animations and opening new avenues in digital content creation.
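The "feature-attention injection" step described in the abstract can be sketched as a small helper in the spirit of attention-injection editing methods (e.g., Prompt-to-Prompt-style cross-attention replacement). In LASER the injection schedule is produced by the LLM; here the cutoff `tau` and the `token_mask` argument are illustrative assumptions, not the paper's actual control signals.

```python
import numpy as np

def inject_attention(target_attn, source_attn, step, total_steps, tau=0.4,
                     token_mask=None):
    """Replace target cross-attention with source attention during early
    denoising steps (step / total_steps < tau), optionally only for the
    tokens selected by token_mask.

    Both attention arrays have shape (heads, queries, tokens)."""
    if step / total_steps >= tau:
        return target_attn          # late steps: let the target evolve freely
    if token_mask is None:
        return source_attn.copy()   # full injection preserves overall layout
    out = target_attn.copy()
    # Per-token injection: keep the source's attention only for selected tokens.
    out[..., token_mask] = source_attn[..., token_mask]
    return out
```

Injecting only during early steps is the usual rationale: early denoising fixes global structure, while later steps refine appearance, so stopping injection at `tau` lets the target prompt take over the details.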
Problem

Research questions and friction points this paper is trying to address.

Generating fine-grained animations from existing images and text without fine-tuning
Controlling attention in pre-trained text-to-image models via LLM planning
Enabling seamless text-guided image morphing across diverse transformations
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-driven attention control framework
Tuning-free text-guided image morphing
Stable, coherent animation generation
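The "LLM planning" contribution amounts to having the LLM emit structured control signals that the denoising loop then consumes. A minimal sketch of that hand-off is below; the JSON field names and step ranges are hypothetical stand-ins, since the paper's actual plan format is not reproduced here.

```python
import json

# Hypothetical shape of an LLM-produced control plan; field names are
# illustrative, not taken from the paper.
PLAN = json.loads("""
{
  "keyframe_prompts": ["a closed red rosebud",
                       "a half-open red rose",
                       "a fully bloomed red rose"],
  "attention_injection": {"start_step": 0, "end_step": 20},
  "feature_injection":   {"start_step": 0, "end_step": 10}
}
""")

def injection_schedule(plan, total_steps=50):
    """Expand the plan into per-step boolean flags for the denoising loop."""
    attn = plan["attention_injection"]
    feat = plan["feature_injection"]
    return [
        {"step": s,
         "inject_attention": attn["start_step"] <= s < attn["end_step"],
         "inject_features":  feat["start_step"] <= s < feat["end_step"]}
        for s in range(total_steps)
    ]
```

Validating and expanding the plan before the loop keeps the LLM out of the per-step hot path: the model is queried once, and the diffusion sampler only reads precomputed flags.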