🤖 AI Summary
Existing concept control methods rely on model retraining, incurring high computational cost and generalizing poorly across diffusion models. This work proposes a lightweight, plug-and-play framework for continuous concept control that requires no fine-tuning of either the diffusion model or the text encoder. It identifies low-rank semantic directions within a frozen pre-trained CLIP text encoder and employs LoRA adapters to enable learnable, continuous sliding control in the text embedding space. The approach supports multi-concept composition and layout preservation, and is natively compatible with diverse text-to-image and text-to-video diffusion models. Experiments demonstrate that the method trains 5× faster than Concept Slider and 47× faster than Attribute Control, while reducing GPU memory consumption by 2–4×. These improvements significantly enhance scalability and real-time editing capability.
📝 Abstract
Recent advances in diffusion models have significantly improved image and video synthesis. In addition, several concept control methods have been proposed to enable fine-grained, continuous, and flexible control over free-form text prompts. However, these methods not only require intensive training time and GPU memory to learn the sliders or embeddings but also need to be retrained for different diffusion backbones, limiting their scalability and adaptability. To address these limitations, we introduce Text Slider, a lightweight, efficient, and plug-and-play framework that identifies low-rank directions within a pre-trained text encoder, enabling continuous control of visual concepts while significantly reducing training time, GPU memory consumption, and the number of trainable parameters. Furthermore, Text Slider supports multi-concept composition and continuous control, enabling fine-grained and flexible manipulation in both image and video synthesis. We show that Text Slider enables smooth and continuous modulation of specific attributes while preserving the original spatial layout and structure of the input. Text Slider achieves significantly better efficiency: 5× faster training than Concept Slider and 47× faster than Attribute Control, while reducing GPU memory usage by nearly 2× and 4×, respectively.
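The core mechanism the abstract describes (a learnable low-rank direction applied to a frozen text encoder's output, scaled by a continuous slider) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the embedding dimension, rank, matrix names `A`/`B`, and the random "learned" weights are all assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 768, 4  # example CLIP text-embedding dim and low rank (r << d)

# Hypothetical low-rank LoRA-style direction learned for one concept
# (e.g. "age"): delta(e) = B @ (A @ e). In the paper these would be
# trained; here they are random placeholders.
A = rng.normal(scale=0.02, size=(r, d))
B = rng.normal(scale=0.02, size=(d, r))

def slide(text_embedding: np.ndarray, scale: float) -> np.ndarray:
    """Apply the low-rank concept direction with a continuous slider scale.

    scale = 0 reproduces the original embedding exactly; the text
    encoder and the diffusion backbone remain frozen throughout.
    """
    return text_embedding + scale * (B @ (A @ text_embedding))

e = rng.normal(size=d)            # frozen text embedding of the prompt
e_strong = slide(e, scale=1.5)    # push the concept in one direction
e_weak = slide(e, scale=-1.5)     # or the opposite direction
assert np.allclose(slide(e, 0.0), e)  # slider at 0 is the identity
```

Because each concept contributes an additive low-rank delta, multi-concept composition would amount to summing several such deltas with independent scales, which is consistent with the composability the abstract claims.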