🤖 AI Summary
Existing concept control methods rely on model retraining, incurring high computational cost and generalizing poorly across diffusion models. This work proposes a lightweight, plug-and-play framework for continuous concept control that requires no fine-tuning of either the diffusion model or the text encoder. It identifies low-rank semantic directions within a frozen pre-trained CLIP text encoder and employs LoRA adapters to enable learnable, continuous sliding control in the text embedding space. The approach supports multi-concept composition and layout preservation, and is natively compatible with diverse text-to-image and text-to-video diffusion models. Experiments demonstrate that the method trains 5× faster than Concept Slider and 47× faster than Attribute Control, while reducing GPU memory consumption by 2–4×. These improvements significantly enhance scalability and real-time editing capability.
📝 Abstract
Recent advances in diffusion models have significantly improved image and video synthesis. In addition, several concept control methods have been proposed to enable fine-grained, continuous, and flexible control over free-form text prompts. However, these methods not only require intensive training time and GPU memory to learn the sliders or embeddings but also need to be retrained for different diffusion backbones, limiting their scalability and adaptability. To address these limitations, we introduce Text Slider, a lightweight, efficient, and plug-and-play framework that identifies low-rank directions within a pre-trained text encoder, enabling continuous control of visual concepts while significantly reducing training time, GPU memory consumption, and the number of trainable parameters. Furthermore, Text Slider supports multi-concept composition and continuous control, enabling fine-grained and flexible manipulation in both image and video synthesis. We show that Text Slider enables smooth and continuous modulation of specific attributes while preserving the original spatial layout and structure of the input. Text Slider achieves significantly better efficiency: 5× faster training than Concept Slider and 47× faster than Attribute Control, while reducing GPU memory usage by nearly 2× and 4×, respectively.
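The core mechanism the abstract describes (a learnable low-rank direction applied to a frozen text encoder's output, scaled by a continuous slider) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the embedding dimension, rank, matrix names `A`/`B`, and the random "learned" weights are all assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 768, 4  # example CLIP text-embedding dim and low rank (r << d)

# Hypothetical low-rank LoRA-style direction learned for one concept
# (e.g. "age"): delta(e) = B @ (A @ e). In the paper these would be
# trained; here they are random placeholders.
A = rng.normal(scale=0.02, size=(r, d))
B = rng.normal(scale=0.02, size=(d, r))

def slide(text_embedding: np.ndarray, scale: float) -> np.ndarray:
    """Apply the low-rank concept direction with a continuous slider scale.

    scale = 0 reproduces the original embedding exactly; the text
    encoder and the diffusion backbone remain frozen throughout.
    """
    return text_embedding + scale * (B @ (A @ text_embedding))

e = rng.normal(size=d)            # frozen text embedding of the prompt
e_strong = slide(e, scale=1.5)    # push the concept in one direction
e_weak = slide(e, scale=-1.5)     # or the opposite direction
assert np.allclose(slide(e, 0.0), e)  # slider at 0 is the identity
```

Because each concept contributes an additive low-rank delta, multi-concept composition would amount to summing several such deltas with independent scales, which is consistent with the composability the abstract claims.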