VSD2M: A Large-scale Vision-language Sticker Dataset for Multi-frame Animated Sticker Generation

πŸ“… 2024-12-11
πŸ›οΈ arXiv.org
πŸ“ˆ Citations: 1
✨ Influential: 1
πŸ“„ PDF
πŸ€– AI Summary
Social media users increasingly retrieve rather than create animated sticker content, largely because existing tools have limited capabilities and there is a shortage of data and methods for high-quality, customizable generation under low-frame-rate, highly abstract semantic constraints. Method: the paper proposes the first systematic solution for Animated Sticker Generation (ASG), comprising (i) VSD2M, a large-scale multimodal sticker dataset of two million samples spanning static and multi-frame animated stickers; (ii) a Spatial Temporal Interaction (STI) layer that improves semantic coherence and fine-grained detail preservation in discrete, low-frame-rate video generation; and (iii) the first dedicated ASG benchmark. Results: integrating the STI layer into state-of-the-art video generation models yields a 23% reduction in Fréchet Video Distance (FVD) and an 18% improvement in CLIP-Score, advancing controllable generation of abstract animated content.
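
The reported gains are measured with Fréchet Video Distance and CLIP-Score. As an illustration of the former, the sketch below computes the Fréchet distance between Gaussian fits of two feature sets; a full FVD pipeline would first extract those features with a pretrained I3D network, which is omitted here, and the helper name `frechet_distance` is hypothetical.

```python
# Sketch of the Frechet distance underlying FVD, computed between
# Gaussian fits of real and generated video features.
# Assumption: feature matrices are given (rows = videos, cols = dims);
# real FVD extracts them with a pretrained I3D network.
import numpy as np
from scipy.linalg import sqrtm


def frechet_distance(real_feats: np.ndarray, fake_feats: np.ndarray) -> float:
    mu_r, mu_f = real_feats.mean(0), fake_feats.mean(0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_f = np.cov(fake_feats, rowvar=False)
    # Matrix square root of the covariance product; tiny imaginary
    # residue from numerical error is discarded.
    covmean = sqrtm(cov_r @ cov_f)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean))


# Toy usage: two synthetic feature sets with slightly shifted means.
rng = np.random.default_rng(0)
print(frechet_distance(rng.normal(size=(512, 64)),
                       rng.normal(loc=0.1, size=(512, 64))))
```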

πŸ“ Abstract
As a common form of communication in social media, stickers are loved by users for their ability to convey emotions in a vivid, cute, and interesting way. People prefer to obtain an appropriate sticker through retrieval rather than creation, because creating a sticker is time-consuming and relies on rule-based creative tools with limited capabilities. Meanwhile, advanced text-to-video algorithms have spawned numerous general-purpose video generation systems that let users customize high-quality, photo-realistic videos from simple text prompts. However, creating customized animated stickers, which have lower frame rates and more abstract semantics than ordinary videos, is greatly hindered by difficult data acquisition and incomplete benchmarks. To facilitate research on animated sticker generation (ASG), we first construct VSD2M, the largest vision-language sticker dataset to date, containing two million static and animated stickers. Second, to improve the performance of conventional video generation methods on the discrete ASG task, we propose a Spatial Temporal Interaction (STI) layer that combines semantic interaction and detail preservation to address insufficient information utilization. Moreover, we train baselines with several video generation methods (e.g., transformer-based and diffusion-based methods) on VSD2M and conduct a detailed analysis to establish systematic references for the ASG task. To the best of our knowledge, this is the most comprehensive large-scale benchmark for multi-frame animated sticker generation, and we hope this work provides valuable inspiration for other scholars working on intelligent creation.
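
The abstract does not spell out the STI layer's internals, so the following is only a minimal sketch of one plausible reading: per-frame spatial self-attention interleaved with per-location temporal self-attention, so that sparse, semantically distant frames can exchange detail. `STIBlock` and all hyperparameters are hypothetical, not the paper's implementation.

```python
# Hypothetical spatial-temporal interaction block, assuming the common
# pattern of interleaved spatial and temporal self-attention.
import torch
import torch.nn as nn


class STIBlock(nn.Module):
    """Spatial attention within each frame, then temporal attention at
    each token position, so low-frame-rate clips can share detail."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, tokens, dim) with tokens = H*W patch embeddings
        b, t, n, d = x.shape

        # Spatial pass: tokens within each frame attend to each other.
        s = x.reshape(b * t, n, d)
        s_norm = self.norm1(s)
        s = s + self.spatial_attn(s_norm, s_norm, s_norm)[0]

        # Temporal pass: each token position attends across frames, which
        # matters when frames are few and semantically far apart.
        t_in = s.reshape(b, t, n, d).permute(0, 2, 1, 3).reshape(b * n, t, d)
        t_norm = self.norm2(t_in)
        t_out = t_in + self.temporal_attn(t_norm, t_norm, t_norm)[0]

        return t_out.reshape(b, n, t, d).permute(0, 2, 1, 3)


# Usage: 4 low-frame-rate frames, 64 patch tokens, 256-dim features.
x = torch.randn(2, 4, 64, 256)
print(STIBlock(256)(x).shape)  # torch.Size([2, 4, 64, 256])
```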
Problem

Research questions and friction points this paper is trying to address.

Creating animated stickers is hindered by data scarcity and incomplete benchmarks
Existing video generation methods underperform on abstract, low-frame-rate stickers
Lack of large-scale vision-language datasets for animated sticker generation research
Innovation

Methods, ideas, or system contributions that make the work stand out.

Constructs VSD2M, the largest vision-language sticker dataset (two million static and animated stickers)
Proposes a Spatial Temporal Interaction (STI) layer for discrete, low-frame-rate generation
Trains and analyzes baselines across several video generation methods (see the loader sketch after this list)
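
As a loose illustration of what training baselines on VSD2M-style data involves, below is a minimal sketch of a multi-frame sticker loader. The `(gif_path, caption)` manifest format and all class names are hypothetical; the dataset's actual on-disk layout is not described on this page.

```python
# Minimal sketch of loading multi-frame sticker samples for baseline
# training, assuming a hypothetical CSV manifest with gif_path/caption.
import csv

import numpy as np
import torch
from PIL import Image
from torch.utils.data import Dataset


class StickerDataset(Dataset):
    def __init__(self, manifest_csv: str, num_frames: int = 8, size: int = 256):
        with open(manifest_csv, newline="") as f:
            self.items = list(csv.DictReader(f))  # columns: gif_path, caption
        self.num_frames = num_frames
        self.size = size

    def __len__(self) -> int:
        return len(self.items)

    def __getitem__(self, idx: int):
        item = self.items[idx]
        gif = Image.open(item["gif_path"])
        # Sample frames evenly; animated stickers are short and low-FPS,
        # so a handful of frames usually covers the whole clip.
        total = getattr(gif, "n_frames", 1)
        picks = [round(i * (total - 1) / max(self.num_frames - 1, 1))
                 for i in range(self.num_frames)]
        frames = []
        for p in picks:
            gif.seek(p)
            rgb = gif.convert("RGB").resize((self.size, self.size))
            frames.append(torch.from_numpy(np.asarray(rgb))
                          .permute(2, 0, 1).float() / 255.0)
        # Returns (num_frames, 3, size, size) tensor plus the caption.
        return torch.stack(frames), item["caption"]
```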
πŸ”Ž Similar Papers
No similar papers found.