🤖 AI Summary
Text-to-data generation in low-resource domains—such as molecular conformations, motion trajectories, and multivariate time series—is severely hindered by the scarcity of text annotations. Method: We propose an unsupervised diffusion modeling framework: (i) a generic diffusion model pretrained solely on unlabeled data; (ii) a constraint-optimized controllable fine-tuning mechanism that preserves the original generative capacity while enabling fine-grained text conditioning; and (iii) a text–latent space alignment strategy to enhance semantic consistency. Contribution/Results: To our knowledge, this is the first approach to achieve high-fidelity, text-driven generation across multiple low-resource modalities without relying on labeled paired data. It significantly outperforms existing supervised and weakly supervised baselines on molecular conformation generation, human motion synthesis, and multivariate time-series generation, while effectively mitigating catastrophic forgetting. Our framework establishes a novel paradigm for cross-modal generation under fully unsupervised conditions.
📝 Abstract
Natural language serves as a common and straightforward signal for humans to interact seamlessly with machines. Recognizing the importance of this interface, the machine learning community is investing considerable effort in generating data that is semantically coherent with textual instructions. While strides have been made in text-to-data generation spanning image editing, audio synthesis, video creation, and beyond, low-resource areas characterized by expensive annotations or complex data structures, such as molecules, motion dynamics, and time series, often lack textual labels. This deficiency impedes supervised learning, thereby constraining the application of advanced generative models to text-to-data tasks. In response to these challenges in the low-resource scenario, we propose Text2Data, a novel approach that utilizes unlabeled data to learn the underlying data distribution through an unsupervised diffusion model. The pretrained model then undergoes controllable fine-tuning via a novel constraint-optimization-based learning objective that ensures controllability and effectively counteracts catastrophic forgetting. Comprehensive experiments demonstrate that Text2Data achieves enhanced controllability across various modalities, including molecules, motions, and time series, when compared to existing baselines.
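The constraint-optimization idea described above can be illustrated with a minimal sketch: minimize the text-conditional loss while constraining the unconditional loss to stay near its pretrained value, relaxed into a single objective via a Lagrange-multiplier penalty. The names `l_cond`, `l_uncond`, `xi`, and `lam` below are illustrative assumptions, not the paper's exact notation or implementation.

```python
def constrained_finetune_loss(l_cond: float, l_uncond: float,
                              xi: float, lam: float) -> float:
    """Sketch of a constraint-optimized fine-tuning objective.

    Minimize the text-conditional diffusion loss `l_cond` subject to the
    unconditional loss `l_uncond` not exceeding a budget `xi` (e.g., its
    value under the pretrained model), relaxed via a Lagrange multiplier
    `lam`. The hinge term penalizes the model only when the constraint is
    violated, which is what counteracts catastrophic forgetting: drifting
    away from the pretrained (unconditional) distribution is taxed.
    """
    constraint_violation = max(0.0, l_uncond - xi)
    return l_cond + lam * constraint_violation


# Toy usage: constraint satisfied -> penalty-free conditional loss.
print(constrained_finetune_loss(l_cond=1.0, l_uncond=0.5, xi=0.6, lam=10.0))
# Constraint violated -> conditional loss plus a forgetting penalty.
print(constrained_finetune_loss(l_cond=1.0, l_uncond=0.9, xi=0.6, lam=10.0))
```

In practice both losses would be computed from diffusion denoising errors on a batch, and `lam` could be updated by dual ascent, but the scalar form above captures the trade-off the objective encodes.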