🤖 AI Summary
This work addresses two prevalent issues in existing text-to-singing generation systems: the two output modalities are modeled in isolation, and the resulting vocals and motion are inconsistent with each other. Methodologically, it introduces the first end-to-end framework for jointly generating singing vocals and 3D full-body motion from text. Specifically: (1) it constructs RapVerse, the first large-scale synchronized multimodal rap dataset; (2) it proposes a unified discrete modeling framework, employing a VQ-VAE to quantize motion and a vocal-to-unit model to extract phonetic, prosodic, and speaker-identity tokens; and (3) it adopts a multimodal autoregressive Transformer for joint sequence modeling. Experiments demonstrate that the framework significantly improves temporal alignment and semantic consistency between vocal and motion outputs without compromising single-modality quality, matching state-of-the-art specialized single-modality systems and establishing a new benchmark for cross-modal singing generation.
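The discrete-token idea at the core of the framework can be illustrated with a minimal vector-quantization step, as in a VQ-VAE encoder: each continuous motion-frame feature is mapped to the index of its nearest codebook entry. This is a toy NumPy sketch with invented shapes and names, not the paper's implementation.

```python
import numpy as np

def vq_encode(features, codebook):
    """Map each continuous feature vector to the index of its
    nearest codebook entry (squared Euclidean distance)."""
    # features: (T, D) motion-frame features; codebook: (K, D)
    # Pairwise squared distances between frames and codes: (T, K)
    d2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)  # (T,) discrete motion tokens

def vq_decode(tokens, codebook):
    """Look the discrete tokens back up in the codebook."""
    return codebook[tokens]

# Toy example: 4 frames of 3-D features, a codebook of 8 codes.
rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 3))
codes = rng.normal(size=(8, 3))
tokens = vq_encode(feats, codes)
recon = vq_decode(tokens, codes)
assert tokens.shape == (4,) and recon.shape == (4, 3)
```

In training, the decoder reconstructs the motion sequence from these tokens, so the token stream becomes a compact discrete stand-in for continuous motion that a language-model-style Transformer can predict.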
📝 Abstract
In this work, we introduce a challenging task for simultaneously generating 3D holistic body motions and singing vocals directly from textual lyrics, advancing beyond existing works that typically address these two modalities in isolation. To facilitate this, we first collect the RapVerse dataset, a large dataset containing synchronous rapping vocals, lyrics, and high-quality 3D holistic body meshes. With the RapVerse dataset, we investigate the extent to which scaling autoregressive multimodal transformers across language, audio, and motion can enhance the coherent and realistic generation of vocals and whole-body human motions. For modality unification, a vector-quantized variational autoencoder is employed to encode whole-body motion sequences into discrete motion tokens, while a vocal-to-unit model is leveraged to obtain quantized audio tokens preserving content, prosodic information, and singer identity. By jointly performing transformer modeling on these three modalities in a unified way, our framework ensures a seamless and realistic blend of vocals and human motions. Extensive experiments demonstrate that our unified generation framework not only produces coherent and realistic singing vocals alongside human motions directly from textual inputs but also rivals the performance of specialized single-modality generation systems, establishing new benchmarks for joint vocal-motion generation. The project page is available for research purposes at https://vis-www.cs.umass.edu/RapVerse.
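The "unified way" of modeling the three modalities can be sketched as mapping each modality's token ids into disjoint ranges of one shared vocabulary and concatenating them into a single sequence for the autoregressive Transformer. The vocabulary sizes and the block-wise ordering below are hypothetical illustrations; the actual interleaving scheme is a design choice of the system.

```python
# Hypothetical vocabulary sizes for each modality (not the paper's values).
TEXT_VOCAB, AUDIO_VOCAB, MOTION_VOCAB = 1000, 500, 512

def unify(text_ids, audio_ids, motion_ids):
    """Offset each modality's token ids into disjoint ranges of a
    shared vocabulary, then concatenate into one sequence that an
    autoregressive Transformer can model left to right."""
    text = list(text_ids)                                  # [0, TEXT_VOCAB)
    audio = [TEXT_VOCAB + a for a in audio_ids]            # next id block
    motion = [TEXT_VOCAB + AUDIO_VOCAB + m for m in motion_ids]
    return text + audio + motion

seq = unify([3, 7], [12], [0, 1])
# Audio and motion tokens now occupy non-overlapping id ranges,
# so one softmax over the shared vocabulary covers all modalities.
assert seq == [3, 7, 1012, 1500, 1501]
```

Because every modality lives in the same id space, a single next-token objective trains the model to emit vocals and motion jointly conditioned on the lyrics, which is what enforces their temporal and semantic alignment.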