VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset

📅 2023-04-17
🏛️ IEEE Transactions on Pattern Analysis and Machine Intelligence
📈 Citations: 76
Influential: 11
🤖 AI Summary
This work addresses the challenge of jointly modeling vision, audio, and language for multimodal perception. The authors propose the first end-to-end vision-audio-language unified pretraining framework, built on a three-encoder, one-decoder architecture that combines contrastive learning with autoregressive generation. Two novel pretraining tasks are introduced: Multimodal Grouping Alignment (MGA), for fine-grained cross-modal alignment, and Multimodal Grouping Captioning (MGC), for controllable multimodal generation. The authors also construct and publicly release VALOR-1M, a high-quality, million-scale audiovisual captioning dataset. Experiments demonstrate state-of-the-art performance across diverse downstream tasks, including cross-modal retrieval, audiovisual captioning, and multimodal question answering, with new SOTA results on AudioCaps, Clotho, and VGGSound QA. All code and data are open-sourced.
📝 Abstract
In this paper, we propose the Vision-Audio-Language Omni-peRception pretraining model (VALOR) for multimodal understanding and generation. Unlike widely-studied vision-language pretraining models, VALOR jointly models the relationships among vision, audio, and language in an end-to-end manner. It consists of three separate encoders for single modality representations and a decoder for multimodal conditional text generation. We design two pretext tasks to pretrain the VALOR model: Multimodal Grouping Alignment (MGA) and Multimodal Grouping Captioning (MGC). MGA projects vision, language, and audio into the same common space, simultaneously building vision-language, audio-language, and audiovisual-language alignment. MGC learns to generate text tokens under conditions of vision, audio, or both. To promote vision-audio-language pretraining research, we construct a large-scale, high-quality tri-modality dataset named VALOR-1M, containing 1 million audible videos with human-annotated audiovisual captions. Extensive experiments show that VALOR can learn strong multimodal correlations and generalize to various downstream tasks (e.g., retrieval, captioning, and question answering) with different input modalities (e.g., vision-language, audio-language, and audiovisual-language). VALOR achieves new state-of-the-art performance on a series of public cross-modality benchmarks. Code and data are available on the project page at https://casia-iva-group.github.io/projects/VALOR.
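To make the MGA objective concrete, here is a minimal, illustrative sketch of group-wise contrastive alignment in the spirit the abstract describes: paired embeddings from each modality group (vision-text, audio-text, and audiovisual-text) are pulled together in a shared space with a symmetric InfoNCE loss. All function names, the temperature value, and the additive audiovisual fusion are assumptions for illustration, not the paper's actual implementation.

```python
import numpy as np

def l2norm(x):
    """Normalize rows to unit length so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def contrastive_loss(a, b, temp=0.07):
    """Symmetric InfoNCE over a batch of paired embeddings (temp is a guess)."""
    logits = l2norm(a) @ l2norm(b).T / temp   # pairwise cosine similarities
    labels = np.arange(len(a))                # matching pairs lie on the diagonal

    def ce(lg):
        # Numerically stable cross-entropy against the diagonal targets.
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    return 0.5 * (ce(logits) + ce(logits.T))

def mga_loss(v, a, t):
    """Average alignment loss over three modality groups: vision-text,
    audio-text, and audiovisual-text. The audiovisual embedding is formed
    here by simple additive fusion -- a hypothetical choice, not the paper's."""
    av = l2norm(v) + l2norm(a)
    return (contrastive_loss(v, t)
            + contrastive_loss(a, t)
            + contrastive_loss(av, t)) / 3.0
```

Under this framing, MGC would reuse the same grouping idea on the decoder side: the text decoder is trained to generate caption tokens conditioned on the vision group, the audio group, or the fused audiovisual group.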
Problem

Research questions and friction points this paper is trying to address.

Multimodal Learning
Cross-modal Association
Artificial Intelligence

Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal Understanding
Cross-modal Association Learning
Large-scale Video Dataset
Sihan Chen
School of Artificial Intelligence, University of Chinese Academy of Sciences and National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences
Xingjian He
Institute of Automation, Chinese Academy of Sciences (CASIA)
computer vision, semantic segmentation
Longteng Guo
National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences
Xinxin Zhu
National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences
Weining Wang
National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences
Jinhui Tang
Nanjing University of Science and Technology, School of Computer Science and Engineering
Jing Liu
School of Artificial Intelligence, University of Chinese Academy of Sciences and National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences