AI Summary
In video quality assessment (VQA), the scarcity of high-quality human annotations and the limited scale of existing datasets hinder the realization of scaling laws' potential. Method: This paper proposes OmniVQA, a human-in-the-loop framework that introduces the first large-scale instruction-tuning paradigm for VQA. It constructs OmniVQA-Chat-400K, the largest multimodal instruction dataset to date, and OmniVQA-MOS-20K, a quantitative Mean Opinion Score (MOS) dataset, which jointly cover the technical and aesthetic quality dimensions at fine granularity. The framework integrates fine-grained prompt engineering, complementary multi-task learning, and the OmniVQA-FG benchmark to unify fine-grained quality understanding and MOS prediction. Contribution/Results: Experiments demonstrate state-of-the-art performance on both tasks; OmniVQA-FG further validates superior fine-grained discriminative capability, establishing new benchmarks for scalable, interpretable VQA.
Abstract
The data scaling law has been shown to significantly enhance the performance of large multi-modal models (LMMs) across various downstream tasks. However, in the domain of perceptual video quality assessment (VQA), the potential of the scaling law remains unrealized due to the scarcity of labeled resources and the insufficient scale of datasets. To address this, we propose OmniVQA, a framework designed to efficiently build high-quality, human-in-the-loop VQA multi-modal instruction databases (MIDBs). We then scale up to create OmniVQA-Chat-400K, currently the largest MIDB in the VQA field. Our focus is on the technical and aesthetic quality dimensions, with abundant in-context instruction data providing fine-grained VQA knowledge. Additionally, we build the OmniVQA-MOS-20K dataset to enhance the model's quantitative quality rating capabilities. We then introduce a complementary training strategy that effectively leverages the knowledge from both datasets across the quality understanding and quality rating tasks. Furthermore, we propose the OmniVQA-FG (fine-grain) Benchmark to evaluate the fine-grained performance of the models. Our results demonstrate that our models achieve state-of-the-art performance in both quality understanding and rating tasks.