Scaling-up Perceptual Video Quality Assessment

📅 2025-05-28
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
In video quality assessment (VQA), the scarcity of high-quality human annotations and limited data scale hinder the realization of scaling laws' potential. Method: This paper proposes OmniVQA, a human-in-the-loop framework that introduces the first large-scale instruction-tuning paradigm for VQA. It constructs OmniVQA-Chat-400K, the largest multimodal instruction dataset to date, and OmniVQA-MOS-20K, a quantitative Mean Opinion Score (MOS) dataset, which jointly cover the technical and aesthetic dimensions at fine granularity. The framework integrates fine-grained prompt engineering, complementary multi-task learning, and the OmniVQA-FG benchmark to unify fine-grained quality understanding and MOS prediction. Contribution/Results: Experiments demonstrate state-of-the-art performance on both tasks; OmniVQA-FG further validates superior fine-grained discriminative capability, establishing new benchmarks for scalable, interpretable VQA.

๐Ÿ“ Abstract
The data scaling law has been shown to significantly enhance the performance of large multi-modal models (LMMs) across various downstream tasks. However, in the domain of perceptual video quality assessment (VQA), the potential of the scaling law remains unrealized due to the scarcity of labeled resources and the insufficient scale of datasets. To address this, we propose OmniVQA, an efficient framework designed to build high-quality, human-in-the-loop VQA multi-modal instruction databases (MIDBs). We then scale up to create OmniVQA-Chat-400K, currently the largest MIDB in the VQA field. Our focus is on the technical and aesthetic quality dimensions, with abundant in-context instruction data to provide fine-grained VQA knowledge. Additionally, we have built the OmniVQA-MOS-20K dataset to enhance the model's quantitative quality rating capabilities. We then introduce a complementary training strategy that effectively leverages the knowledge from both datasets for the quality understanding and quality rating tasks. Furthermore, we propose the OmniVQA-FG (fine-grain) Benchmark to evaluate the fine-grained performance of the models. Our results demonstrate that our models achieve state-of-the-art performance in both quality understanding and rating tasks.
Problem

Research questions and friction points this paper is trying to address.

Addressing scarcity of labeled data in video quality assessment
Scaling up multi-modal datasets for perceptual VQA tasks
Enhancing model performance in quality understanding and rating
Innovation

Methods, ideas, or system contributions that make the work stand out.

OmniVQA framework for multi-modal VQA databases
Complementary training strategy for quality tasks
OmniVQA-FG-Benchmark for fine-grained evaluation
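The quality rating task above is conventionally scored against human Mean Opinion Scores. A minimal sketch of that standard evaluation, assuming the SRCC/PLCC correlation metrics commonly used in VQA (the scores below are illustrative, not taken from the paper):

```python
# Hypothetical sketch: scoring a VQA model's MOS predictions against
# human Mean Opinion Scores with the two correlations standard in the
# field: SRCC (rank agreement) and PLCC (linear agreement).
from scipy.stats import spearmanr, pearsonr

ground_truth_mos = [1.2, 2.5, 3.1, 4.0, 4.8]  # human MOS labels (illustrative)
predicted_mos = [1.0, 2.7, 3.0, 4.2, 4.6]     # model outputs (illustrative)

srcc, _ = spearmanr(ground_truth_mos, predicted_mos)  # monotonic rank agreement
plcc, _ = pearsonr(ground_truth_mos, predicted_mos)   # linear correlation

print(f"SRCC={srcc:.3f}, PLCC={plcc:.3f}")
```

Higher values of both metrics indicate closer agreement with human judgments; SRCC is insensitive to any monotonic rescaling of the model's scores, while PLCC also rewards calibrated magnitudes.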
Ziheng Jia
Shanghai Jiaotong University / Shanghai AILab
LLM and LMM on Visual Quality Assessment
Zicheng Zhang
Shanghai Jiaotong University
Zeyu Zhang
Shanghai Jiaotong University
Yingji Liang
East China Normal University
Xiaorong Zhu
Shanghai Jiaotong University
Chunyi Li
NTU | SJTU | Shanghai AI Lab
Generative AI, Embodied AI, Low-level Vision
Jinliang Han
Shanghai Jiaotong University
Haoning Wu
Shanghai Jiao Tong University
Computer Vision, Multi-modal Learning, Generative Models
Bin Wang
Media Experience and Evaluation Lab, Huawei Technologies
Haoran Zhang
Media Experience and Evaluation Lab, Huawei Technologies
Guanyu Zhu
Research Staff Member, IBM T.J. Watson Research Center
Quantum information, quantum error correction, quantum matter, topological order, quantum complexity
Qiyong Zhao
Media Experience and Evaluation Lab, Huawei Technologies
Xiaohong Liu
Shanghai Jiaotong University
Xiongkuo Min
Shanghai Jiaotong University
Guangtao Zhai
Professor, IEEE Fellow, Shanghai Jiao Tong University
Multimedia Signal Processing, Visual Quality Assessment, QoE, AI Evaluation, Displays