Scaling-up Perceptual Video Quality Assessment

📅 2025-05-28
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
In video quality assessment (VQA), the scarcity of high-quality human annotations and limited data scale hinder the realization of scaling laws' potential. Method: This paper proposes OmniVQA, a human-in-the-loop framework that introduces the first large-scale instruction-tuning paradigm for VQA. It constructs OmniVQA-Chat-400K, the largest multimodal instruction dataset to date, and OmniVQA-MOS-20K, a quantitative Mean Opinion Score (MOS) dataset, which jointly cover the technical and aesthetic dimensions at fine granularity. The framework integrates fine-grained prompt engineering, complementary multi-task learning, and the OmniVQA-FG benchmark to unify fine-grained quality understanding and MOS prediction. Contribution/Results: Experiments demonstrate state-of-the-art performance on both tasks; OmniVQA-FG further validates superior fine-grained discriminative capability, establishing new benchmarks for scalable, interpretable VQA.

๐Ÿ“ Abstract
The data scaling law has been shown to significantly enhance the performance of large multi-modal models (LMMs) across various downstream tasks. However, in the domain of perceptual video quality assessment (VQA), the potential of the scaling law remains unrealized due to the scarcity of labeled resources and the insufficient scale of datasets. To address this, we propose OmniVQA, an efficient framework designed to build high-quality, human-in-the-loop VQA multi-modal instruction databases (MIDBs). We then scale up to create OmniVQA-Chat-400K, currently the largest MIDB in the VQA field. Our focus is on the technical and aesthetic quality dimensions, with abundant in-context instruction data to provide fine-grained VQA knowledge. Additionally, we have built the OmniVQA-MOS-20K dataset to enhance the model's quantitative quality rating capabilities. We then introduce a complementary training strategy that effectively leverages the knowledge from both datasets for the quality understanding and quality rating tasks. Furthermore, we propose the OmniVQA-FG (fine-grain) Benchmark to evaluate the fine-grained performance of the models. Our results demonstrate that our models achieve state-of-the-art performance in both quality understanding and rating tasks.
Problem

Research questions and friction points this paper is trying to address.

Addressing scarcity of labeled data in video quality assessment
Scaling up multi-modal datasets for perceptual VQA tasks
Enhancing model performance in quality understanding and rating
Innovation

Methods, ideas, or system contributions that make the work stand out.

OmniVQA framework for multi-modal VQA databases
Complementary training strategy for quality tasks
OmniVQA-FG-Benchmark for fine-grained evaluation
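The quality rating task above is conventionally scored against human Mean Opinion Scores. A minimal sketch of that standard evaluation, assuming the SRCC/PLCC correlation metrics commonly used in VQA (the scores below are illustrative, not taken from the paper):

```python
# Hypothetical sketch: scoring a VQA model's MOS predictions against
# human Mean Opinion Scores with the two correlations standard in the
# field: SRCC (rank agreement) and PLCC (linear agreement).
from scipy.stats import spearmanr, pearsonr

ground_truth_mos = [1.2, 2.5, 3.1, 4.0, 4.8]  # human MOS labels (illustrative)
predicted_mos = [1.0, 2.7, 3.0, 4.2, 4.6]     # model outputs (illustrative)

srcc, _ = spearmanr(ground_truth_mos, predicted_mos)  # monotonic rank agreement
plcc, _ = pearsonr(ground_truth_mos, predicted_mos)   # linear correlation

print(f"SRCC={srcc:.3f}, PLCC={plcc:.3f}")
```

Higher values of both metrics indicate closer agreement with human judgments; SRCC is insensitive to any monotonic rescaling of the model's scores, while PLCC also rewards calibrated magnitudes.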
Ziheng Jia
Shanghai Jiaotong University / Shanghai AILab
LLM and LMM on Visual Quality Assessment
Zicheng Zhang
Shanghai Jiaotong University
Zeyu Zhang
Shanghai Jiaotong University
Yingji Liang
East China Normal University
Xiaorong Zhu
Shanghai Jiaotong University
Chunyi Li
NTU | SJTU | Shanghai AI Lab
Generative AI, Embodied AI, Low-level Vision
Jinliang Han
Shanghai Jiaotong University
Haoning Wu
Shanghai Jiao Tong University
Computer Vision, Multi-modal Learning, Generative Models
Bin Wang
Media Experience and Evaluation Lab, Huawei Technologies
Haoran Zhang
Media Experience and Evaluation Lab, Huawei Technologies
Guanyu Zhu
Research Staff Member, IBM T.J. Watson Research Center
Quantum information, quantum error correction, quantum matter, topological order, quantum complexity
Qiyong Zhao
Media Experience and Evaluation Lab, Huawei Technologies
Xiaohong Liu
Shanghai Jiaotong University
Xiongkuo Min
Shanghai Jiaotong University
Guangtao Zhai
Professor, IEEE Fellow, Shanghai Jiao Tong University
Multimedia Signal Processing, Visual Quality Assessment, QoE, AI Evaluation, Displays