Domain Adaptation of VLM for Soccer Video Understanding

📅 2025-05-20
🤖 AI Summary
Problem: General-purpose vision-language models (VLMs) transfer poorly to domain-specific sports video understanding. Method: This paper proposes a curriculum-learning adaptation approach for specialized domains, using soccer as a case study. It (1) uses large language models (LLMs) to automatically generate soccer video instruction data; (2) designs a staged curriculum that progresses from recognizing key soccer concepts to question answering; and (3) combines soccer-specific video clip annotations with multimodal alignment fine-tuning. Results: On soccer visual question answering, the adapted model achieves a 37.5% relative improvement over the base model, and downstream action classification accuracy rises from 11.8% to 63.5%. The results indicate that an open-source VLM can be adapted to a specialized domain with a curated dataset of about 20k video clips and curriculum learning, an approach that may carry over to other specialized domains.

📝 Abstract
Vision Language Models (VLMs) have demonstrated strong performance in multi-modal tasks by effectively aligning visual and textual representations. However, most video understanding VLM research has been domain-agnostic, leaving their ability to transfer to specialized domains under-explored. In this work, we address this gap by exploring the adaptability of open-source VLMs to specific domains, focusing on soccer as an initial case study. Our approach uses large-scale soccer datasets and an LLM to create instruction-following data, and uses this data to iteratively fine-tune the general-domain VLM in a curriculum learning fashion (first teaching the model key soccer concepts, then question-answering tasks). The final adapted model, trained on a curated dataset of 20k video clips, improves substantially on soccer-specific tasks compared to the base model, with a 37.5% relative improvement on the visual question-answering task and an accuracy improvement from 11.8% to 63.5% on the downstream soccer action classification task.
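The two headline results are stated in different terms: the question-answering gain is relative, while the classification result is given as before and after accuracies. The short sketch below shows how the two kinds of figure relate, using only the classification accuracies quoted in the abstract. The function and variable names are illustrative, and the base question-answering score is not given here, so it is not reconstructed.

```python
def relative_improvement(base: float, adapted: float) -> float:
    """Gain of `adapted` over `base`, expressed as a fraction of `base`."""
    if base <= 0:
        raise ValueError("base must be positive")
    return (adapted - base) / base


# Action classification accuracies quoted in the abstract.
base_acc, adapted_acc = 0.118, 0.635

# Absolute gain, in percentage points.
absolute_gain_points = (adapted_acc - base_acc) * 100

# Relative gain, as a multiple of the base accuracy.
relative_gain = relative_improvement(base_acc, adapted_acc)

print(f"absolute gain: {absolute_gain_points:.1f} percentage points")
print(f"relative gain: {relative_gain:.2f}x the base accuracy")
```

For the classification task this works out to a gain of 51.7 percentage points, or about 4.38 times the base accuracy, so the 37.5% relative figure for question answering and the classification jump are not directly comparable.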
Problem

Research questions and friction points this paper is trying to address.

Exploring VLM adaptability to specialized domains like soccer
Improving soccer video understanding via domain-specific fine-tuning
Enhancing VLM performance on soccer tasks with curated data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses large-scale soccer datasets for domain adaptation
Employs LLM to create instruction-following training data
Applies curriculum learning for iterative VLM fine-tuning
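The staged fine-tuning idea listed above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' pipeline: the `Stage` class, the fixed question template (a stand-in for the LLM-generated instructions), the clip identifier, and the `train_fn` callback are all hypothetical, and the training step is a stub that only records the order in which stages are visited.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class Stage:
    """One curriculum stage: a name plus its instruction-following examples."""
    name: str
    examples: List[Dict[str, str]]


def annotation_to_instruction(clip_id: str, label: str) -> Dict[str, str]:
    """Turn a clip annotation into an instruction record.

    Hypothetical fixed template; the paper uses an LLM to write the instructions.
    """
    return {
        "clip": clip_id,
        "instruction": "Which soccer event occurs in this clip?",
        "response": label,
    }


def run_curriculum(stages: List[Stage],
                   train_fn: Callable[[Any, Stage], Any],
                   state: Any) -> Any:
    """Fine-tune stage by stage, carrying the model state forward each time."""
    for stage in stages:
        state = train_fn(state, stage)
    return state


# Stage 1 teaches concepts; stage 2 moves on to question answering.
concept_stage = Stage("concepts", [annotation_to_instruction("clip_001", "corner kick")])
qa_stage = Stage("question_answering", [{
    "clip": "clip_001",
    "instruction": "What happens after the corner kick?",
    "response": "(answer derived from the clip annotation)",
}])


def record_stage(history: list, stage: Stage) -> list:
    """Stub standing in for a real fine-tuning step: logs name and example count."""
    return history + [(stage.name, len(stage.examples))]


history = run_curriculum([concept_stage, qa_stage], record_stage, [])
print(history)
```

In a real setup `train_fn` would run a fine-tuning pass on the VLM and return the updated weights, so that each stage starts from the model produced by the previous one.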
Authors

Tiancheng Jiang, Massachusetts Institute of Technology
Henry Wang, Amazon (Sports, Multimodal, Generative AI)
Md Sirajus Salekin, Amazon Web Services
Parmida Atighehchian, Amazon Web Services
Shinan Zhang, Amazon Web Services