Domain Adaptation of VLM for Soccer Video Understanding

📅 2025-05-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

General-purpose vision-language models (VLMs) transfer poorly to domain-specific sports video understanding. Method: This paper proposes a curriculum learning-based adaptation paradigm for vertical domains, using soccer as a case study. It (1) uses large language models (LLMs) to automatically generate soccer video instruction data; (2) designs a staged curriculum that progresses from recognizing key soccer concepts to question answering; and (3) combines soccer-specific video clip annotations with multimodal alignment fine-tuning. Results: On soccer visual question answering, the adapted model achieves a 37.5% relative improvement over the base model, and action classification accuracy rises from 11.8% to 63.5%. The work shows that an open-source VLM can be adapted to a specialized domain using a curated set of about 20k video clips and curriculum learning, an approach that may carry over to other vertical domains.

📝 Abstract

Vision Language Models (VLMs) have demonstrated strong performance in multi-modal tasks by effectively aligning visual and textual representations. However, most video understanding VLM research has been domain-agnostic, leaving their ability to transfer to specialized domains under-explored. In this work, we address this gap by exploring the adaptability of open-source VLMs to specific domains, focusing on soccer as an initial case study. Our approach uses large-scale soccer datasets and an LLM to create instruction-following data, and uses that data to iteratively fine-tune the general-domain VLM in a curriculum learning fashion (first teaching the model key soccer concepts, then moving on to question-answering tasks). The final adapted model, trained using a curated dataset of 20k video clips, exhibits significant improvement in soccer-specific tasks compared to the base model, with a 37.5% relative improvement for the visual question-answering task and an accuracy improvement from 11.8% to 63.5% for the downstream soccer action classification task.
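The abstract reports one result as a relative improvement (37.5% for visual question answering) and the other as a pair of absolute accuracies (11.8% to 63.5%), so the two figures are not directly comparable. The short sketch below works through the arithmetic for the action classification numbers only; the helper names are illustrative and not from the paper. The base score for visual question answering is not given here, so its absolute gain cannot be derived.

```python
def absolute_gain(before: float, after: float) -> float:
    """Gain in percentage points (after minus before)."""
    return after - before


def relative_gain(before: float, after: float) -> float:
    """Gain expressed as a percentage of the starting value."""
    return (after - before) / before * 100.0


# Action classification accuracy as reported in the abstract.
base_acc, adapted_acc = 11.8, 63.5

print(round(absolute_gain(base_acc, adapted_acc), 1))  # 51.7 percentage points
print(round(relative_gain(base_acc, adapted_acc), 1))  # 438.1 percent relative
```

Read this way, the classification result is a gain of 51.7 percentage points, which corresponds to a relative improvement of roughly 438%, a much larger relative change than the 37.5% reported for visual question answering.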

Problem

Research questions and friction points this paper is trying to address.

Exploring VLM adaptability to specialized domains like soccer

Improving soccer video understanding via domain-specific fine-tuning

Enhancing VLM performance on soccer tasks with curated data

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses large-scale soccer datasets for domain adaptation

Employs LLM to create instruction-following training data

Applies curriculum learning for iterative VLM fine-tuning
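The curriculum idea above can be sketched as a loop that fine-tunes on one stage at a time, from simple to complex. This is a minimal illustration under stated assumptions, not the authors' implementation: the stage names, the `Sample` fields, the clip identifiers, and the `toy_fine_tune` stand-in (which only records what it was shown instead of training a model) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical stage names, ordered from simple to complex.
CURRICULUM: List[str] = ["concepts", "question_answering"]


@dataclass(frozen=True)
class Sample:
    clip_id: str      # made-up identifier of a soccer video clip
    instruction: str  # prompt, e.g. generated by an LLM
    response: str     # target answer
    stage: str        # curriculum stage this sample belongs to


def group_by_stage(samples: List[Sample]) -> Dict[str, List[Sample]]:
    """Bucket samples by stage, keeping their original order within a stage."""
    buckets: Dict[str, List[Sample]] = {stage: [] for stage in CURRICULUM}
    for sample in samples:
        buckets[sample.stage].append(sample)
    return buckets


def run_curriculum(
    checkpoint: List[str],
    samples: List[Sample],
    fine_tune: Callable[[List[str], List[Sample]], List[str]],
) -> List[str]:
    """Fine-tune stage by stage; each stage starts from the previous result."""
    buckets = group_by_stage(samples)
    for stage in CURRICULUM:
        checkpoint = fine_tune(checkpoint, buckets[stage])
    return checkpoint


def toy_fine_tune(checkpoint: List[str], batch: List[Sample]) -> List[str]:
    """Stand-in for a real training loop: only logs which samples were seen."""
    return checkpoint + [f"{s.stage}:{s.clip_id}" for s in batch]


samples = [
    Sample("clip_002", "Who is likely to receive the pass?", "The winger.",
           "question_answering"),
    Sample("clip_001", "What event is shown?", "A corner kick.", "concepts"),
]
history = run_curriculum([], samples, toy_fine_tune)
print(history)  # ['concepts:clip_001', 'question_answering:clip_002']
```

The point of the sketch is the ordering: the concept-level sample is consumed before the question-answering sample regardless of input order, and each stage continues from the checkpoint produced by the previous one.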

🔎 Similar Papers

Ego-VPA: Egocentric Video Understanding with Parameter-efficient Adaptation