🤖 AI Summary
General-purpose vision-language models (VLMs) transfer poorly to domain-specific sports video understanding. Method: This paper proposes a curriculum learning-based adaptation paradigm for vertical domains, using soccer as a case study. It (1) leverages large language models (LLMs) to automatically generate soccer video instruction data; (2) designs a staged curriculum that progresses from recognizing key soccer concepts to question answering; and (3) integrates soccer-specific video clip annotations with multimodal alignment fine-tuning. Results: On soccer visual question answering, the adapted model achieves a 37.5% relative improvement over the base model, and action classification accuracy rises from 11.8% to 63.5%. The work indicates that open-source VLMs can be adapted to a specialized domain using a curated dataset of 20k video clips and curriculum learning, and it offers a reusable framework for vertical-domain VLM customization.
📝 Abstract
Vision Language Models (VLMs) have demonstrated strong performance in multi-modal tasks by effectively aligning visual and textual representations. However, most video understanding VLM research has been domain-agnostic, leaving their ability to transfer to specialized domains under-explored. In this work, we address this gap by exploring the adaptability of open-source VLMs to specific domains, focusing on soccer as an initial case study. Our approach uses large-scale soccer datasets and an LLM to create instruction-following data, which is then used to iteratively fine-tune the general-domain VLM in a curriculum learning fashion (first teaching the model key soccer concepts, then moving on to question-answering tasks). The final adapted model, trained on a curated dataset of 20k video clips, shows significant improvement on soccer-specific tasks compared to the base model, with a 37.5% relative improvement on the visual question-answering task and an accuracy improvement from 11.8% to 63.5% on the downstream soccer action classification task.
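The staged fine-tuning described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the stage names, the `InstructionExample` fields, and the `fine_tune` callback are all hypothetical, and the only thing taken from the abstract is the ordering (soccer concepts first, question answering second).

```python
from dataclasses import dataclass
from typing import Callable, Iterator, List, Sequence, Tuple

# Hypothetical stage names. The abstract only states that key soccer
# concepts are taught before question-answering tasks.
STAGES: Tuple[str, ...] = ("concepts", "question_answering")


@dataclass(frozen=True)
class InstructionExample:
    """One instruction-following sample (all fields are assumed)."""
    clip_id: str      # identifier of the video clip
    instruction: str  # prompt shown to the model
    response: str     # target answer
    stage: str        # curriculum stage this sample belongs to


def curriculum_order(
    examples: Sequence[InstructionExample],
    stages: Sequence[str] = STAGES,
) -> Iterator[Tuple[str, List[InstructionExample]]]:
    """Yield (stage, samples) pairs in curriculum order, skipping empty stages."""
    for stage in stages:
        subset = [ex for ex in examples if ex.stage == stage]
        if subset:
            yield stage, subset


def run_curriculum(model, examples, fine_tune: Callable, stages=STAGES):
    """Fine-tune stage by stage; each stage starts from the previous result."""
    for _stage, subset in curriculum_order(examples, stages):
        model = fine_tune(model, subset)
    return model


if __name__ == "__main__":
    data = [
        InstructionExample("clip_2", "Who took the shot?", "The striker.", "question_answering"),
        InstructionExample("clip_1", "What event is shown?", "A corner kick.", "concepts"),
    ]

    # Stand-in for a real training step: it only records which stage ran.
    def record_stage(model, subset):
        return model + [subset[0].stage]

    print(run_curriculum([], data, record_stage))
    # prints ['concepts', 'question_answering']
```

In a real setup the `fine_tune` callback would wrap whatever training loop the chosen VLM uses. The sketch only shows that samples are grouped by stage and that each stage continues from the weights produced by the one before it.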