MSUE: Multi-Modal Soccer Understanding Expert

📅 2026-06-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses multimodal visual question answering (VQA) for soccer videos by proposing a multi-expert architecture with dynamic routing. A large language model dispatches each question to one of three specialized experts (text, image, or video) based on the question's semantics. The text expert uses Gemini-3 Flash, the image expert is a fine-tuned Qwen3-VL model, and the video expert draws on an external knowledge base. The authors also introduce a cost-effective data synthesis method, driven by a vision-language model, that generates diverse training samples from raw domain data. On the SoccerNet VQA Challenge benchmark, the approach reaches 95% accuracy and ranks third on the leaderboard.
📝 Abstract
This paper presents our solution to the 2026 SoccerNet VQA Challenge. First, we develop a cost-effective data synthesis pipeline driven by a Vision-Language Model (VLM), which systematically restructures raw domain data into diverse VQA samples, including both concise answers and long-form responses. Second, we propose MSUE, a multi-expert question answering architecture that employs a Large Language Model (LLM) to dynamically dispatch questions to text, image, and video experts. These experts are instantiated as a strong text baseline (Gemini3-Flash), a fine-tuned Qwen3-VL, and an external knowledge base, respectively, and work together to improve VQA performance. MSUE achieves an accuracy of 0.95 on the challenge benchmark, securing third place on the leaderboard.
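The dispatch idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `Question`, `MultiExpertQA`, and `placeholder_router` are hypothetical, the router here is a simple rule that stands in for the LLM dispatcher, and the experts are stubs rather than calls to Gemini3-Flash, Qwen3-VL, or a knowledge base.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class Question:
    text: str
    image_path: Optional[str] = None  # hypothetical field, not from the paper
    video_path: Optional[str] = None  # hypothetical field, not from the paper


Expert = Callable[[Question], str]
Router = Callable[[Question], str]


def placeholder_router(question: Question) -> str:
    # Assumption: stands in for the LLM dispatcher. A real router would
    # prompt an LLM with the question; this one only checks attachments.
    if question.video_path is not None:
        return "video"
    if question.image_path is not None:
        return "image"
    return "text"


class MultiExpertQA:
    """Routes each question to one registered expert and returns its answer."""

    def __init__(self, router: Router, experts: Dict[str, Expert],
                 fallback: str = "text"):
        if fallback not in experts:
            raise ValueError(f"fallback expert {fallback!r} is not registered")
        self.router = router
        self.experts = experts
        self.fallback = fallback

    def answer(self, question: Question) -> str:
        label = self.router(question)
        # Unknown router labels fall back to the default expert
        # instead of raising.
        expert = self.experts.get(label, self.experts[self.fallback])
        return expert(question)


# Stub experts for illustration only; each just tags the question text.
experts: Dict[str, Expert] = {
    "text": lambda q: f"[text expert] {q.text}",
    "image": lambda q: f"[image expert] {q.text}",
    "video": lambda q: f"[video expert] {q.text}",
}
qa = MultiExpertQA(placeholder_router, experts)
```

Because the router is passed in as a plain callable, the rule-based placeholder could be swapped for an LLM-backed function without touching the experts, which is the separation the multi-expert design relies on.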
Problem

Research questions and friction points this paper is trying to address.

SoccerNet
Visual Question Answering
Multi-Modal Understanding
Sports AI
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-Language Model
Multi-Expert Architecture
Data Synthesis
Visual Question Answering
Large Language Model