MSUE: Multi-Modal Soccer Understanding Expert

📅 2026-06-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses multimodal semantic understanding in visual question answering (VQA) for soccer by proposing a multi-expert architecture with dynamic routing. A large language model directs each query to one of three specialized experts (text, image, or video) based on the semantics of the question. The text expert is Gemini 3 Flash, the image expert is a fine-tuned Qwen3-VL model, and the video expert draws on an external knowledge base. The authors also introduce a cost-effective data synthesis method, driven by a vision-language model, that generates diverse training samples. On the SoccerNet VQA Challenge benchmark, the approach achieves 95% accuracy and ranks third on the leaderboard.
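The routing step described in this summary can be illustrated with a minimal sketch. It is not the authors' implementation: the keyword-based `keyword_router` is a hypothetical stand-in for the LLM router, and the three lambda experts are placeholders for Gemini 3 Flash, the fine-tuned Qwen3-VL, and the knowledge-base-backed video expert. All names here are assumptions made for illustration.

```python
from typing import Callable, Dict

Expert = Callable[[str], str]


def keyword_router(question: str) -> str:
    """Hypothetical stand-in for the LLM router: returns an expert label.

    The paper routes on question semantics with an LLM; simple keyword
    matching is used here only to keep the sketch self-contained.
    """
    q = question.lower()
    if any(word in q for word in ("video", "clip", "replay")):
        return "video"
    if any(word in q for word in ("image", "picture", "photo")):
        return "image"
    return "text"


def make_dispatcher(router: Callable[[str], str],
                    experts: Dict[str, Expert]) -> Expert:
    """Build a function that sends each question to the expert the router picks."""
    def dispatch(question: str) -> str:
        label = router(question)
        # Fall back to the text expert if the router emits an unknown label.
        expert = experts.get(label, experts["text"])
        return expert(question)
    return dispatch


# Placeholder experts (assumptions); real ones would call the actual models.
experts: Dict[str, Expert] = {
    "text": lambda q: "text-expert answer",
    "image": lambda q: "image-expert answer",
    "video": lambda q: "video-expert answer",
}

dispatch = make_dispatcher(keyword_router, experts)
```

Keeping the router as an injected callable means the keyword heuristic could be replaced by an LLM call without changing the dispatch logic, and the fallback to the text expert guards against labels the router was not supposed to produce.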

📝 Abstract

This paper presents our solution to the 2026 SoccerNet VQA Challenge. First, we develop a cost-effective data synthesis pipeline driven by a Vision-Language Model (VLM), which systematically restructures raw domain data into diverse VQA samples with both concise answers and long-form responses. Second, we propose MSUE, a multi-expert question answering architecture that employs a Large Language Model (LLM) to dynamically dispatch questions to text, image, and video experts. These experts are instantiated as a strong text baseline (Gemini 3 Flash), a fine-tuned Qwen3-VL, and an external knowledge base, respectively, and work together to improve VQA performance. MSUE achieves an accuracy of 0.95 on the challenge benchmark, securing third place on the leaderboard.
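As a rough illustration of restructuring one raw record into VQA samples with a concise and a long-form answer, consider the sketch below. The record schema (`image_id`, `player`, `team`), the question template, and the `fake_vlm` function are all hypothetical; the abstract does not specify the prompts or data format, and a real pipeline would call an actual VLM where `generate` is used.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class VQASample:
    image_id: str
    question: str
    answer: str
    style: str  # "concise" or "long-form"


def synthesize(record: Dict[str, str],
               generate: Callable[[str], str]) -> List[VQASample]:
    """Turn one raw record into a concise and a long-form VQA sample.

    `generate` stands in for the VLM call that expands a short answer
    into a full-sentence response (an assumption of this sketch).
    """
    question = f"Which team does {record['player']} play for?"
    concise = record["team"]
    prompt = f"Answer in one full sentence. Q: {question} A: {concise}"
    long_form = generate(prompt)
    return [
        VQASample(record["image_id"], question, concise, "concise"),
        VQASample(record["image_id"], question, long_form, "long-form"),
    ]


def fake_vlm(prompt: str) -> str:
    """Deterministic placeholder for a VLM, used only to make the sketch runnable."""
    answer = prompt.rsplit("A: ", 1)[1]
    return f"The player plays for {answer}."


samples = synthesize(
    {"image_id": "img_001", "player": "Player A", "team": "Team B"},
    fake_vlm,
)
```

Producing both answer styles from the same record is one way a single source item could yield several training samples, which is the cost-saving idea the abstract points to.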

Problem

Research questions and friction points this paper is trying to address.

SoccerNet

Visual Question Answering

Multi-Modal Understanding

Sports AI

Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-Language Model

Multi-Expert Architecture

Data Synthesis

Visual Question Answering