The Right Inference Strategy Is All You Need: Nearly Training-Free Domain-Wise Inference for EgoCross Challenge

📅 2026-05-30

📈 Citations: 0

✨ Influential: 0

career value

171K/year

🤖 AI Summary

This work addresses the challenge of domain shift in rare first-person video question answering tasks—such as those involving surgery, industrial assembly, extreme sports, and animal perspectives—within the EgoCross benchmark, where frozen multimodal large language models struggle to transfer knowledge effectively. The authors propose a nearly training-free, domain-aware reasoning strategy that tailors input formatting, prompt templates, and answer mapping mechanisms specifically for each of the four target domains. By decoupling inference across domains and leveraging only minimal supervision (as few as two SFT checkpoints for some domains), this approach activates knowledge within the frozen Qwen3-VL-4B model efficiently. It achieves a strong overall accuracy of 66.98% on the final evaluation, substantially outperforming conventional methods and demonstrating both its effectiveness and novelty.

📝 Abstract

EgoCross evaluates multimodal large language models on egocentric video question answering under substantial domain shift, where test videos come from surgery, industrial assembly, extreme sports, and animal-mounted cameras rather than ordinary daily-life scenes. In the source-limited track, the base model is fixed to Qwen3-VL-4B, while the official task-specific support set contains only 20 training samples. This setting makes the challenge less about model scaling and more about exposing the right visual, temporal, and answer-selection cues to a constrained model. Our key observation is that the frozen baseline model is not simply incapable of these rare scenarios; rather, it often fails to transfer its existing visual-language knowledge to the new task format without an appropriate interface. We therefore use a domain-wise inference strategy that treats the four target domains separately and designs different input, prompting, and answer-mapping procedures according to each domain's task characteristics. These strategies make the rare egocentric scenes more interpretable to the VLM by emphasizing the cues that matter for each domain. The resulting system is nearly training-free: surgery, and animal questions are answered with the base Qwen3-VL-4B model, while XSports and industry use only the official SFT checkpoint trained for two epochs on the provided 20 training samples. On the final evaluation, this simple strategy reaches 66.98\% overall accuracy, suggesting that careful domain-aware inference can compensate for limited base-model strength and recover much of the ability already present in the baseline model.

Problem

Research questions and friction points this paper is trying to address.

domain shift

egocentric video question answering

few-shot learning

multimodal large language models

training-free inference

Innovation

Methods, ideas, or system contributions that make the work stand out.

domain-wise inference

nearly training-free

egocentric video QA