🤖 AI Summary
Action recognition models often exhibit background bias: they rely on contextual scene cues rather than on human motion and pose, which hurts generalization and robustness. This work presents a systematic evaluation of background bias across three model families: standard classification models, contrastive vision-language pre-trained models (e.g., CLIP), and video large language models (VLLMs). All three tend to default to background reasoning. Two mitigation strategies are proposed: (1) for classification models, feeding segmented human input so that background content is explicitly removed, and (2) for VLLMs, prompt tuning that combines manually designed prompts with automated search to steer predictions toward human pose and motion. A quantitative bias evaluation framework is used to measure how well each strategy works. Experiments show a 3.78% reduction in background bias for classification models and a 9.85% shift toward human-focused reasoning for VLLMs.
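To make the segmentation-based strategy concrete, the following is a minimal sketch of masking out the background of a video clip, assuming per-frame binary human masks are already available from some instance segmentation model. The function name `mask_background`, the array layout, and the constant fill value are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def mask_background(frames: np.ndarray,
                    human_masks: np.ndarray,
                    fill_value: int = 0) -> np.ndarray:
    """Keep only human pixels in a clip (hypothetical helper).

    frames:      (T, H, W, C) array of video frames.
    human_masks: (T, H, W) array, truthy where a person is present.
    fill_value:  value written to every background pixel (assumed choice).
    """
    if frames.shape[:3] != human_masks.shape:
        raise ValueError("frames and human_masks must agree on (T, H, W)")

    # Add a trailing axis so the mask broadcasts over the channel dimension.
    keep = human_masks.astype(bool)[..., None]

    # Cast the fill value to the frame dtype so the output dtype is unchanged.
    fill = np.asarray(fill_value, dtype=frames.dtype)
    return np.where(keep, frames, fill)
```

The masked clip could then be passed to a classifier in place of, or alongside, the original frames. Which of those options the paper uses is not stated in this summary.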
📝 Abstract
Human action recognition models often rely on background cues rather than human movement and pose to make predictions, a behavior known as background bias. We present a systematic analysis of background bias across classification models, contrastive text-image pretrained models, and Video Large Language Models (VLLMs), and find that all exhibit a strong tendency to default to background reasoning. Next, we propose mitigation strategies for classification models and show that incorporating segmented human input decreases background bias by 3.78%. Finally, we explore manual and automated prompt tuning for VLLMs, demonstrating that prompt design can steer predictions towards human-focused reasoning by 9.85%.
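The abstract does not say how background bias is scored. One plausible way to quantify it, sketched below purely as an assumption, is to evaluate on clips where the action performed by the person conflicts with the action suggested by the scene, and then count how often the prediction follows the background instead of the person. The names `ConflictSample`, `background_bias_rate`, and `human_focus_rate` are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ConflictSample:
    action_label: str      # what the person is actually doing
    background_label: str  # action the scene alone would suggest
    predicted_label: str   # model output for the clip


def _conflicting(samples: Sequence[ConflictSample]) -> list:
    # Only clips where person and scene disagree reveal which cue was used.
    conflicts = [s for s in samples if s.action_label != s.background_label]
    if not conflicts:
        raise ValueError("need at least one sample with conflicting labels")
    return conflicts


def background_bias_rate(samples: Sequence[ConflictSample]) -> float:
    """Fraction of conflicting clips predicted as the background's action."""
    conflicts = _conflicting(samples)
    hits = sum(s.predicted_label == s.background_label for s in conflicts)
    return hits / len(conflicts)


def human_focus_rate(samples: Sequence[ConflictSample]) -> float:
    """Fraction of conflicting clips predicted as the person's action."""
    conflicts = _conflicting(samples)
    hits = sum(s.predicted_label == s.action_label for s in conflicts)
    return hits / len(conflicts)
```

Under this assumed definition, a mitigation strategy would be judged by comparing the two rates before and after it is applied, for example with and without segmented human input, or with a default versus a tuned prompt.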