🤖 AI Summary
Vision-language models (VLMs) exhibit limited performance in low-resource video classification due to the "rationale gap" between sparse domain-specific spatio-temporal content and abstract class labels. To address this, we propose a two-stage self-improving fine-tuning framework. In Stage I, prompt engineering enables VLMs to autonomously generate domain-specific video reasoning texts, which serve as interpretable, intermediate supervision signals. In Stage II, we jointly optimize self-supervised rationale-guided fine-tuning and standard supervised fine-tuning. This is the first work to integrate self-generated textual rationales into VLM-based video understanding without requiring additional human annotations, thereby significantly enhancing the model's capacity for domain-specific spatio-temporal reasoning. Extensive experiments on multiple low-data video benchmarks demonstrate consistent superiority over conventional supervised fine-tuning, validating both effectiveness and generalizability under data-scarce conditions.
📝 Abstract
Vision Language Models (VLMs) are becoming increasingly integral to multimedia understanding; however, they often struggle with domain-specific video classification tasks, particularly in cases with limited data. This stems from a critical *rationale gap*: sparse domain data is insufficient to bridge the semantic distance between complex spatio-temporal content and abstract classification labels. We propose a two-stage self-improvement paradigm that bridges this gap without new annotations. First, we prompt the VLM to generate a detailed textual rationale for each video, compelling it to articulate the domain-specific logic. The VLM is then fine-tuned on these self-generated rationales, using this intermediate supervision to align its representations with the nuances of the target domain. Second, conventional supervised fine-tuning (SFT) is performed on the task labels, achieving markedly higher effectiveness thanks to the model's pre-acquired domain reasoning. Extensive experiments on diverse datasets demonstrate that our method significantly outperforms direct SFT, validating self-generated rationales as an effective, annotation-efficient paradigm for adapting VLMs to domain-specific video analysis.
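The control flow of the two stages can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names (`vlm_generate`, `finetune_step`), the string "videos", and the sequential ordering (rationale-guided fine-tuning first, label SFT second, per the abstract) are all assumptions for exposition.

```python
# Hedged sketch of the two-stage self-improvement paradigm described above.
# `vlm_generate` and `finetune_step` are illustrative stand-ins, not the
# paper's actual model or training code.

def two_stage_finetune(vlm_generate, finetune_step, videos, labels, rationale_prompt):
    # Stage I(a): prompt the VLM to articulate domain-specific reasoning
    # for each video -- no human annotations are involved.
    rationales = [vlm_generate(rationale_prompt, v) for v in videos]
    # Stage I(b): rationale-guided fine-tuning on the self-generated text,
    # aligning representations with the target domain.
    for v, r in zip(videos, rationales):
        finetune_step(v, target=r)
    # Stage II: conventional SFT on the abstract class labels, which the
    # abstract reports is more effective after the rationale stage.
    for v, y in zip(videos, labels):
        finetune_step(v, target=y)
    return rationales

# Toy stand-ins so the skeleton runs end to end (strings instead of models).
calls = []
toy_vlm = lambda prompt, video: f"rationale({video})"
toy_step = lambda video, target: calls.append((video, target))
rationales = two_stage_finetune(
    toy_vlm, toy_step,
    videos=["clip_a", "clip_b"],
    labels=["fall", "walk"],
    rationale_prompt="Explain the actions and context you observe:",
)
```

In a real setting, `finetune_step` would be a gradient update on the VLM's language-modeling loss with the rationale or label as the target sequence; the point of the sketch is only the ordering of the two supervision signals.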