🤖 AI Summary
Dense video captioning has long suffered from the tight coupling between event boundary localization and semantic description, and from its reliance on labor-intensive event-level annotations. To address both issues, this paper proposes the first end-to-end framework that requires no event-level supervision. Methodologically, it integrates unsupervised video temporal segmentation, self-supervised temporal modeling, contrastive-learning-driven cross-modal alignment, and a Transformer-based decoder to autonomously discover event structure and generate fine-grained captions directly from raw video. Evaluated on ActivityNet Captions and YouCook2, the approach sets a new state of the art, improving event localization F1-score by 8.2% and captioning BLEU-4 by 4.7%. By eliminating the need for manual event annotations, it substantially reduces annotation cost and establishes a new paradigm for weakly supervised video understanding.
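
The summary names a four-stage pipeline (unsupervised temporal segmentation, self-supervised temporal encoding, contrastive video-text alignment, and a Transformer decoder). The sketch below shows one plausible way such a pipeline could be wired together in PyTorch. Every design choice here is an illustrative assumption rather than the paper's actual implementation: the cosine-distance boundary heuristic, mean-pooled event embeddings, InfoNCE alignment loss, module sizes, and the naive index-based pairing of discovered events with video-level caption sentences.

```python
# Illustrative sketch only: component choices and hyperparameters are assumptions,
# not the paper's actual design.
import torch
import torch.nn as nn
import torch.nn.functional as F


def segment_by_feature_change(frame_feats: torch.Tensor, threshold: float = 0.5):
    """Unsupervised boundary proposal: split wherever consecutive frame features
    differ sharply (cosine similarity below a threshold). Returns (start, end)
    index pairs covering the whole clip."""
    sims = F.cosine_similarity(frame_feats[:-1], frame_feats[1:], dim=-1)
    boundaries = (sims < threshold).nonzero(as_tuple=True)[0] + 1
    cuts = [0] + boundaries.tolist() + [frame_feats.size(0)]
    return [(cuts[i], cuts[i + 1]) for i in range(len(cuts) - 1) if cuts[i] < cuts[i + 1]]


class EventCaptioner(nn.Module):
    """Toy end-to-end model: a temporal Transformer encoder over frame features,
    mean-pooled event embeddings, and a Transformer decoder that generates
    caption tokens conditioned on each discovered event."""

    def __init__(self, feat_dim: int = 512, d_model: int = 256, vocab_size: int = 10000):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.temporal_encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.token_embed = nn.Embedding(vocab_size, d_model)
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, frame_feats: torch.Tensor, caption_tokens: torch.Tensor):
        # frame_feats: (T, feat_dim); caption_tokens: (num_sentences, L),
        # i.e. video-level caption sentences without timestamps.
        x = self.temporal_encoder(self.proj(frame_feats).unsqueeze(0)).squeeze(0)
        events = segment_by_feature_change(frame_feats)
        event_embs = torch.stack([x[s:e].mean(dim=0) for s, e in events])

        # Naive pairing of events to sentences by index (a real system would
        # learn this matching); decode each caption with teacher forcing.
        n = min(event_embs.size(0), caption_tokens.size(0))
        memory = event_embs[:n].unsqueeze(1)                 # (n, 1, d_model)
        tgt = self.token_embed(caption_tokens[:n])           # (n, L, d_model)
        logits = self.lm_head(self.decoder(tgt, memory))     # (n, L, vocab)
        return event_embs, logits


def contrastive_alignment_loss(event_embs, text_embs, temperature: float = 0.07):
    """Symmetric InfoNCE loss pulling each event embedding toward its paired
    text embedding and away from the other pairs in the batch."""
    v = F.normalize(event_embs, dim=-1)
    t = F.normalize(text_embs, dim=-1)
    logits = v @ t.t() / temperature
    targets = torch.arange(v.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    model = EventCaptioner()
    frames = torch.randn(120, 512)                  # pre-extracted frame features
    captions = torch.randint(0, 10000, (3, 12))     # caption sentences, no timestamps
    event_embs, logits = model(frames, captions)
    print(event_embs.shape, logits.shape)
```

The key property this sketch tries to mirror is that the only textual supervision is a set of untimestamped caption sentences: event boundaries come from the unsupervised segmentation step, and the contrastive loss is what ties discovered events to text, consistent with the "no event-level supervision" claim in the summary.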