🤖 AI Summary
Dense video captioning has long suffered from the tight coupling between event boundary localization and semantic description, and from its reliance on labor-intensive event-level annotations. To address both issues, this paper proposes the first end-to-end framework that requires no event-level supervision. Methodologically, it integrates unsupervised video temporal segmentation, self-supervised temporal modeling, contrastive-learning-driven cross-modal alignment, and a Transformer-based decoder to autonomously discover event structure and generate fine-grained captions directly from raw video. Evaluated on ActivityNet Captions and YouCook2, the approach sets a new state of the art, improving event localization F1-score by 8.2% and captioning BLEU-4 by 4.7%. By eliminating the need for manual event annotations, it substantially reduces annotation cost and establishes a new paradigm for weakly supervised video understanding.
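
The summary names a four-stage pipeline (unsupervised temporal segmentation, self-supervised temporal encoding, contrastive video-text alignment, and a Transformer decoder). The sketch below shows one plausible way such a pipeline could be wired together in PyTorch. Every design choice here is an illustrative assumption rather than the paper's actual implementation: the cosine-distance boundary heuristic, mean-pooled event embeddings, InfoNCE alignment loss, module sizes, and the naive index-based pairing of discovered events with video-level caption sentences.

```python
# Illustrative sketch only: component choices and hyperparameters are assumptions,
# not the paper's actual design.
import torch
import torch.nn as nn
import torch.nn.functional as F


def segment_by_feature_change(frame_feats: torch.Tensor, threshold: float = 0.5):
    """Unsupervised boundary proposal: split wherever consecutive frame features
    differ sharply (cosine similarity below a threshold). Returns (start, end)
    index pairs covering the whole clip."""
    sims = F.cosine_similarity(frame_feats[:-1], frame_feats[1:], dim=-1)
    boundaries = (sims < threshold).nonzero(as_tuple=True)[0] + 1
    cuts = [0] + boundaries.tolist() + [frame_feats.size(0)]
    return [(cuts[i], cuts[i + 1]) for i in range(len(cuts) - 1) if cuts[i] < cuts[i + 1]]


class EventCaptioner(nn.Module):
    """Toy end-to-end model: a temporal Transformer encoder over frame features,
    mean-pooled event embeddings, and a Transformer decoder that generates
    caption tokens conditioned on each discovered event."""

    def __init__(self, feat_dim: int = 512, d_model: int = 256, vocab_size: int = 10000):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.temporal_encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.token_embed = nn.Embedding(vocab_size, d_model)
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, frame_feats: torch.Tensor, caption_tokens: torch.Tensor):
        # frame_feats: (T, feat_dim); caption_tokens: (num_sentences, L),
        # i.e. video-level caption sentences without timestamps.
        x = self.temporal_encoder(self.proj(frame_feats).unsqueeze(0)).squeeze(0)
        events = segment_by_feature_change(frame_feats)
        event_embs = torch.stack([x[s:e].mean(dim=0) for s, e in events])

        # Naive pairing of events to sentences by index (a real system would
        # learn this matching); decode each caption with teacher forcing.
        n = min(event_embs.size(0), caption_tokens.size(0))
        memory = event_embs[:n].unsqueeze(1)                 # (n, 1, d_model)
        tgt = self.token_embed(caption_tokens[:n])           # (n, L, d_model)
        logits = self.lm_head(self.decoder(tgt, memory))     # (n, L, vocab)
        return event_embs, logits


def contrastive_alignment_loss(event_embs, text_embs, temperature: float = 0.07):
    """Symmetric InfoNCE loss pulling each event embedding toward its paired
    text embedding and away from the other pairs in the batch."""
    v = F.normalize(event_embs, dim=-1)
    t = F.normalize(text_embs, dim=-1)
    logits = v @ t.t() / temperature
    targets = torch.arange(v.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    model = EventCaptioner()
    frames = torch.randn(120, 512)                  # pre-extracted frame features
    captions = torch.randint(0, 10000, (3, 12))     # caption sentences, no timestamps
    event_embs, logits = model(frames, captions)
    print(event_embs.shape, logits.shape)
```

The key property this sketch tries to mirror is that the only textual supervision is a set of untimestamped caption sentences: event boundaries come from the unsupervised segmentation step, and the contrastive loss is what ties discovered events to text, consistent with the "no event-level supervision" claim in the summary.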