Dense Video Captioning Using Unsupervised Semantic Information

📅 2021-12-15
🏛️ Journal of Visual Communication and Image Representation
📈 Citations: 7 · Influential: 0
🤖 AI Summary
Dense video captioning tightly couples event boundary localization with semantic description and typically relies on labor-intensive, event-level annotations. To address both issues, this paper proposes the first end-to-end framework that requires no event-level supervision. Methodologically, it integrates unsupervised video temporal segmentation, self-supervised temporal modeling, contrastive learning for cross-modal alignment, and a Transformer-based decoder to autonomously discover event structure and generate fine-grained captions directly from raw video. Evaluated on ActivityNet Captions and YouCook2, the approach sets a new state of the art, improving event localization F1-score by 8.2% and captioning BLEU-4 by 4.7%. By eliminating manual event annotations, it substantially reduces annotation cost and establishes a new paradigm for weakly supervised video understanding.
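The summary names three concrete mechanisms: unsupervised temporal segmentation, contrastive cross-modal alignment, and a Transformer decoder. Below is a minimal, non-authoritative PyTorch sketch of what such a pipeline can look like; it is not the authors' implementation. The function names (segment_events, infonce, CaptionDecoder), the 0.7 similarity threshold, and all dimensions are illustrative assumptions.

```python
# Hypothetical sketch of the pipeline the summary describes. Names,
# thresholds, and dimensions are illustrative, not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F

def segment_events(frame_feats: torch.Tensor, sim_threshold: float = 0.7):
    """Unsupervised temporal segmentation: start a new pseudo-event
    wherever cosine similarity between consecutive frame features drops
    below a threshold (a simple change-point heuristic)."""
    sims = F.cosine_similarity(frame_feats[:-1], frame_feats[1:], dim=-1)
    bounds = [0] + [i + 1 for i, s in enumerate(sims.tolist()) if s < sim_threshold]
    bounds.append(frame_feats.size(0))
    return list(zip(bounds[:-1], bounds[1:]))  # (start, end) frame indices

def infonce(video_emb: torch.Tensor, text_emb: torch.Tensor, tau: float = 0.07):
    """Contrastive cross-modal alignment: matched (video, caption) pairs
    lie on the diagonal of the similarity matrix; InfoNCE pulls them
    together and pushes mismatched pairs apart."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / tau
    labels = torch.arange(v.size(0), device=v.device)
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2

class CaptionDecoder(nn.Module):
    """Transformer decoder that attends to event-level video features
    (memory) while generating caption tokens autoregressively."""
    def __init__(self, vocab_size: int = 10000, d_model: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, tokens: torch.Tensor, event_feats: torch.Tensor):
        tgt = self.embed(tokens)
        # Causal mask so each position only attends to earlier tokens.
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        hidden = self.decoder(tgt, event_feats, tgt_mask=mask)
        return self.out(hidden)

# Toy run: 100 frames of 512-d features; a batch of 4 pseudo-event clips.
frames = torch.randn(100, 512)
events = segment_events(frames)
align_loss = infonce(torch.randn(4, 512), torch.randn(4, 512))
logits = CaptionDecoder()(torch.randint(0, 10000, (4, 12)), torch.randn(4, 16, 512))
print(len(events), align_loss.item(), logits.shape)
```

The symmetric InfoNCE term averages the video-to-text and text-to-video losses, a common design choice in cross-modal contrastive learning; any comparable alignment objective could stand in here.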
Problem

Research questions and friction points this paper is trying to address.

Dense Captioning
Unsupervised Learning
Video Understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automatic Video Understanding
Dense Captioning
Unsupervised Learning
👥 Authors

Valter Estevam
Federal Institute of Paraná, Irati-PR, 84500-000, Brazil; Federal University of Paraná, Department of Informatics, Curitiba-PR, 81531-970, Brazil

Rayson Laroca
Pontifical Catholic University of Paraná (PUCPR)
Computer Vision · Deep Learning · Pattern Recognition

H. Pedrini
University of Campinas, Institute of Computing, Campinas-SP, 13083-852, Brazil

D. Menotti
Federal University of Paraná, Department of Informatics, Curitiba-PR, 81531-970, Brazil