Enhancing Vision-Language Pre-Training with Rich Supervisions

📅 2024-03-05
🏛️ Computer Vision and Pattern Recognition
📈 Citations: 11
✨ Influential: 1
📄 PDF
🤖 AI Summary
To address weak visual-linguistic alignment and poor downstream adaptability in web-based vision-language pretraining, this paper proposes S4, a strongly supervised pretraining paradigm built on large-scale webpage screenshots. Methodologically, S4 renders webpages, parses their HTML DOM structure, models the spatial coordinates of elements, and jointly trains on ten low-cost, high-fidelity, scenario-driven pretraining tasks, integrating DOM hierarchy, spatial layout, and multi-granularity textual semantics to bridge the gap between native web visual-text alignment and downstream task requirements. Evaluated on nine popular downstream tasks, S4 achieves substantial gains: up to a 76.1% improvement on table detection and at least 1% on widget captioning. The approach advances web-centric multimodal representation learning by grounding pretraining signals directly in the structural and spatial semantics of the web.
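
The render-parse-supervise loop described above can be made concrete. Below is a minimal, hypothetical sketch of the data-harvesting step, assuming Playwright for rendering; the `harvest` function, the viewport size, and the leaf-element filter are illustrative choices, not the paper's actual pipeline.

```python
# Hypothetical sketch of S4-style supervision harvesting: render a page,
# capture the screenshot, and record each visible leaf element's tag, text,
# and pixel bounding box. Requires: pip install playwright && playwright install chromium.
from playwright.sync_api import sync_playwright

def harvest(url: str):
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page(viewport={"width": 1280, "height": 1280})
        page.goto(url, wait_until="networkidle")
        screenshot = page.screenshot()  # PNG bytes of the rendered viewport
        # Collect leaf DOM elements that carry rendered text, with coordinates.
        records = page.evaluate(
            """() => [...document.querySelectorAll('*')]
                 .filter(el => el.children.length === 0 && el.innerText)
                 .map(el => {
                     const r = el.getBoundingClientRect();
                     return {tag: el.tagName.toLowerCase(),
                             text: el.innerText.trim(),
                             box: [r.x, r.y, r.width, r.height]};
                 })"""
        )
        browser.close()
        return screenshot, records
```

Because the text, tag, and coordinates all come from the DOM itself, no human labeling is needed, which is what makes this kind of supervision cheap at web scale.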

📝 Abstract
We propose Strongly Supervised pre-training with ScreenShots (S4) - a novel pre-training paradigm for Vision-Language Models using data from large-scale web screenshot rendering. Using web screenshots unlocks a treasure trove of visual and textual cues that are not present in image-text pairs. In S4, we leverage the inherent tree-structured hierarchy of HTML elements and the spatial localization to carefully design 10 pre-training tasks with large-scale annotated data. These tasks resemble downstream tasks across different domains, and the annotations are cheap to obtain. We demonstrate that, compared to current screenshot pre-training objectives, our innovative pre-training method significantly enhances the performance of image-to-text models on nine varied and popular downstream tasks - up to 76.1% improvement on Table Detection, and at least 1% on Widget Captioning.
Problem

Research questions and friction points this paper is trying to address.

Pre-training on image-text pairs provides only weak visual-linguistic alignment and discards the structural cues available on the web.
Existing screenshot pre-training objectives transfer poorly to diverse downstream tasks.
Large-scale supervision for layout- and structure-aware tasks is expensive to annotate manually.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses rendered web screenshots as a source of rich, aligned visual-textual cues
Leverages the HTML hierarchy and element coordinates to derive 10 pre-training tasks with cheap annotations (see the sketch after this list)
Improves image-to-text model performance across nine diverse downstream tasks
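
Given harvested (tag, text, box) records like those sketched earlier, each pre-training task can be phrased as a prompt-target text pair over the screenshot. The sketch below illustrates two plausible S4-style tasks (string localization and tag prediction at a region); the `<loc_*>` token scheme, the bin count, and the task phrasings are hypothetical assumptions, not the paper's exact formats.

```python
# Minimal sketch (not the paper's exact task definitions) of turning DOM
# records into sequence-to-sequence pre-training samples.

def quantize_box(box, width=1280, height=1280, bins=1000):
    """Map pixel coordinates to integer location bins so that layout can be
    expressed as discrete tokens in a text decoder's vocabulary."""
    x, y, w, h = box
    to_bin = lambda v, size: min(bins - 1, max(0, int(v / size * bins)))
    return (to_bin(x, width), to_bin(y, height),
            to_bin(x + w, width), to_bin(y + h, height))

def make_samples(records):
    samples = []
    for rec in records:
        x0, y0, x1, y1 = quantize_box(rec["box"])
        loc = f"<loc_{x0}><loc_{y0}><loc_{x1}><loc_{y1}>"
        # Task: string localization -- predict where a string appears on screen.
        samples.append({"prompt": f'find "{rec["text"]}"', "target": loc})
        # Task: tag prediction -- predict the HTML tag rendered at a region.
        samples.append({"prompt": f"tag at {loc}", "target": rec["tag"]})
    return samples
```

Quantizing coordinates into a fixed token set is a common way to let an image-to-text decoder emit layout without a separate detection head; it is used here purely as an illustrative design choice.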
Authors

Yuan Gao (Stanford University)
Kunyu Shi (AWS AI Labs)
Pengkai Zhu (AWS AI Labs)
Edouard Belval (AWS AI Labs)
Oren Nuriel (Amazon)
Srikar Appalaraju (AWS AI Labs)
Shabnam Ghadar (AWS AI Labs)
Vijay Mahadevan (AWS AI Labs)
Zhuowen Tu (UC San Diego)
S. Soatto (AWS AI Labs)