Enhancing Vision-Language Pre-Training with Rich Supervisions

📅 2024-03-05
🏛️ Computer Vision and Pattern Recognition
📈 Citations: 11
✨ Influential: 1
📄 PDF
🤖 AI Summary
To address weak visual-linguistic alignment and poor downstream adaptability in web-based vision-language pretraining, this paper proposes S4, a strongly supervised pretraining paradigm built on large-scale webpage screenshots. Methodologically, S4 renders webpages, parses their HTML DOM structure, models the spatial coordinates of elements, and jointly trains on ten low-cost, high-fidelity, scenario-driven pretraining tasks, integrating DOM hierarchy, spatial layout, and multi-granularity textual semantics to bridge the gap between native web visual-text alignment and downstream task requirements. Evaluated on nine popular downstream tasks, S4 achieves substantial gains: up to a 76.1% improvement on table detection and at least 1% on widget captioning. The approach advances web-centric multimodal representation learning by grounding pretraining signals directly in the structural and spatial semantics of the web.
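
The render-parse-supervise loop described above can be made concrete. Below is a minimal, hypothetical sketch of the data-harvesting step, assuming Playwright for rendering; the `harvest` function, the viewport size, and the leaf-element filter are illustrative choices, not the paper's actual pipeline.

```python
# Hypothetical sketch of S4-style supervision harvesting: render a page,
# capture the screenshot, and record each visible leaf element's tag, text,
# and pixel bounding box. Requires: pip install playwright && playwright install chromium.
from playwright.sync_api import sync_playwright

def harvest(url: str):
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page(viewport={"width": 1280, "height": 1280})
        page.goto(url, wait_until="networkidle")
        screenshot = page.screenshot()  # PNG bytes of the rendered viewport
        # Collect leaf DOM elements that carry rendered text, with coordinates.
        records = page.evaluate(
            """() => [...document.querySelectorAll('*')]
                 .filter(el => el.children.length === 0 && el.innerText)
                 .map(el => {
                     const r = el.getBoundingClientRect();
                     return {tag: el.tagName.toLowerCase(),
                             text: el.innerText.trim(),
                             box: [r.x, r.y, r.width, r.height]};
                 })"""
        )
        browser.close()
        return screenshot, records
```

Because the text, tag, and coordinates all come from the DOM itself, no human labeling is needed, which is what makes this kind of supervision cheap at web scale.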

📝 Abstract
We propose Strongly Supervised pre-training with ScreenShots (S4) - a novel pre-training paradigm for Vision-Language Models using data from large-scale web screenshot rendering. Using web screenshots unlocks a treasure trove of visual and textual cues that are not present in image-text pairs. In S4, we leverage the inherent tree-structured hierarchy of HTML elements and the spatial localization to carefully design 10 pre-training tasks with large-scale annotated data. These tasks resemble downstream tasks across different domains, and the annotations are cheap to obtain. We demonstrate that, compared to current screenshot pre-training objectives, our innovative pre-training method significantly enhances the performance of image-to-text models on nine varied and popular downstream tasks - up to 76.1% improvement on Table Detection, and at least 1% on Widget Captioning.
Problem

Research questions and friction points this paper is trying to address.

Pre-training on image-text pairs provides only weak visual-linguistic alignment and discards the structural cues available on the web.
Existing screenshot pre-training objectives transfer poorly to diverse downstream tasks.
Large-scale supervision for layout- and structure-aware tasks is expensive to annotate manually.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses rendered web screenshots as a source of rich, aligned visual-textual cues
Leverages the HTML hierarchy and element coordinates to derive 10 pre-training tasks with cheap annotations (see the sketch after this list)
Improves image-to-text model performance across nine diverse downstream tasks
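
Given harvested (tag, text, box) records like those sketched earlier, each pre-training task can be phrased as a prompt-target text pair over the screenshot. The sketch below illustrates two plausible S4-style tasks (string localization and tag prediction at a region); the `<loc_*>` token scheme, the bin count, and the task phrasings are hypothetical assumptions, not the paper's exact formats.

```python
# Minimal sketch (not the paper's exact task definitions) of turning DOM
# records into sequence-to-sequence pre-training samples.

def quantize_box(box, width=1280, height=1280, bins=1000):
    """Map pixel coordinates to integer location bins so that layout can be
    expressed as discrete tokens in a text decoder's vocabulary."""
    x, y, w, h = box
    to_bin = lambda v, size: min(bins - 1, max(0, int(v / size * bins)))
    return (to_bin(x, width), to_bin(y, height),
            to_bin(x + w, width), to_bin(y + h, height))

def make_samples(records):
    samples = []
    for rec in records:
        x0, y0, x1, y1 = quantize_box(rec["box"])
        loc = f"<loc_{x0}><loc_{y0}><loc_{x1}><loc_{y1}>"
        # Task: string localization -- predict where a string appears on screen.
        samples.append({"prompt": f'find "{rec["text"]}"', "target": loc})
        # Task: tag prediction -- predict the HTML tag rendered at a region.
        samples.append({"prompt": f"tag at {loc}", "target": rec["tag"]})
    return samples
```

Quantizing coordinates into a fixed token set is a common way to let an image-to-text decoder emit layout without a separate detection head; it is used here purely as an illustrative design choice.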
Authors

Yuan Gao (Stanford University)
Kunyu Shi (AWS AI Labs)
Pengkai Zhu (AWS AI Labs)
Edouard Belval (AWS AI Labs)
Oren Nuriel (Amazon)
Srikar Appalaraju (AWS AI Labs)
Shabnam Ghadar (AWS AI Labs)
Vijay Mahadevan (AWS AI Labs)
Zhuowen Tu (UC San Diego)
S. Soatto (AWS AI Labs)