GOAL: Global-local Object Alignment Learning

📅 2025-03-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing vision-language models (e.g., CLIP) excel at short-text tasks but degrade substantially on long-text image retrieval, largely because they are pretrained on predominantly short captions and have limited capacity to model complex semantic structure. To address this, we propose GOAL, a global–local collaborative alignment framework built on two-stage fine-tuning: it preserves CLIP's global semantic alignment while introducing fine-grained Local Image-Sentence Matching (LISM) and Token Similarity-based Learning (TSL), an attention-propagation mechanism that supports interpretable, scalable cross-modal alignment. The method integrates region-level visual feature extraction, sentence-level text segmentation, and token-level attention distillation. Evaluated on three newly constructed long-text image retrieval benchmarks, our approach significantly outperforms strong baselines, yielding cross-modal embeddings with enhanced discriminability and fine-grained semantic consistency.

📝 Abstract
Vision-language models like CLIP have shown impressive capabilities in aligning images and text, but they often struggle with lengthy and detailed text descriptions because of their training focus on short and concise captions. We present GOAL (Global-local Object Alignment Learning), a novel fine-tuning method that enhances CLIP's ability to handle lengthy text by leveraging both global and local semantic alignments between image and lengthy text. Our approach consists of two key components: Local Image-Sentence Matching (LISM), which identifies corresponding pairs between image segments and descriptive sentences, and Token Similarity-based Learning (TSL), which efficiently propagates local element attention through these matched pairs. Evaluating GOAL on three new benchmarks for image-lengthy text retrieval, we demonstrate significant improvements over baseline CLIP fine-tuning, establishing a simple yet effective approach for adapting CLIP to detailed textual descriptions. Through extensive experiments, we show that our method's focus on local semantic alignment alongside global context leads to more nuanced and representative embeddings, particularly beneficial for tasks requiring fine-grained understanding of lengthy text descriptions.
Problem

Research questions and friction points this paper is trying to address.

Enhance CLIP's ability to handle lengthy text descriptions
Improve global-local semantic alignment between images and text
Address fine-grained understanding challenges in lengthy text retrieval
Innovation

Methods, ideas, or system contributions that make the work stand out.

Enhances CLIP with global-local alignment
Uses Local Image-Sentence Matching (LISM)
Implements Token Similarity-based Learning (TSL)
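The two components above can be sketched in a minimal, illustrative form. This is not the paper's implementation: the function names, the argmax matching rule, and the softmax weighting are our assumptions. The actual LISM operates on segmented image regions paired with sentence-split captions, and TSL additionally distills token-level attention; here we only illustrate the underlying idea of cosine-similarity matching between local embeddings.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Normalize embeddings to unit length so dot products are cosine similarities."""
    x = np.asarray(x, dtype=float)
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def local_image_sentence_matching(region_embs, sentence_embs):
    """LISM-style matching (sketch): pair each sentence with its most
    similar image region by cosine similarity.

    region_embs:   (n_regions, d) embeddings of image segments
    sentence_embs: (n_sentences, d) embeddings of caption sentences
    Returns (matches, sim): per-sentence best-region indices and the
    full (n_sentences, n_regions) similarity matrix.
    """
    R = l2_normalize(region_embs)
    S = l2_normalize(sentence_embs)
    sim = S @ R.T
    return sim.argmax(axis=1), sim

def token_similarity_weights(text_token_embs, image_token_embs):
    """TSL-style weights (sketch): softmax over image tokens for each
    text token, based on token-level cosine similarity. The paper uses
    these similarities to propagate local attention; here we only show
    the weighting step."""
    T = l2_normalize(text_token_embs)
    I = l2_normalize(image_token_embs)
    sim = T @ I.T                                   # (n_text, n_image)
    e = np.exp(sim - sim.max(axis=1, keepdims=True))  # stable softmax
    return e / e.sum(axis=1, keepdims=True)

# Toy example: two regions, two sentences, each sentence closest to one region.
matches, sim = local_image_sentence_matching(
    [[1.0, 0.0], [0.0, 1.0]],          # region embeddings
    [[0.9, 0.1], [0.2, 0.8]],          # sentence embeddings
)
# matches → [0, 1]: sentence 0 pairs with region 0, sentence 1 with region 1
```

In the full method, the matched (region, sentence) pairs would feed a local contrastive objective alongside CLIP's global image-text loss; the sketch only covers the matching and weighting steps.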