🤖 AI Summary
Existing cross-modal retrieval methods rely heavily on supervised annotations or explicit modality mappings, which hinders the reuse of pretrained encoders and incurs high annotation costs. To address this, we propose FemmIR, a framework that applies attribute-aware graph edit distance to weakly supervised multimodal retrieval. FemmIR uses only user-provided multimodal query examples as supervision, so it needs neither similarity labels nor model fine-tuning. It reuses pretrained encoders from large language models and vision tasks, represents sample properties and relationship constraints as graphs, and scores cross-modal matches with a multi-level interaction score. Evaluated on a missing-person use case with our newly curated MuQNOL dataset, FemmIR performs comparably to similar retrieval systems on both exact and approximate retrieval, while avoiding the annotation overhead of supervised approaches.
📝 Abstract
Existing multimedia retrieval models either rely on creating a common subspace with modality-specific representation models or require schema mapping among modalities to measure similarities among multimedia data. Our goal is to avoid the annotation overhead incurred by treating retrieval as a supervised classification task, and to reuse the pretrained encoders from large language models and vision tasks. We propose "FemmIR", a framework that retrieves multimodal results relevant to information needs expressed as multimodal queries by example, without any similarity labels. Such retrieval is necessary for real-world applications where data annotations are scarce and a common framework must deliver satisfactory performance across applications without fine-tuning. We curate a new dataset called MuQNOL for benchmarking progress on this task. Our technique is based on weak supervision introduced through edit distance between samples: graph edit distance can be modified to account for the cost of replacing a data sample in terms of its properties, and relevance can be measured through the implicit signal given by the edit cost between objects. Unlike metric learning or encoding networks, FemmIR reuses the high-level properties and preserves property-value and relationship constraints through a multi-level interaction score between data samples and the query example provided by the user. We empirically evaluate FemmIR on a missing-person use case with MuQNOL. FemmIR performs comparably to similar retrieval systems in delivering on-demand retrieval results with exact and approximate similarities while using the property identifiers that already exist in the system.
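To make the edit-cost idea concrete, the following is a minimal Python sketch of ranking samples by a property-level edit cost against a query example. It is an illustration under stated assumptions, not FemmIR's actual scoring: it covers only the substitution of property values on a single sample and ignores relationship edges and the multi-level interaction score. The property names (`gender`, `top_color`, `bottom_color`), the sample identifiers, the cost values, and the functions `property_edit_cost` and `rank_by_edit_cost` are all hypothetical.

```python
from typing import Dict, List, Tuple

# A sample is described by high-level properties that existing
# extractors are assumed to produce (hypothetical names and values).
Properties = Dict[str, str]


def property_edit_cost(
    query: Properties,
    candidate: Properties,
    substitute_cost: float = 1.0,  # assumed cost of a mismatched value
    missing_cost: float = 0.5,     # assumed cost of an unextracted property
) -> float:
    """Cost of editing the candidate so it satisfies every query property."""
    cost = 0.0
    for name, wanted in query.items():
        if name not in candidate:
            cost += missing_cost      # property unknown for this sample
        elif candidate[name] != wanted:
            cost += substitute_cost   # property present but value differs
    return cost


def rank_by_edit_cost(
    query: Properties, candidates: Dict[str, Properties]
) -> List[Tuple[str, float]]:
    """Order candidates by ascending edit cost (lower cost = more relevant)."""
    scored = [
        (sample_id, property_edit_cost(query, props))
        for sample_id, props in candidates.items()
    ]
    return sorted(scored, key=lambda pair: pair[1])


# Query by example: properties taken from the example the user provides.
query = {"gender": "female", "top_color": "red", "bottom_color": "blue"}

# Candidates from different modalities share the same property space.
candidates = {
    "image_42": {"gender": "male", "top_color": "red", "bottom_color": "black"},
    "text_03": {"gender": "female", "top_color": "red"},
    "image_17": {"gender": "female", "top_color": "red", "bottom_color": "blue"},
}

ranked = rank_by_edit_cost(query, candidates)
for sample_id, cost in ranked:
    print(f"{sample_id}: edit cost {cost}")
```

In this sketch a cost of zero corresponds to an exact match, and small non-zero costs correspond to approximate matches. No similarity labels are involved; the only supervision signal is the query example itself.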