🤖 AI Summary
Traditional multimodal relation extraction (MRE) relies on discrete classification paradigms that neglect structural constraints (such as entity types and relative positions) and struggle to capture fine-grained semantic relations. To address these limitations, we propose a novel *retrieval-based relation extraction* paradigm that reformulates relation identification as a semantic matching task driven by natural language descriptions. Our approach incorporates entity types and relative positional information as explicit structural constraints, leverages large language models to generate fine-grained, descriptive relation texts, and employs a multimodal encoder with contrastive learning to achieve cross-modal semantic alignment. This design significantly improves the model's ability to handle complex and ambiguous relations, while also enhancing robustness and interpretability. Extensive experiments demonstrate state-of-the-art performance on the MNRE and MORE benchmarks, consistently outperforming existing classification-based MRE methods across all evaluation metrics.
📝 Abstract
Relation extraction (RE) aims to identify semantic relations between entities in unstructured text. Although recent work extends traditional RE to multimodal scenarios, most approaches still adopt classification-based paradigms with fused multimodal features, representing relations as discrete labels. This paradigm has two significant limitations: (1) it overlooks structural constraints such as entity types and positional cues, and (2) it lacks the semantic expressiveness needed for fine-grained relation understanding. We propose **R**etrieval **O**ver **C**lassification (ROC), a novel framework that reformulates multimodal RE as a retrieval task driven by relation semantics. ROC integrates entity type and positional information through a multimodal encoder, expands relation labels into natural language descriptions using a large language model, and aligns entity-relation pairs via semantic similarity-based contrastive learning. Experiments show that our method achieves state-of-the-art performance on the benchmark datasets MNRE and MORE and exhibits stronger robustness and interpretability.
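The core retrieval-over-classification idea can be sketched as follows. This is a minimal illustration, not the authors' implementation: the toy embedding vectors, the function names, and the InfoNCE-style loss are assumptions standing in for the paper's trained multimodal encoder and LLM-generated relation descriptions. The principle shown is the one the abstract describes: relations are retrieved by semantic similarity between an entity-pair representation and relation-description embeddings, and training pulls matched pairs together via a contrastive objective.

```python
# Hypothetical sketch of retrieval-based relation extraction.
# In the real system, `query_vec` would come from a multimodal encoder over
# the entity pair (with type and positional constraints), and each value in
# `relation_index` would embed an LLM-generated relation description.
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def retrieve_relation(query_vec, relation_index):
    """Retrieval step: return the relation whose description embedding
    is most similar to the entity-pair embedding (instead of a softmax
    over discrete class logits)."""
    return max(relation_index, key=lambda name: cosine(query_vec, relation_index[name]))

def contrastive_loss(query_vec, relation_index, positive, temperature=0.07):
    """InfoNCE-style contrastive loss (an assumed stand-in for the paper's
    objective): pull the query toward its gold relation description and push
    it away from the other descriptions in the index."""
    scores = {n: cosine(query_vec, v) / temperature for n, v in relation_index.items()}
    z = sum(math.exp(s) for s in scores.values())
    return -math.log(math.exp(scores[positive]) / z)

# Toy example with hand-made 3-d embeddings.
relation_index = {
    "member_of": [0.9, 0.1, 0.0],
    "spouse_of": [0.0, 0.9, 0.1],
}
query = [0.8, 0.2, 0.0]  # entity-pair embedding closest to "member_of"
predicted = retrieve_relation(query, relation_index)
loss = contrastive_loss(query, relation_index, positive="member_of")
```

Because prediction is a nearest-description lookup rather than a fixed classification head, the label set can be extended by simply embedding new relation descriptions, which is one source of the flexibility and interpretability the abstract claims.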