๐ค AI Summary
To address the challenges of cross-modal source retrieval and ambiguous reasoning paths in multimodal multi-hop question answering, this paper proposes a Semantic Graph Reasoning Network (SGRN) driven by syntactic sentence structure. SGRN constructs a heterogeneous semantic graph to model fine-grained inter-modal associations between text and images, replacing computationally expensive cross-modal Transformers with a lightweight graph message-passing mechanism to explicitly learn reasoning paths over multi-source supporting facts. Crucially, it is the first work to demonstrate that graph topology can serve as task-relevant prior knowledge to effectively guide multimodal multi-hop retrieval. On the WebQA benchmark, SGRN achieves a 4.6% improvement in retrieval F1 over Transformer-based baselines, while reducing model parameters by 37% and accelerating inference by 2.1รโdemonstrating superior efficiency and generalizability in large-scale retrieval scenarios.
๐ Abstract
This work deals with the challenge of learning and reasoning over multi-modal multi-hop question answering (QA). We propose a graph reasoning network based on the semantic structure of the sentences to learn multi-source reasoning paths and find the supporting facts across both image and text modalities for answering the question. In this paper, we investigate the importance of graph structure for multi-modal multi-hop question answering. Our analysis is centered on WebQA. We construct a strong baseline model, that finds relevant sources using a pairwise classification task. We establish that, with the proper use of feature representations from pre-trained models, graph structure helps in improving multi-modal multi-hop question answering. We point out that both graph structure and adjacency matrix are task-related prior knowledge, and graph structure can be leveraged to improve the retrieval performance for the task. Experiments and visualized analysis demonstrate that message propagation over graph networks or the entire graph structure can replace massive multimodal transformers with token-wise cross-attention. We demonstrated the applicability of our method and show a performance gain of extbf{4.6$%$} retrieval F1score over the transformer baselines, despite being a very light model. We further demonstrated the applicability of our model to a large scale retrieval setting.