SVG Decomposition for Enhancing Large Multimodal Models Visualization Comprehension: A Study with Floor Plans

📅 2025-11-05

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Large multimodal models (LMMs) exhibit significant limitations in spatial reasoning over structured visualizations such as floor plans. Method: We use scalable vector graphics (SVG) as a structured decomposition that explicitly encodes geometric, topological, and semantic information, combine it with rasterized input (PNG), and evaluate the effect on three LMMs: GPT-4o, Claude 3.7 Sonnet, and Llama 3.2 11B Vision Instruct. Results: Combined SVG+PNG input improves basic spatial understanding (e.g., room identification) but degrades performance on deeper reasoning tasks such as path planning, which points to weaknesses in how current decomposition strategies couple levels of the visual hierarchy and balance the information given to the model. The study offers empirical evidence on SVG as an interpretable intermediate representation for LMM-based spatial reasoning, delineating both its potential and its limits, and provides an evidence base for more robust, accessibility-oriented floor plan understanding.
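
To make the idea of SVG as a structured decomposition concrete, the following is a minimal sketch, not the authors' pipeline. It assumes a hypothetical floor plan in which each room is a single `<rect>` whose `id` serves as the room label; real floor plan SVGs are likely to use paths, groups, and text elements instead. The sketch pulls out the semantic label and the geometry of each room using only the Python standard library.

```python
import xml.etree.ElementTree as ET

# SVG elements live in this XML namespace, so tag names must be qualified.
SVG_NS = "{http://www.w3.org/2000/svg}"

# Hypothetical floor plan (an assumption for illustration only):
# one <rect> per room, with the room label stored in the id attribute.
FLOOR_PLAN_SVG = """<svg xmlns="http://www.w3.org/2000/svg" width="200" height="100">
  <rect id="kitchen" x="0" y="0" width="100" height="100"/>
  <rect id="bedroom" x="100" y="0" width="100" height="60"/>
  <rect id="bath" x="100" y="60" width="100" height="40"/>
</svg>"""


def decompose(svg_text):
    """Return one record per room with its label, position, size, and area."""
    root = ET.fromstring(svg_text)
    rooms = []
    for rect in root.iter(SVG_NS + "rect"):
        width = float(rect.get("width"))
        height = float(rect.get("height"))
        rooms.append({
            "label": rect.get("id"),
            "x": float(rect.get("x")),
            "y": float(rect.get("y")),
            "width": width,
            "height": height,
            "area": width * height,
        })
    return rooms


rooms = decompose(FLOOR_PLAN_SVG)
for room in rooms:
    print(room["label"], room["area"])
```

A record list like this is the kind of explicit geometric and semantic information that a raster image leaves implicit, which is why a room identification question may become easier when the SVG accompanies the PNG.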

📝 Abstract

Large multimodal models (LMMs) are increasingly capable of interpreting visualizations, yet they continue to struggle with spatial reasoning. One proposed strategy is decomposition, which breaks down complex visualizations into structured components. In this work, we examine the efficacy of scalable vector graphics (SVGs) as a decomposition strategy for improving LMMs' performance on floor plan comprehension. Floor plans serve as a valuable testbed because they combine geometry, topology, and semantics, and their reliable comprehension has real-world applications, such as accessibility for blind and low-vision individuals. We conducted an exploratory study with three LMMs (GPT-4o, Claude 3.7 Sonnet, and Llama 3.2 11B Vision Instruct) across 75 floor plans. Results show that combining SVG with raster input (SVG+PNG) improves performance on spatial understanding tasks but often hinders spatial reasoning, particularly in pathfinding. These findings highlight both the promise and limitations of decomposition as a strategy for advancing spatial visualization comprehension.
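
The pathfinding task mentioned in the abstract can be illustrated with a small sketch of how a reference answer might be computed from a floor plan's topology. This is an assumption for illustration, not the study's evaluation code: the `DOORS` adjacency map and the `shortest_route` helper are hypothetical, and the sketch uses a plain breadth-first search over rooms connected by doors.

```python
from collections import deque

# Hypothetical topology (an assumption): which rooms share a door.
DOORS = {
    "entry": ["hall"],
    "hall": ["entry", "kitchen", "bedroom"],
    "kitchen": ["hall"],
    "bedroom": ["hall", "bath"],
    "bath": ["bedroom"],
}


def shortest_route(doors, start, goal):
    """Breadth-first search: return the route with the fewest door crossings,
    or None if the goal cannot be reached from the start."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbor in doors.get(path[-1], []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None


route = shortest_route(DOORS, "entry", "bath")
print(route)
```

Answering such a question requires chaining several adjacency facts together, which is a different demand from identifying a single room and may help explain why the two task types respond differently to the added SVG input.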

Problem

Research questions and friction points this paper is trying to address.

Improving spatial reasoning in large multimodal models

Evaluating SVG decomposition for floor plan comprehension

Addressing limitations in spatial visualization understanding

Innovation

Methods, ideas, or system contributions that make the work stand out.

SVG decomposition for complex visualization components

Combining SVG with raster input for spatial tasks

Evaluating decomposition on floor plans with LMMs

🔎 Similar Papers

Visually Descriptive Language Model for Vector Graphics Reasoning