A-SCoRe: Attention-based Scene Coordinate Regression for wide-ranging scenarios

📅 2025-03-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing visual localization methods rely on feature matching, incurring high computational and memory overhead; while scene coordinate regression (SCR) alleviates storage demands, mainstream CNN-based architectures neglect inter-pixel spatial relationships. This paper proposes the first descriptor-level graph attention mechanism tailored for SCR, building an end-to-end multimodal SCR framework based on Vision Transformers that supports diverse geometric inputs—including dense depth maps and sparse SLAM/SfM outputs. Our design explicitly models pixel-wise spatial dependencies, significantly improving cross-scene generalization and deployment flexibility. Evaluated on multiple benchmarks, our method achieves localization accuracy comparable to state-of-the-art matching-based approaches, while reducing model parameters by 42% and accelerating inference by 3.1×. We publicly release both source code and pre-trained models.

📝 Abstract
Visual localization is a crucial component of many robotic and vision systems. While state-of-the-art methods that rely on feature matching have proven accurate for visual localization, their storage and compute requirements are a burden. Scene coordinate regression (SCR) is an alternative approach that removes the storage barrier by learning to map 2D pixels to 3D scene coordinates. Most popular SCR methods use a Convolutional Neural Network (CNN) to extract 2D descriptors, which we argue misses the spatial relationships between pixels. Inspired by the success of the vision transformer architecture, we present a new SCR architecture, called A-SCoRe, an attention-based model that leverages attention at the descriptor-map level to produce meaningful, semantically rich 2D descriptors. Since the operation is performed on the descriptor map, our model can work with multiple data modalities, whether dense or sparse, from depth maps and SLAM to Structure-from-Motion (SfM). This versatility allows A-SCoRe to operate in different kinds of environments and conditions and to achieve the level of flexibility that is important for mobile robots. Results show that our method achieves performance comparable to state-of-the-art methods on multiple benchmarks while being lightweight and much more flexible. Code and pre-trained models are publicly available in our repository: https://github.com/ais-lab/A-SCoRe.
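To make the idea of attention at the descriptor-map level concrete, here is a minimal NumPy sketch. It is not the authors' architecture: the single attention head, the residual connection, the linear regression head, the random weights, and every name (`self_attention`, `regress_scene_coords`, `make_params`) are assumptions chosen for illustration. It only shows that attention over a set of N descriptors lets each descriptor draw on all the others before a 3D coordinate is regressed, and that the same code accepts any N, whether the descriptors come from a dense map or a sparse set of keypoints.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(desc, w_q, w_k, w_v):
    """Single-head self-attention over N descriptors of shape (N, D).

    Returns the refined descriptors (with a residual connection) and
    the (N, N) attention weights.
    """
    q, k, v = desc @ w_q, desc @ w_k, desc @ w_v
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
    return desc + weights @ v, weights


def regress_scene_coords(desc, params):
    """Map (N, D) descriptors to (N, 3) scene coordinates (hypothetical head)."""
    refined, weights = self_attention(
        desc, params["w_q"], params["w_k"], params["w_v"]
    )
    coords = refined @ params["w_out"] + params["b_out"]
    return coords, weights


def make_params(dim, rng):
    """Random, untrained weights; a real model would learn these."""
    return {
        "w_q": rng.normal(size=(dim, dim)) * 0.1,
        "w_k": rng.normal(size=(dim, dim)) * 0.1,
        "w_v": rng.normal(size=(dim, dim)) * 0.1,
        "w_out": rng.normal(size=(dim, 3)) * 0.1,
        "b_out": np.zeros(3),
    }


rng = np.random.default_rng(0)
D = 8
params = make_params(D, rng)

# A sparse input: 5 keypoint descriptors (e.g. from SfM points).
sparse_desc = rng.normal(size=(5, D))
sparse_coords, sparse_weights = regress_scene_coords(sparse_desc, params)

# A denser input: a 4x6 descriptor map flattened to 24 descriptors.
dense_desc = rng.normal(size=(4 * 6, D))
dense_coords, dense_weights = regress_scene_coords(dense_desc, params)
```

Because the attention operates on a flattened set of descriptors and not on a fixed image grid, the same weights handle both inputs above. In a full localization pipeline the regressed 2D-3D correspondences would then typically be passed to a pose solver, which this sketch leaves out.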
Problem

Research questions and friction points this paper is trying to address.

Reduces storage and compute burden in visual localization.
Improves spatial relationship understanding in scene coordinate regression.
Enhances flexibility for mobile robots in diverse environments.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Attention-based model for scene coordinate regression
Leverages attention on descriptor map level
Works with multiple data modalities flexibly
Huy-Hoang Bui
Graduate School of Information Science and Engineering, Ritsumeikan University, Osaka, Japan
B. Bui
Graduate School of Information Science and Engineering, Ritsumeikan University, Osaka, Japan
Quang-Vinh Tran
Graduate School of Information Science and Engineering, Ritsumeikan University, Osaka, Japan
Yasuyuki Fujii
Ritsumeikan University
Robotics, Water surface vehicle, Water quality monitoring
Joo-Ho Lee
Professor, Ritsumeikan University
Intelligent Space, System Integration, Machine Learning