TALON: Token-Aligned Lightweight Adapters for 6-DoF Spacecraft Pose Estimation

📅 2026-05-29
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the limitations of existing monocular 6-DoF spacecraft pose estimation methods, which often neglect temporal information or require either fine-tuning backbone networks—risking catastrophic forgetting—or incorporating computationally expensive optical flow modules. To overcome these challenges, the authors propose TALON, a framework that integrates lightweight spatio-temporal 3D adapters before the frozen self-attention layers of a Vision Transformer (ViT) backbone. TALON further introduces a geometry-aware patch-token alignment loss based on keypoint structures and a prototype-conditioned KL divergence constraint to spatially calibrate intermediate feature activations. With less than 5% additional parameters, the method reduces pose error by 50% on SPADES, improves ADD-0.1d accuracy by 21.8% on SwissCube, and achieves zero-shot sim-to-real transfer on the SPARK dataset, lowering error by a factor of 4.7.
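The pre-attention placement described above can be illustrated with a small NumPy sketch. It is not the authors' implementation. The bottleneck layout (down-projection, depthwise 3D convolution, up-projection, residual), the single-head attention, and every name here (`depthwise_conv3d`, `adapter`, `frozen_self_attention`, `block`) are assumptions made for illustration. Only the ordering, with the adapter applied before frozen self-attention, comes from the summary.

```python
import numpy as np

def depthwise_conv3d(x, kernel):
    """Depthwise 3D convolution with zero 'same' padding.

    x:      (T, H, W, C) token grid, T frames of H x W patch tokens
    kernel: (kt, kh, kw, C) one small filter per channel, odd sizes
    """
    kt, kh, kw, _ = kernel.shape
    pt, ph, pw = kt // 2, kh // 2, kw // 2
    padded = np.pad(x, ((pt, pt), (ph, ph), (pw, pw), (0, 0)))
    T, H, W, _ = x.shape
    out = np.zeros_like(x)
    for dt in range(kt):
        for dh in range(kh):
            for dw in range(kw):
                out += padded[dt:dt + T, dh:dh + H, dw:dw + W] * kernel[dt, dh, dw]
    return out

def adapter(tokens, w_down, kernel, w_up):
    """Hypothetical bottleneck adapter (the only trainable part in this sketch)."""
    z = tokens @ w_down              # (T, H, W, r): project to a small width r
    z = depthwise_conv3d(z, kernel)  # mix information across time and space
    return tokens + z @ w_up         # residual path preserves the original tokens

def frozen_self_attention(x, w_q, w_k, w_v):
    """Single-head attention applied per frame; its weights are never updated."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v                      # each (T, N, D)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)             # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)
    return attn @ v

def block(tokens, adapter_params, attn_params):
    """Adapter first, then frozen attention over the temporally enriched tokens."""
    T, H, W, D = tokens.shape
    enriched = adapter(tokens, *adapter_params)
    out = frozen_self_attention(enriched.reshape(T, H * W, D), *attn_params)
    return out.reshape(T, H, W, D)
```

In this sketch the attention itself only looks within a frame, so any cross-frame information reaches it through the adapter. Initialising `w_up` to zeros (a common adapter convention, assumed here rather than taken from the paper) makes the block start out identical to the frozen backbone.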
📝 Abstract
Monocular 6-DoF spacecraft pose estimation methods predominantly process individual frames, discarding the temporal information present in image sequences acquired during spacecraft manoeuvres. The few temporal approaches that exist require either full backbone fine-tuning, which risks catastrophic forgetting, or auxiliary optical flow networks, which increase computational cost. We propose TALON (Token-Aligned Lightweight adapters for Orbital Navigation): spatiotemporal 3D adapters injected before the self-attention layers of a frozen Vision Transformer (ViT), combined with a patch-token alignment loss that geometrically grounds the adapted features in keypoint structure through a prototype-conditioned KL-divergence objective. Pre-attention placement allows the frozen attention to reason over temporally enriched tokens, achieving stronger performance with a single adapter per block than post-attention alternatives. The alignment loss shapes the intermediate representations so that each keypoint induces a spatially precise activation in the token field, while the framework adds fewer than 5% additional parameters to the frozen backbone. On the SPADES dataset, TALON reduces pose error by 50% relative to the prior state of the art, and on the SwissCube dataset it surpasses the prior best by 21.8% in ADD-0.1d accuracy. In zero-shot sim-to-real evaluation on SPARK real data, TALON reduces pose error by a factor of 4.7, and ablations characterise the role of adapter depth across in-domain and cross-domain settings.
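One possible reading of the patch-token alignment loss is sketched below in NumPy. The abstract only states that the objective is a prototype-conditioned KL divergence that makes each keypoint induce a spatially precise activation in the token field. The Gaussian target, the cosine similarity, the temperature `tau`, the KL direction, and the function names (`keypoint_target`, `alignment_loss`) are all assumptions for illustration, not the paper's formulation.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def keypoint_target(kp_xy, grid_hw, patch, sigma=1.0):
    """Assumed target: a Gaussian over the patch grid centred on the keypoint.

    kp_xy is the projected keypoint in pixels (x, y); patch is the patch size in pixels.
    """
    H, W = grid_hw
    ys, xs = np.mgrid[0:H, 0:W]
    # Patch i has its centre at pixel (i + 0.5) * patch, hence the -0.5 shift.
    cx, cy = kp_xy[0] / patch - 0.5, kp_xy[1] / patch - 0.5
    d2 = (xs - cx) ** 2 + (ys - cy) ** 2
    g = np.exp(-d2 / (2 * sigma ** 2)).ravel()
    return g / g.sum()

def alignment_loss(tokens, prototypes, keypoints_xy, grid_hw, patch, tau=0.1, eps=1e-8):
    """Mean KL(target || predicted) over keypoints.

    tokens:       (N, D) intermediate patch tokens, N = H * W
    prototypes:   (K, D) one hypothetical prototype vector per keypoint
    keypoints_xy: (K, 2) projected keypoint locations in pixels
    """
    t = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    loss = 0.0
    for k in range(len(p)):
        pred = softmax(t @ p[k] / tau)  # where this prototype "fires" on the grid
        tgt = keypoint_target(keypoints_xy[k], grid_hw, patch)
        loss += np.sum(tgt * (np.log(tgt + eps) - np.log(pred + eps)))
    return loss / len(p)
```

Under these assumptions the loss is small when the token most similar to a keypoint's prototype sits on the patch containing that keypoint, and it grows as the activation peak drifts away from it.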
Problem

Research questions and friction points this paper is trying to address.

6-DoF pose estimation
temporal information
monocular vision
spacecraft navigation
catastrophic forgetting
Innovation

Methods, ideas, or system contributions that make the work stand out.

Token-Aligned Adapters
6-DoF Pose Estimation
Frozen Vision Transformer
Spatiotemporal Modeling
Patch-Token Alignment