🤖 AI Summary
This work addresses the limitations of existing monocular 6-DoF spacecraft pose estimation methods, most of which ignore temporal information. The few that use it either fine-tune the backbone network, risking catastrophic forgetting, or add a computationally expensive optical flow module. To overcome these limitations, the authors propose TALON, a framework that inserts lightweight spatio-temporal 3D adapters before the self-attention layers of a frozen Vision Transformer (ViT) backbone. TALON also introduces a patch-token alignment loss that grounds the adapted features in keypoint structure through a prototype-conditioned KL-divergence objective, so that each keypoint produces a spatially precise activation in the intermediate features. With less than 5% additional parameters, the method reduces pose error by 50% over the prior state of the art on SPADES, improves ADD-0.1d accuracy by 21.8% over the prior best on SwissCube, and, in zero-shot sim-to-real evaluation on SPARK real data, lowers pose error by a factor of 4.7.
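For readers unfamiliar with the ADD-0.1d figure quoted above, the following is a minimal sketch of how the metric is commonly computed: the mean distance between model points transformed by the predicted and ground-truth poses, counted as correct when it falls below 10% of the model diameter. The function names and the unit-cube example are illustrative assumptions, not code from the TALON authors, and the exact evaluation protocol used on SwissCube may differ in detail.

```python
import numpy as np

def add_error(R_pred, t_pred, R_gt, t_gt, model_points):
    """Mean distance between model points under the predicted and true poses."""
    pred = model_points @ R_pred.T + t_pred
    gt = model_points @ R_gt.T + t_gt
    return float(np.linalg.norm(pred - gt, axis=1).mean())

def model_diameter(model_points):
    """Largest pairwise distance between model points (O(N^2) memory)."""
    diffs = model_points[:, None, :] - model_points[None, :, :]
    return float(np.linalg.norm(diffs, axis=-1).max())

def add_accuracy(errors, diameter, threshold=0.1):
    """Fraction of samples whose ADD error is below threshold * diameter."""
    errors = np.asarray(errors, dtype=float)
    return float((errors < threshold * diameter).mean())

# Hypothetical example: the eight corners of a unit cube as the 3D model.
cube = np.array([[x, y, z] for x in (0, 1) for y in (0, 1) for z in (0, 1)],
                dtype=float)
identity = np.eye(3)
err = add_error(identity, np.array([0.05, 0.0, 0.0]), identity, np.zeros(3), cube)
```

A pure translation offset shifts every model point by the same amount, so the ADD error in this example equals the length of the translation error.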
📝 Abstract
Monocular 6-DoF spacecraft pose estimation methods predominantly process individual frames, discarding the temporal information present in an image sequence acquired during spacecraft manoeuvres. The few temporal approaches require either full backbone fine-tuning or auxiliary optical flow networks, risking catastrophic forgetting or increasing computational cost, respectively. We propose TALON (Token-Aligned Lightweight adapters for Orbital Navigation): spatiotemporal 3D adapters injected before the self-attention layers of a frozen vision transformer (ViT), combined with a patch-token alignment loss that geometrically grounds the adapted features to keypoint structure through a prototype-conditioned KL-divergence objective. Pre-attention placement allows the frozen attention to reason over temporally enriched tokens, achieving stronger performance with a single adapter per block than post-attention alternatives. The alignment loss shapes the intermediate representations so that each keypoint induces a spatially precise activation in the token field, while the framework adds less than 5% additional parameters relative to the frozen backbone. On the SPADES dataset, TALON reduces the pose error by 50% over the prior state of the art, and on the SwissCube dataset it surpasses the prior best by 21.8% in ADD-0.1d accuracy. In zero-shot sim-to-real evaluation on SPARK real data, TALON reduces pose error by a factor of 4.7, and ablations characterise the role of adapter depth across in-domain and cross-domain settings.
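The pre-attention adapter placement described above can be illustrated with a small NumPy sketch. It assumes a bottleneck design (down-projection, depthwise 3D convolution over time and the patch grid, up-projection, residual connection) with a zero-initialised up-projection; the abstract does not specify the adapter's internals, so the layer layout, the `tanh` nonlinearity, the kernel size, and all names here (`SpatioTemporalAdapter`, `pre_attention_block`, `frozen_attention`) are assumptions for illustration rather than the authors' implementation. Layer normalisation and the class token are omitted for brevity.

```python
import numpy as np

def depthwise_conv3d(x, kernel):
    """Depthwise 3D convolution with zero 'same' padding.

    x: (T, H, W, C) token grid; kernel: (kt, kh, kw, C) with odd sizes.
    """
    kt, kh, kw, _ = kernel.shape
    pt, ph, pw = kt // 2, kh // 2, kw // 2
    padded = np.pad(x, ((pt, pt), (ph, ph), (pw, pw), (0, 0)))
    T, H, W, _ = x.shape
    out = np.zeros_like(x)
    for dt in range(kt):
        for dh in range(kh):
            for dw in range(kw):
                # Each channel is filtered independently (depthwise).
                out += padded[dt:dt + T, dh:dh + H, dw:dw + W, :] * kernel[dt, dh, dw]
    return out

class SpatioTemporalAdapter:
    """Hypothetical bottleneck adapter: down-project, 3D conv, up-project."""

    def __init__(self, dim, bottleneck, kernel_size=(3, 3, 3), seed=0):
        rng = np.random.default_rng(seed)
        self.w_down = rng.normal(0.0, 0.02, (dim, bottleneck))
        self.kernel = rng.normal(0.0, 0.02, (*kernel_size, bottleneck))
        # Zero init (an assumption): the adapter starts as an identity map,
        # so the frozen backbone's behaviour is unchanged before training.
        self.w_up = np.zeros((bottleneck, dim))

    def __call__(self, tokens):
        z = tokens @ self.w_down                  # (T, H, W, bottleneck)
        z = np.tanh(depthwise_conv3d(z, self.kernel))
        return tokens + z @ self.w_up             # residual connection

def pre_attention_block(tokens, adapter, frozen_attention):
    """Apply the adapter first, then per-frame frozen attention."""
    T, H, W, D = tokens.shape
    seq = adapter(tokens).reshape(T, H * W, D)
    return seq + frozen_attention(seq)

# Hypothetical shapes: 4 frames, a 3x3 patch grid, 8-dimensional tokens.
rng = np.random.default_rng(1)
tokens = rng.normal(size=(4, 3, 3, 8))
adapter = SpatioTemporalAdapter(dim=8, bottleneck=2)
untrained_out = adapter(tokens)
```

Because the convolution spans the time axis, each token fed to the frozen attention already carries information from neighbouring frames, which is the intuition behind placing the adapter before attention rather than after it.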