🤖 AI Summary
Latent Action Models (LAMs) are highly susceptible to irrelevant distractions such as background clutter, which hinders effective disentanglement of action-relevant features and leads to suboptimal latent action spaces. To address this limitation, this work proposes MaskLAM, a lightweight approach that leverages a pretrained vision-based segmentation model to generate foreground object masks and reweights the LAM's reconstruction loss accordingly. Without altering the underlying model architecture, this loss-weighting mechanism explicitly encourages the model to focus on action-relevant regions during training. Empirical results demonstrate that MaskLAM substantially improves robustness: in MuJoCo tasks with distracting backgrounds, it achieves up to a 4× increase in cumulative reward and a 3× improvement in latent action quality as measured by linear probe evaluation.
📄 Abstract
Latent Action Models (LAMs) learn to extract action-relevant representations solely from raw observations, enabling reinforcement learning from unlabelled videos and significantly scaling available training data. However, LAMs face a critical challenge in disentangling action-relevant features from action-correlated noise (e.g., background motion). Failing to filter these distractors causes LAMs to capture spurious correlations and build sub-optimal latent action spaces. In this paper, we introduce MaskLAM -- a lightweight modification to LAM training to mitigate this issue by incorporating visual agent segmentation. MaskLAM utilises segmentation masks from pretrained foundation models to weight the LAM reconstruction loss, thereby prioritising salient information over background elements while requiring no architectural modifications. We demonstrate the effectiveness of our method on continuous-control MuJoCo tasks, modified with action-correlated background noise. Our approach yields up to a 4x increase in accrued rewards compared to standard baselines and a 3x improvement in the latent action quality, as evidenced by linear probe evaluation.
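The core mechanism described above, reweighting the reconstruction loss with a foreground segmentation mask, can be sketched in a few lines. This is a minimal NumPy illustration, not the authors' implementation: the function name, the choice of squared error, and the `fg_weight`/`bg_weight` parameters are assumptions, and in practice the mask would come from a pretrained segmentation model rather than be hand-specified.

```python
import numpy as np

def masked_reconstruction_loss(pred, target, fg_mask,
                               fg_weight=1.0, bg_weight=0.1):
    """Hypothetical sketch of a mask-weighted reconstruction loss.

    pred, target: (H, W) reconstructed and ground-truth frames
    fg_mask:      (H, W) binary foreground mask from a pretrained segmenter
    """
    # Per-pixel squared reconstruction error.
    per_pixel = (pred - target) ** 2
    # Foreground pixels keep full weight; background pixels are down-weighted,
    # so action-correlated background motion contributes little to the loss.
    weights = fg_weight * fg_mask + bg_weight * (1.0 - fg_mask)
    # Weighted mean over all pixels.
    return float((weights * per_pixel).sum() / weights.sum())
```

With `bg_weight` well below `fg_weight`, gradients concentrate on the agent's pixels, which is the stated goal of prioritising salient information without any architectural change.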