AI Summary
This work addresses a limitation of self-supervised dynamical representation learning: existing contrastive predictive objectives can mistake noise that varies slowly within a trajectory for genuine dynamical signal, producing noise-dominated representations and degraded downstream performance. The authors attribute the problem to the objective itself, specifically to sampling negatives across trajectories, and propose a general remedy: sample negative examples from within the same trajectory. Slow noise then cannot distinguish frames within a trajectory, so the predictive shortcut disappears and the encoder is pushed to encode the variables that actually govern the system's evolution. Experiments with a SimCLR-style JEPA on a synthetic moving-dot dataset and with DySIB on movies of a rigid-body pendulum indicate that the approach separates slow noise from the true dynamics and yields representations that improve with trajectory length, even under strong slow noise.
Abstract
Self-supervised methods that learn representations and predict dynamics fully in the latent space, such as JEPA, have been shown to confuse slowly varying noise with the dynamical signals they aim to capture. Specifically, when noise features remain approximately constant within each trajectory, contrastive predictive objectives preferentially encode these features instead of the true latent variables governing the system. The learned representation then becomes dominated by trajectory-specific noise, so downstream performance degrades with noise strength and does not improve even as the number and duration of training trajectories increase. We argue that this failure is a property of the objective itself, shared by a long line of contrastive predictive objectives that sample negatives across trajectories. To illustrate this generality, we study the failure mode and its remedy in two settings: a standard SimCLR-style JEPA on a synthetic moving-dot dataset, and DySIB, a recently introduced method designed for extracting physically interpretable representations of dynamics, on movies of a rigid-body pendulum. When negatives are instead sampled within a single trajectory, the slow noise can no longer distinguish frames within that trajectory, removing the predictive shortcut. Training one encoder simultaneously on many such trajectories then forces it to encode the variables relevant for the dynamics, with longer trajectories yielding better representations even for strong slow noise. Our results point toward principles for designing contrastive predictive objectives in dynamical representation learning, especially for physical systems with noisy experimental observations.
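The difference between the two negative-sampling schemes can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `info_nce`, the temperature value, the toy "noise-only" and "dynamics-only" embeddings, and the assumption of a perfect predictor (prediction equal to target) are all hypothetical choices made for illustration. The sketch computes an InfoNCE-style loss in which the candidates for each anchor are drawn either across trajectories (same time index, all trajectories) or within the anchor's own trajectory (all time steps).

```python
import numpy as np


def info_nce(pred, target, mode, temperature=0.1):
    """InfoNCE-style loss for arrays of shape (N trajectories, T steps, d dims).

    mode="across": candidates for anchor (n, t) are target[m, t] for every m.
    mode="within": candidates for anchor (n, t) are target[n, s] for every s.
    The positive is always the candidate with the same index as the anchor.
    """
    if mode == "across":
        # logits[t, n, m] = <pred[n, t], target[m, t]>
        logits = np.einsum("ntd,mtd->tnm", pred, target) / temperature
    elif mode == "within":
        # logits[n, t, s] = <pred[n, t], target[n, s]>
        logits = np.einsum("ntd,nsd->nts", pred, target) / temperature
    else:
        raise ValueError("mode must be 'across' or 'within'")
    # Numerically stable log-softmax over the candidate axis.
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    # Positives sit on the diagonal of the last two axes.
    return -np.diagonal(log_probs, axis1=-2, axis2=-1).mean()


rng = np.random.default_rng(0)
N, T, d = 16, 20, 8  # toy sizes (assumed)

# Toy "noise-only" embedding: one random unit vector per trajectory,
# held constant over time, mimicking slow trajectory-specific noise.
codes = rng.normal(size=(N, d))
codes /= np.linalg.norm(codes, axis=1, keepdims=True)
noise_emb = np.repeat(codes[:, None, :], T, axis=1)  # shape (N, T, d)

# Toy "dynamics-only" embedding: a point on the unit circle whose angle
# advances by 2*pi/T per step, with a random phase per trajectory.
phase = rng.uniform(0.0, 2.0 * np.pi, size=(N, 1))
theta = phase + 2.0 * np.pi * np.arange(T)[None, :] / T
dyn_emb = np.stack([np.cos(theta), np.sin(theta)], axis=-1)  # shape (N, T, 2)

# Perfect-predictor assumption: the prediction equals the target embedding.
noise_across = info_nce(noise_emb, noise_emb, "across")
noise_within = info_nce(noise_emb, noise_emb, "within")
dyn_within = info_nce(dyn_emb, dyn_emb, "within")

print("noise-only, across-trajectory negatives:", noise_across)
print("noise-only, within-trajectory negatives:", noise_within)
print("chance level log(T):", np.log(T))
print("dynamics-only, within-trajectory negatives:", dyn_within)
```

In this toy setting, the noise-only embedding scores below chance with across-trajectory negatives, because the per-trajectory code alone identifies the positive. With within-trajectory negatives every candidate looks identical, so its loss sits exactly at the chance level log(T), while the dynamics-only embedding still scores below chance. This mirrors, in a simplified form, the shortcut and its removal described in the abstract.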