🤖 AI Summary
This study investigates whether non-Markovian memory effects exist in the training dynamics of stochastic gradient descent (SGD). To this end, it introduces the first model-agnostic, architecture-agnostic empirical framework: training trajectories are modeled as multi-time process tensors, and a diagnostic protocol based on distinguishability backflow is devised by analyzing the mapping between controlled intervention sequences and model outputs. Prediction distribution discrepancies are quantified using total variation distance, Jensen–Shannon divergence, and Hellinger distance, complemented by causal interruption experiments and bootstrap confidence intervals. Across diverse experimental settings, robust and statistically significant positive backflow signals are consistently observed, confirming the non-Markovian nature of SGD. This effect intensifies with momentum, batch overlap, and smaller step sizes, and vanishes upon resetting the optimizer state.
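The three distances named above all operate on pairs of softmax prediction vectors. A minimal sketch of how they might be computed (the function names `tv`, `js`, and `hellinger` are illustrative, not from the paper):

```python
import numpy as np

def tv(p, q):
    """Total variation distance between two probability vectors, in [0, 1]."""
    return 0.5 * np.abs(p - q).sum()

def js(p, q, eps=1e-12):
    """Jensen-Shannon divergence (base 2, so bounded in [0, 1])."""
    p, q = p + eps, q + eps  # avoid log(0)
    m = 0.5 * (p + q)        # mixture distribution
    kl = lambda a, b: (a * np.log2(a / b)).sum()
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def hellinger(p, q):
    """Hellinger distance, bounded in [0, 1]."""
    return np.sqrt(0.5 * ((np.sqrt(p) - np.sqrt(q)) ** 2).sum())
```

All three are bounded metrics (or, for JS, a bounded divergence), which is what makes a *difference* of distances across interventions interpretable on a common scale.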
📝 Abstract
This work casts neural training as a \emph{process tensor}: a multi-time map that takes a sequence of controllable instruments (batch choices, augmentations, optimizer micro-steps) and returns an observable of the trained model. Building on this operational lens, we introduce a simple, model-agnostic witness of training memory based on \emph{back-flow of distinguishability}. In a controlled two-step protocol, we compare outcome distributions after one intervention versus two; the increase $\Delta_{\mathrm{BF}} = D_2 - D_1>0$ (with $D\in\{\mathrm{TV}, \mathrm{JS}, \mathrm{H}\}$ measured on softmax predictions over a fixed probe set) certifies non-Markovianity. We observe consistent positive back-flow with tight bootstrap confidence intervals, amplification under higher momentum, larger batch overlap, and more micro-steps, and collapse under a \emph{causal break} (resetting optimizer state), directly attributing the effect to optimizer/data-state memory. The witness is robust across TV/JS/Hellinger, inexpensive to compute, and requires no architectural changes. We position this as a \emph{measurement} contribution: a principled diagnostic and empirical evidence that practical SGD deviates from the Markov idealization. An exploratory case study illustrates how the micro-level signal can inform curriculum orderings. "Data order matters" turns into a testable operator with confidence bounds, and our framework offers a common stage to compare optimizers, curricula, and schedules through their induced training memory.
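The two-step protocol can be sketched as follows. This is a hedged illustration, not the paper's code: it assumes we already have softmax predictions over a fixed probe set from a baseline run (`probe_base`), a one-intervention run (`probe_one`), and a two-intervention run (`probe_two`), and it estimates $\Delta_{\mathrm{BF}} = D_2 - D_1$ (here with $D = \mathrm{TV}$) together with a percentile bootstrap confidence interval over probes. The function name `backflow_witness` and the array shapes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def tv(p, q):
    """Per-probe total variation distance; p, q have shape (n_probes, n_classes)."""
    return 0.5 * np.abs(p - q).sum(axis=-1)

def backflow_witness(probe_base, probe_one, probe_two, n_boot=1000):
    """Estimate Delta_BF = mean(D_2) - mean(D_1) with a 95% bootstrap CI.

    A positive Delta_BF (distinguishability growing from one intervention
    to two) witnesses non-Markovian training memory.
    """
    d1 = tv(probe_base, probe_one)   # distances after one intervention
    d2 = tv(probe_base, probe_two)   # distances after two interventions
    delta = d2 - d1                  # per-probe back-flow signal
    n = len(delta)
    boots = np.array([delta[rng.integers(0, n, n)].mean()
                      for _ in range(n_boot)])
    lo, hi = np.percentile(boots, [2.5, 97.5])
    return delta.mean(), (lo, hi)
```

Resampling over probe points (rather than over training runs) is one plausible reading of the bootstrap; the paper's exact resampling unit may differ.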