Process-Tensor Tomography of SGD: Measuring Non-Markovian Memory via Back-Flow of Distinguishability

๐Ÿ“… 2026-01-23
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
This study investigates whether non-Markovian memory effects exist in the training dynamics of stochastic gradient descent (SGD). To this end, it introduces the first model-agnostic, architecture-agnostic empirical framework: training trajectories are modeled as multi-time process tensors, and a diagnostic protocol based on distinguishability backflow is devised by analyzing the mapping between controlled intervention sequences and model outputs. Prediction distribution discrepancies are quantified using total variation distance, Jensenโ€“Shannon divergence, and Hellinger distance, complemented by causal interruption experiments and bootstrap confidence intervals. Across diverse experimental settings, robust and statistically significant positive backflow signals are consistently observed, confirming the non-Markovian nature of SGD. This effect intensifies with momentum, batch overlap, and smaller step sizes, and vanishes upon resetting the optimizer state.
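The three divergences named above (total variation, Jensen–Shannon, Hellinger) all apply directly to softmax prediction vectors. A minimal sketch of each, using hypothetical example distributions (`p`, `q` are assumptions, not data from the paper):

```python
import numpy as np

def tv_distance(p, q):
    # Total variation: half the L1 distance between distributions.
    return 0.5 * np.abs(p - q).sum()

def js_divergence(p, q):
    # Jensen-Shannon divergence with base-2 logs, so the value lies in [0, 1].
    m = 0.5 * (p + q)
    def kl(a, b):
        mask = a > 0  # skip zero-probability terms (0 * log 0 := 0)
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def hellinger(p, q):
    # Hellinger distance: L2 distance between sqrt-distributions, scaled by 1/sqrt(2).
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

# Illustrative softmax outputs over three classes (hypothetical values).
p = np.array([0.7, 0.2, 0.1])
q = np.array([0.5, 0.3, 0.2])
print(tv_distance(p, q))  # -> 0.2
```

All three are bounded and symmetric, which is presumably why the paper uses them interchangeably as the distinguishability measure $D$.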

๐Ÿ“ Abstract
This work proposes neural training as a \emph{process tensor}: a multi-time map that takes a sequence of controllable instruments (batch choices, augmentations, optimizer micro-steps) and returns an observable of the trained model. Building on this operational lens, we introduce a simple, model-agnostic witness of training memory based on \emph{back-flow of distinguishability}. In a controlled two-step protocol, we compare outcome distributions after one intervention versus two; the increase $\Delta_{\mathrm{BF}} = D_2 - D_1>0$ (with $D\in\{\mathrm{TV}, \mathrm{JS}, \mathrm{H}\}$ measured on softmax predictions over a fixed probe set) certifies non-Markovianity. We observe consistent positive back-flow with tight bootstrap confidence intervals, amplification under higher momentum, larger batch overlap, and more micro-steps, and collapse under a \emph{causal break} (resetting optimizer state), directly attributing the effect to optimizer/data-state memory. The witness is robust across TV/JS/Hellinger, inexpensive to compute, and requires no architectural changes. We position this as a \emph{measurement} contribution: a principled diagnostic and empirical evidence that practical SGD deviates from the Markov idealization. An exploratory case study illustrates how the micro-level signal can inform curriculum orderings. "Data order matters" turns into a testable operator with confidence bounds; our framework offers a common stage to compare optimizers, curricula, and schedules through their induced training memory.
Problem

Research questions and friction points this paper is trying to address.

non-Markovianity
training memory
stochastic gradient descent
back-flow of distinguishability
process tensor
Innovation

Methods, ideas, or system contributions that make the work stand out.

process tensor
non-Markovianity
back-flow of distinguishability
SGD memory
training dynamics
๐Ÿ”Ž Similar Papers
No similar papers found.