🤖 AI Summary
This work addresses the challenge of unknown policy identities in unlabeled multi-policy behavioral data by proposing Behavioral INR, a self-supervised generative model based on Implicit Neural Representations (INRs), the first to apply INRs to unsupervised policy representation learning. The method models each policy as a function mapping states to actions and infers policy identity through an episode-level latent variable that modulates this function via FiLM layers. It accommodates variable-length episodes and heterogeneous sampling granularities, and defines policy-level out-of-distribution (OOD) shifts along state-distribution and action-distribution axes. Evaluated on synthetic Gaussian random field data, MuJoCo demonstrations with controlled OOD splits, and real-world chess, Formula 1 racing, robotics, and Seek-Avoid datasets, Behavioral INR most consistently improves policy identifiability in the hardest continuous state-action settings, particularly with longer episodes, more policies, and OOD splits. Amortized history encoders remain competitive when policy identity can be recovered from symbolic repetition or low-dimensional action statistics.
📝 Abstract
We study policy representation learning from unlabeled multi-policy behavioral data. Each episode is generated by a fixed policy, but policy labels are unavailable. This setting appears in robotics play, demonstrations, games, racing, and other datasets where heterogeneous behaviors are mixed without annotations. We introduce Behavioral INR, a self-supervised generative model that adapts implicit neural representations (INRs) from vision to behavior. Instead of mapping coordinates to RGB values, Behavioral INR represents a policy as a state-action function mapping states to subsequent actions. An episode-level latent modulates this function through FiLM layers, yielding a generative prior over policies and allowing policy identity to be inferred without supervision. Because INRs treat each datapoint as a set of samples from an underlying function, the same model naturally accommodates variable episode lengths and different sampling granularities, as in vision INRs with different image resolutions. We also define policy-level out-of-distribution (OOD) shifts along state-distribution and action-distribution axes; these shifts arise when policies overlap in states or actions, and they are not captured by standard behavioral OOD settings based only on new agents or environments. We evaluate on synthetic Gaussian random field data, MuJoCo demonstrations with controlled OOD splits, and real-world chess, Formula 1 racing, robotics, and Seek-Avoid datasets. Behavioral INR most consistently improves policy identifiability in the hardest continuous state-action settings, especially when longer episodes, more policies, and OOD splits reduce the usefulness of marginal shortcuts; amortized history encoders remain competitive when policy identity can be recovered from symbolic repetition or low-dimensional action statistics. We release code and checkpoints.
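To make the core idea concrete, the following is a minimal NumPy sketch of a state-to-action function whose hidden features are scaled and shifted (FiLM) by an episode-level latent. It is an illustration under assumptions, not the released implementation: the layer sizes, the single hidden layer, the `tanh` nonlinearity, the random weights, and the names `init_params` and `film_policy` are all hypothetical.

```python
import numpy as np

# Hypothetical sizes, chosen only for illustration.
STATE_DIM, ACTION_DIM, HIDDEN_DIM, LATENT_DIM = 4, 2, 32, 8


def init_params(rng):
    """Random weights for a one-hidden-layer, FiLM-modulated MLP (assumed architecture)."""
    return {
        "W1": rng.normal(0.0, 0.1, (STATE_DIM, HIDDEN_DIM)),
        "b1": np.zeros(HIDDEN_DIM),
        "W_gamma": rng.normal(0.0, 0.1, (LATENT_DIM, HIDDEN_DIM)),
        "W_beta": rng.normal(0.0, 0.1, (LATENT_DIM, HIDDEN_DIM)),
        "W2": rng.normal(0.0, 0.1, (HIDDEN_DIM, ACTION_DIM)),
        "b2": np.zeros(ACTION_DIM),
    }


def film_policy(params, z, states):
    """Map states of shape (T, STATE_DIM) to actions of shape (T, ACTION_DIM).

    The episode-level latent z produces a per-feature scale (gamma) and
    shift (beta) that modulate the shared hidden layer.
    """
    gamma = 1.0 + z @ params["W_gamma"]  # scale, centred at 1
    beta = z @ params["W_beta"]          # shift
    hidden = states @ params["W1"] + params["b1"]
    hidden = np.tanh(gamma * hidden + beta)
    return hidden @ params["W2"] + params["b2"]


rng = np.random.default_rng(0)
params = init_params(rng)

# Two latents stand in for two different (unlabeled) policies.
z_a = rng.normal(size=LATENT_DIM)
z_b = rng.normal(size=LATENT_DIM)

# Episodes of different lengths pass through the same function unchanged.
short_episode = rng.normal(size=(5, STATE_DIM))
long_episode = rng.normal(size=(50, STATE_DIM))

actions_short = film_policy(params, z_a, short_episode)
actions_long = film_policy(params, z_a, long_episode)
actions_other = film_policy(params, z_b, short_episode)
```

Because the function is evaluated independently at each state, episode length only changes the number of rows in the input. Swapping the latent changes the actions produced for the same states, which is the property that lets a latent act as a policy identifier. How the latent is inferred per episode and how the model is trained are not specified here and are left out of the sketch.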