S2M-Trek: From Single to Multi-Sphere Transport via Per-Frame Deep Sets on a Wheel-Legged Robot

📅 2026-05-31

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the simultaneous transport of multiple free-rolling spheres on the back of a wheel-legged quadruped, without fences, grippers, or mechanical stops. Because identical spheres have no persistent identity, their ordering can change independently at every history frame, and this per-frame permutation symmetry hinders policy learning. The authors propose the Per-Frame Deep Sets (PFDS) architecture, which applies permutation-invariant pooling within each history frame before temporal readout, so the symmetry is enforced frame by frame. They prove that PFDS is invariant to per-frame permutations and universally approximates continuous policies with that invariance. Within the same PPO training budget, PFDS reaches the five-sphere stage with 100% no-drop transport in simulation across all five random seeds, whereas flat MLPs and branch-wise encoders plateau at or below the two-sphere stage, and a history-concatenation Deep Sets baseline does not progress past two spheres unless ball-to-slot assignments are randomised during training. The PFDS teacher is then distilled via DAgger into TactSet, which replaces privileged sphere-state observations with a 16×16 Boolean union contact map.
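
The difference between pooling inside each frame and pooling over concatenated histories can be illustrated with a small NumPy sketch. This is not the authors' implementation: the sizes, the random weights, the sum pooling, the linear readout, and the function names `pfds` and `hcds` are all assumptions chosen only to make the two symmetry properties checkable.

```python
import numpy as np

# All sizes below are hypothetical, chosen for illustration only.
T, N, D = 4, 5, 6   # history frames, spheres, features per sphere
H, A = 8, 3         # hidden width, action dimension

rng = np.random.default_rng(0)
W_phi = rng.normal(size=(D, H)) / np.sqrt(D)             # shared per-sphere embedding
W_rho = rng.normal(size=(T * H, A))                      # temporal readout
W_phi_hc = rng.normal(size=(T * D, H)) / np.sqrt(T * D)  # per-slot history embedding
W_rho_hc = rng.normal(size=(H, A))


def pfds(history):
    """Per-frame pooling: history has shape (T, N, D)."""
    emb = np.tanh(history @ W_phi)       # (T, N, H)
    pooled = emb.sum(axis=1)             # (T, H), order of spheres is discarded per frame
    return pooled.reshape(-1) @ W_rho    # read out over the ordered frame sequence


def hcds(history):
    """History-concatenation Deep Sets: one set element per slot."""
    per_slot = history.transpose(1, 0, 2).reshape(N, T * D)  # (N, T*D)
    pooled = np.tanh(per_slot @ W_phi_hc).sum(axis=0)        # (H,)
    return pooled @ W_rho_hc


def permute_per_frame(history, rng):
    """Apply an independent sphere permutation to every frame."""
    return np.stack([frame[rng.permutation(N)] for frame in history])


def permute_shared(history, rng):
    """Apply one sphere permutation to all frames (diagonal symmetry)."""
    return history[:, rng.permutation(N)]


x = rng.normal(size=(T, N, D))
x_frame = permute_per_frame(x, rng)
x_shared = permute_shared(x, rng)

print("PFDS invariant to per-frame permutations:", np.allclose(pfds(x), pfds(x_frame)))
print("HCDS invariant to a shared permutation:  ", np.allclose(hcds(x), hcds(x_shared)))
print("HCDS invariant to per-frame permutations:", np.allclose(hcds(x), hcds(x_frame)))
```

In this toy setting the per-frame encoder ignores any independent reordering of spheres, while the history-concatenation encoder is only guaranteed to ignore a reordering that is shared across all frames.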

📝 Abstract

We study the problem of scaling dynamic loco-manipulation from a single free-rolling sphere to multiple spheres transported simultaneously on the back of a wheel-legged quadruped, without fences, grippers, or mechanical stops. Multiple identical free-rolling spheres form an unordered set with no persistent identity: their ordering may change independently at each history frame, creating a \emph{per-frame permutation symmetry} that standard history-concatenation set encoders do not explicitly enforce -- these encoders impose only a shared, diagonal permutation symmetry over the full history. We show that this symmetry mismatch leads to a concrete failure mode in curriculum-based reinforcement learning. Within the same PPO training budget, flat MLPs and branch-wise encoders plateau at or below the two-sphere stage, while a history-concatenation Deep Sets baseline (\HCDS) fails to progress past the two-sphere stage in our runs unless ball-to-slot assignments are randomised during training, suggesting that it exploits slot indices as a curriculum shortcut rather than learning identity-free multi-sphere dynamics. We propose \textbf{Per-Frame Deep Sets (\PFDS)}, which performs permutation-invariant pooling within each history frame before temporal readout; we prove that \PFDS is $\Gframe$-invariant and universally approximates continuous $\Gframe$-invariant policies. A $2{\times}2$ ablation over encoder architecture and slot randomisation separates the architectural and data-augmentation pathways, and \PFDS reaches the five-sphere stage with 100\% no-drop transport in simulation across all five random seeds. We further distill the \PFDS teacher into \TactSet via DAgger, replacing privileged sphere-state observations with a $16{\times}16$ Boolean union contact map, yielding a compact and naturally $\Gframe$-invariant tactile representation.
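
To see why a Boolean union contact map carries no sphere identity, consider the following minimal sketch. The abstract only specifies the $16{\times}16$ Boolean union map; the normalised contact coordinates, the one-cell-per-sphere footprint, and the helper name `union_contact_map` are assumptions made for illustration, not the paper's sensor model.

```python
import numpy as np


def union_contact_map(positions, grid=16):
    """Rasterise sphere contact points into one Boolean grid.

    positions: (N, 2) array of (x, y) contact points in [0, 1) x [0, 1).
    This coordinate convention and the single-cell footprint are hypothetical.
    """
    cells = np.clip((np.asarray(positions) * grid).astype(int), 0, grid - 1)
    contact = np.zeros((grid, grid), dtype=bool)
    # Writing True per sphere acts as a logical OR: which sphere touched a cell is lost.
    contact[cells[:, 1], cells[:, 0]] = True
    return contact


positions = np.array([
    [0.1, 0.2],
    [0.5, 0.5],
    [0.9, 0.1],
    [0.3, 0.8],
    [0.7, 0.6],
])
cmap = union_contact_map(positions)
cmap_reordered = union_contact_map(positions[::-1])

print("same map after reordering spheres:", np.array_equal(cmap, cmap_reordered))
```

Because the map is a union over spheres, any reordering of the spheres within a frame produces the same observation, which is consistent with the abstract's description of the tactile representation as naturally invariant.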

Problem

Research questions and friction points this paper is trying to address.

multi-sphere transport

per-frame permutation symmetry

loco-manipulation

wheel-legged robot

identity-free dynamics

Innovation

Methods, ideas, or system contributions that make the work stand out.

Per-Frame Deep Sets

permutation symmetry

loco-manipulation