Gradient-Free Deep Reinforcement Learning with TabPFN

📅 2025-09-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Gradient-based deep reinforcement learning (DRL) methods suffer from hyperparameter sensitivity, training instability, and high computational overhead. To address these challenges, this paper proposes TabPFN RL—the first framework leveraging the meta-trained Transformer TabPFN for gradient-free RL. TabPFN RL replaces conventional Q-networks with TabPFN as a non-parametric Q-function approximator and performs zero-shot Q-value inference via in-context learning, eliminating backpropagation entirely. It introduces two key innovations: (i) a high-reward trajectory retention mechanism and (ii) an episode-gating strategy, both designed to mitigate context-length limitations. We formally analyze capacity constraints and truncation bounds inherent in contextual RL. Evaluated on Gymnasium’s classic control benchmarks, TabPFN RL matches or surpasses DQN in performance—without gradient updates or hyperparameter tuning—demonstrating the feasibility and promise of pre-trained foundation models for efficient, lightweight RL.
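The core mechanism above, using TabPFN's fit/predict interface as a non-parametric Q-function, can be sketched in a few lines. The sketch below is a minimal illustration, not the paper's implementation: a 1-nearest-neighbour regressor stands in for TabPFN (both expose `fit`/`predict` and neither uses gradient updates), and the state-plus-one-hot-action feature encoding, along with the names `InContextQ` and `greedy_action`, are assumptions for illustration.

```python
import numpy as np

class InContextQ:
    """Q-function whose 'training' is just storing a context dataset.

    Stand-in for TabPFN: fitting stores the context, and prediction is a
    single inference pass with no backpropagation.
    """

    def fit(self, X_ctx, y_ctx):
        self.X_ctx = np.asarray(X_ctx, dtype=float)
        self.y_ctx = np.asarray(y_ctx, dtype=float)
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        # One "forward pass": copy the label of the closest context point.
        # TabPFN would instead run a transformer over the whole context
        # in a single inference call.
        d = ((X[:, None, :] - self.X_ctx[None, :, :]) ** 2).sum(-1)
        return self.y_ctx[d.argmin(axis=1)]

def greedy_action(model, state, n_actions):
    """Score every discrete action for `state` and take the argmax Q-value."""
    queries = np.stack([np.concatenate([state, np.eye(n_actions)[a]])
                        for a in range(n_actions)])
    return int(model.predict(queries).argmax())
```

Action selection thus reduces to one batched prediction over all candidate actions, which is why no gradient step ever occurs at training or deployment time.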

📝 Abstract
Gradient-based optimization is fundamental to most modern deep reinforcement learning algorithms; however, it introduces significant sensitivity to hyperparameters, unstable training dynamics, and high computational costs. We propose TabPFN RL, a novel gradient-free deep RL framework that repurposes the meta-trained transformer TabPFN as a Q-function approximator. Originally developed for tabular classification, TabPFN is a transformer pre-trained on millions of synthetic datasets to perform inference on new, unseen datasets via in-context learning. Given an in-context dataset of sample-label pairs and new unlabeled data, it predicts the most likely labels in a single forward pass, without gradient updates or task-specific fine-tuning. We use TabPFN to predict Q-values using inference only, thereby eliminating the need for backpropagation at both training and inference time. To cope with the model's fixed context budget, we design a high-reward episode gate that retains only the top 5% of trajectories. Empirical evaluations on the Gymnasium classic control suite demonstrate that TabPFN RL matches or surpasses Deep Q-Network on CartPole-v1, MountainCar-v0, and Acrobot-v1, without applying gradient descent or any extensive hyperparameter tuning. We discuss the theoretical aspects of how bootstrapped targets and non-stationary visitation distributions violate the independence assumptions encoded in TabPFN's prior, yet the model retains a surprising generalization capacity. We further formalize the intrinsic context-size limit of in-context RL algorithms and propose principled truncation strategies that enable continual learning when the context is full. Our results establish prior-fitted networks such as TabPFN as a viable foundation for fast and computationally efficient RL, opening new directions for gradient-free RL with large pre-trained transformers.
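The abstract's point about bootstrapped targets can be made concrete: each context label is itself produced by inference, y = r + γ · max_a' Q(s', a'), so successive labels are correlated rather than i.i.d. The sketch below is an assumed workflow, not the paper's code; `q_predict` is a hypothetical stand-in for a TabPFN forward pass over the current context, and the state-plus-one-hot-action encoding is an illustrative choice.

```python
import numpy as np

def bootstrapped_targets(transitions, q_predict, n_actions, gamma=0.99):
    """For each (s, a, r, s_next, done), compute y = r + gamma * max_a' Q(s_next, a').

    All Q-values come from `q_predict` (inference only), so target
    construction never requires a gradient step.
    """
    targets = []
    for s, a, r, s_next, done in transitions:
        if done:
            targets.append(r)  # terminal state: no bootstrap term
            continue
        # Query Q(s_next, a') for every discrete action a' in one batch.
        queries = np.stack([np.concatenate([s_next, np.eye(n_actions)[b]])
                            for b in range(n_actions)])
        targets.append(r + gamma * q_predict(queries).max())
    return np.array(targets)
```

The resulting (state-action, target) pairs are appended to the context dataset, which is exactly where the fixed context budget, and hence the truncation strategies the abstract formalizes, comes into play.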
Problem

Research questions and friction points this paper is trying to address.

Eliminates gradient-based optimization in deep reinforcement learning
Uses TabPFN transformer for Q-value prediction without backpropagation
Addresses fixed context budget via high-reward trajectory retention
Innovation

Methods, ideas, or system contributions that make the work stand out.

Gradient-free Q-function approximation using the TabPFN transformer
In-context learning without backpropagation or fine-tuning
High-reward episode gate retaining only the top 5% of trajectories
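The episode gate listed above can be sketched as a simple return-based filter. This is a minimal sketch under assumptions: the function name `gate_episodes`, the `(return, transitions)` tuple layout, and the `min_keep` floor are illustrative choices; only the top-5% retention fraction comes from the paper.

```python
def gate_episodes(episodes, keep_frac=0.05, min_keep=1):
    """Keep only the highest-return trajectories when the context budget is hit.

    episodes: list of (episode_return, transitions) tuples.
    keep_frac: fraction of episodes to retain (paper uses the top 5%).
    """
    k = max(min_keep, int(len(episodes) * keep_frac))
    # Sort by return, descending, and truncate to the retained budget.
    return sorted(episodes, key=lambda ep: ep[0], reverse=True)[:k]
```

Gating at episode granularity (rather than per-transition) keeps each retained trajectory intact, so bootstrapped targets inside it remain internally consistent.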