Discovering Reinforcement Learning Interfaces with Large Language Models

📅 2026-05-05

🤖 AI Summary

This work addresses the heavy manual effort required to define reinforcement learning environment interfaces, specifically observation mappings and reward functions. Prior LLM-based automation targets reward design only and assumes fixed observations. The authors propose LIMEN, a framework that jointly discovers both observation and reward functions from raw simulator state. LIMEN uses a large language model to guide an evolutionary search over candidate interfaces expressed as executable programs, iteratively refining each interface using policy training feedback. Given only a trajectory-level success metric, LIMEN discovers effective interfaces on discrete gridworld tasks and on continuous control tasks spanning locomotion and manipulation. In contrast, optimizing either component in isolation fails on at least one domain, which suggests that observations and rewards often benefit from co-design and that automatic interface construction can substantially reduce engineering effort.
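To make "interface as an executable program" concrete, here is a minimal sketch of what one candidate might look like for a gridworld. The raw-state layout (`agent` and `goal` coordinates), the `Interface` container, and both function bodies are assumptions for illustration only; they are not taken from the LIMEN code.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical raw simulator state: named grid coordinates.
RawState = Dict[str, Tuple[int, int]]


@dataclass
class Interface:
    """One candidate interface: an observation mapping plus a reward function."""
    observe: Callable[[RawState], List[float]]
    reward: Callable[[RawState, RawState], float]


def observe(state: RawState) -> List[float]:
    # Observation mapping: expose the goal position relative to the agent.
    ax, ay = state["agent"]
    gx, gy = state["goal"]
    return [float(gx - ax), float(gy - ay)]


def manhattan_to_goal(state: RawState) -> int:
    (ax, ay), (gx, gy) = state["agent"], state["goal"]
    return abs(gx - ax) + abs(gy - ay)


def reward(prev: RawState, curr: RawState) -> float:
    # Reward function: progress toward the goal, plus a bonus on arrival.
    progress = manhattan_to_goal(prev) - manhattan_to_goal(curr)
    bonus = 1.0 if manhattan_to_goal(curr) == 0 else 0.0
    return float(progress) + bonus


interface = Interface(observe=observe, reward=reward)
prev_state = {"agent": (0, 0), "goal": (2, 1)}
curr_state = {"agent": (1, 0), "goal": (2, 1)}
obs = interface.observe(curr_state)
step_reward = interface.reward(prev_state, curr_state)
```

In this sketch both functions read the same raw state, which is one way to see why the two components might need to be designed together: a reward that depends on distance to the goal is hard to learn from if the observation hides the goal's position.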

📝 Abstract

Reinforcement learning systems rely on environment interfaces that specify observations and reward functions, yet constructing these interfaces for new tasks often requires substantial manual effort. While recent work has automated reward design using large language models (LLMs), these approaches assume fixed observations and do not address the broader challenge of synthesizing complete task interfaces. We study RL task interface discovery from raw simulator state, where both observation mappings and reward functions must be generated. We propose LIMEN (Code available at https://github.com/Lossfunk/LIMEN), an LLM-guided evolutionary framework that produces candidate interfaces as executable programs and iteratively refines them using policy training feedback. Across novel discrete gridworld tasks and continuous control domains spanning locomotion and manipulation, joint evolution of observations and rewards discovers effective interfaces given only a trajectory-level success metric, while optimizing either component alone fails catastrophically on at least one domain in our evaluation suite. These results demonstrate that automatic construction of RL interfaces from raw state can substantially reduce manual engineering and that observation and reward components often benefit from co-design.
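The propose-and-refine loop described in the abstract can be sketched roughly as follows. This is a generic evolutionary loop under stated assumptions, not LIMEN's actual algorithm: `propose` stands in for an LLM call that rewrites an interface program given its score, `evaluate` stands in for training a policy and measuring trajectory-level success, and the selection rule (keep the top half) is an arbitrary choice. All names are hypothetical.

```python
from typing import Callable, List, Tuple

Candidate = str  # source code of one interface program (observation + reward)
Scored = Tuple[Candidate, float]


def evolve_interfaces(
    seeds: List[Candidate],
    propose: Callable[[Candidate, float], Candidate],
    evaluate: Callable[[Candidate], float],
    generations: int = 5,
    population_size: int = 4,
) -> Scored:
    """Return the best (candidate, score) pair found by a simple evolutionary loop."""
    population: List[Scored] = [(c, evaluate(c)) for c in seeds]
    for _ in range(generations):
        # Selection: keep the highest-scoring candidates as parents.
        population.sort(key=lambda pair: pair[1], reverse=True)
        parents = population[: max(1, population_size // 2)]
        # Variation: ask the proposer for a revised program per parent.
        children = [propose(code, score) for code, score in parents]
        # Feedback: score each child and carry parents forward unchanged.
        population = parents + [(c, evaluate(c)) for c in children]
    return max(population, key=lambda pair: pair[1])


# Toy stand-ins so the sketch runs without an LLM or a simulator (assumptions).
def toy_propose(code: Candidate, score: float) -> Candidate:
    return code + "+"  # pretend each revision appends one improvement


def toy_evaluate(code: Candidate) -> float:
    return min(code.count("+") / 3.0, 1.0)  # pretend success rate in [0, 1]


best_code, best_score = evolve_interfaces(["base"], toy_propose, toy_evaluate)
```

In a real setting `evaluate` would dominate the cost, since each call implies a policy training run, so caching scores for unchanged parents (as done here) is a natural design choice.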

Problem

Research questions and friction points this paper is trying to address.

reinforcement learning

task interface

observation mapping

reward function

large language models

Innovation

Methods, ideas, or system contributions that make the work stand out.

reinforcement learning interfaces

large language models

evolutionary framework