🤖 AI Summary
Instruction-following methods typically depend on human-defined subtasks and annotated data. This work proposes SuperIgor, a framework that introduces the first co-learning paradigm between a language model and a reinforcement learning agent without any predefined subtasks. The language model autonomously generates high-level plans, a goal-conditioned reinforcement learning agent executes them, and preference-based feedback from execution drives iterative refinement of the plans, forming a closed co-training loop. Experiments show that SuperIgor substantially reduces dependence on human annotations, follows instructions more faithfully in complex, dynamic environments, and generalizes well to unseen instructions.
📝 Abstract
We introduce SuperIgor, a framework for instruction-following tasks. Unlike prior methods that rely on predefined subtasks, SuperIgor enables a language model to generate and refine high-level plans through a self-learning mechanism, reducing the need for manual dataset annotation. Our approach involves iterative co-training: an RL agent is trained to follow the generated plans, while the language model adapts and modifies these plans based on RL feedback and preferences. This creates a feedback loop where both the agent and the planner improve jointly. We validate our framework in environments with rich dynamics and stochasticity. Results show that SuperIgor agents adhere to instructions more strictly than baseline methods, while also demonstrating strong generalization to previously unseen instructions.
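The co-training loop described in the abstract can be illustrated with a small toy sketch. This is not the authors' implementation: the planner here is a table of preference scores rather than a language model, the agent is a table of per-subgoal success probabilities rather than a trained RL policy, and every name (`ToyPlanner`, `ToyAgent`, `co_train`, the subgoals `get_key`, `open_door`, `reach_goal`, and all constants) is hypothetical. It only shows the shape of the loop: the planner proposes plans, the agent practices and executes them, and the resulting success rates are turned into preferences that update the planner.

```python
import random

# Hypothetical toy task: each subgoal lists the subgoals that must be done first.
PREREQS = {"get_key": set(), "open_door": {"get_key"}, "reach_goal": {"open_door"}}
GOAL = "reach_goal"

# Hypothetical candidate plans; only the second one respects the prerequisites.
CANDIDATE_PLANS = [
    ("open_door", "get_key", "reach_goal"),
    ("get_key", "open_door", "reach_goal"),
    ("get_key", "reach_goal", "open_door"),
]


class ToyAgent:
    """Stand-in for a goal-conditioned RL agent: one success probability per subgoal."""

    def __init__(self, rng):
        self.rng = rng
        self.skill = {}

    def train(self, plan):
        # Practicing a plan raises the skill on each of its subgoals (capped at 0.95).
        for subgoal in plan:
            p = self.skill.setdefault(subgoal, 0.3)
            self.skill[subgoal] = min(0.95, p + 0.05)

    def rollout(self, plan):
        # Attempt subgoals in order; the episode fails on a bad ordering or a failed subgoal.
        done = set()
        for subgoal in plan:
            if not PREREQS[subgoal] <= done:
                return False
            if self.rng.random() >= self.skill.setdefault(subgoal, 0.3):
                return False
            done.add(subgoal)
        return GOAL in done


class ToyPlanner:
    """Stand-in for the language-model planner: a preference score per candidate plan."""

    def __init__(self, plans, rng):
        self.rng = rng
        self.scores = {plan: 0.0 for plan in plans}

    def propose_pair(self):
        return self.rng.sample(list(self.scores), 2)

    def update(self, preferred, rejected, lr=0.5):
        # Crude substitute for preference-based fine-tuning.
        self.scores[preferred] += lr
        self.scores[rejected] -= lr

    def best_plan(self):
        return max(self.scores, key=self.scores.get)


def co_train(rounds=30, episodes=20, seed=0):
    rng = random.Random(seed)
    planner, agent = ToyPlanner(CANDIDATE_PLANS, rng), ToyAgent(rng)
    for _ in range(rounds):
        plan_a, plan_b = planner.propose_pair()
        for plan in (plan_a, plan_b):
            agent.train(plan)  # the agent learns to follow the proposed plans
        rate_a = sum(agent.rollout(plan_a) for _ in range(episodes)) / episodes
        rate_b = sum(agent.rollout(plan_b) for _ in range(episodes)) / episodes
        if rate_a > rate_b:  # agent feedback becomes a preference pair
            planner.update(plan_a, plan_b)
        elif rate_b > rate_a:
            planner.update(plan_b, plan_a)
    return planner, agent
```

Calling `co_train()` returns the toy planner and agent; `planner.best_plan()` then reports the plan the agent could execute most reliably. In this sketch the infeasible orderings never succeed, so they are never preferred, which mirrors the idea that execution feedback steers the planner toward plans the agent can actually carry out.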