🤖 AI Summary
This work addresses inverse reinforcement learning (IRL) when demonstrations come from multiple suboptimal, heterogeneous demonstrators rather than a single optimal expert, which makes accurate reward recovery difficult. The authors propose a framework based on feasible reward sets: each demonstrator's declared suboptimality level is encoded as a linear constraint on the reward, and the resulting feasible sets are intersected across demonstrators. Theoretically, they show that the joint feasible set shrinks monotonically as data are added and give an exact characterization of when a new demonstrator strictly tightens it. They also establish two recovery guarantees for the feasible reward set of the ground-truth optimal demonstrator: one bound depends on closeness to the optimal occupancy, while the other requires only sufficient coverage and no near-optimal demonstrator. On the practical side, they introduce strategies for resolving the reward ambiguity that remains within the feasible set and provide an offline algorithm with function approximation for high-dimensional environments. Experiments in tabular grid-world and large language model (LLM) fine-tuning settings are consistent with the theoretical predictions and show the framework to be effective relative to baselines.
📝 Abstract
Inverse reinforcement learning (IRL) typically assumes demonstrations from a single optimal demonstrator, but in many applications data come from multiple imperfect demonstrators with heterogeneous suboptimality levels. We study reward learning in this setting through a feasible-reward-set framework: for each demonstrator, we encode its declared suboptimality level as a linear constraint and intersect the resulting feasible sets across demonstrators. Our theoretical analysis shows that the joint feasible set shrinks monotonically as data are added, and we give an exact characterization of when a new demonstrator strictly tightens it. We further establish two recovery guarantees for the feasible reward set of the ground-truth optimal demonstrator: one bound depends on closeness to the optimal occupancy, while the other requires only sufficient coverage and no near-optimal demonstrator. On the practical side, we introduce strategies to address the inherent reward ambiguity in the obtained reward set and provide an offline algorithm with function approximation for high-dimensional environments. Experiments in tabular grid-world and large language model (LLM) fine-tuning settings are consistent with the theoretical predictions and demonstrate the effectiveness of the proposed framework over baselines.
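To make the feasible-set intersection concrete, here is a minimal sketch in Python. It is not the paper's algorithm, and the abstract does not state the exact form of the constraint. The sketch assumes that a reward is a vector over state-action pairs, that a demonstrator's return is the inner product of its occupancy measure with that reward, and that a declared suboptimality level `eps` bounds the gap to the best occupancy in a finite candidate set standing in for all policies. The names `suboptimality`, `in_feasible_set`, and `candidates` are hypothetical.

```python
import numpy as np


def suboptimality(reward, occupancy, candidates):
    """Gap between the best candidate return and the demonstrator's return.

    Assumption: return = <occupancy, reward>, and `candidates` (one occupancy
    per row) is a finite stand-in for the set of all achievable occupancies.
    """
    best_return = np.max(candidates @ reward)
    return best_return - occupancy @ reward


def in_feasible_set(reward, demonstrators, candidates, tol=1e-9):
    """True if `reward` satisfies every demonstrator's constraint.

    Each demonstrator is a pair (occupancy, eps). For a fixed candidate mu,
    the condition <mu - occupancy, reward> <= eps is linear in the reward,
    so the joint feasible set is an intersection of half-spaces.
    """
    return all(
        suboptimality(reward, mu, candidates) <= eps + tol
        for mu, eps in demonstrators
    )


# Toy one-state problem with three actions (illustrative numbers only).
candidates = np.eye(3)  # occupancies of the three deterministic policies
demo_1 = (np.array([0.6, 0.3, 0.1]), 0.5)
demo_2 = (np.array([0.1, 0.8, 0.1]), 0.3)

r_a = np.array([1.0, 0.0, 0.0])
r_b = np.array([0.5, 0.5, 0.0])

# r_a is consistent with demo_1 alone but is ruled out once demo_2 is added.
only_first = in_feasible_set(r_a, [demo_1], candidates)
both_a = in_feasible_set(r_a, [demo_1, demo_2], candidates)
# r_b survives both constraints.
both_b = in_feasible_set(r_b, [demo_1, demo_2], candidates)
```

In this toy setup, adding the second demonstrator can only remove rewards from the set, never add them, which is the monotone-shrinkage behavior the abstract describes. A demonstrator whose constraint is already implied by the others would leave the set unchanged.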