🤖 AI Summary
Sparse and uninformative reward functions severely hinder training efficiency in long-horizon reinforcement learning tasks. To address this, we propose a runtime reward monitor grounded in quantitative Linear Temporal Logic over finite traces (LTL_f[F]), the first application of LTL_f[F] to reward synthesis. This approach overcomes the dual limitations of weak expressivity and reward sparsity inherent in conventional Boolean semantics, while enabling the modeling of non-Markovian properties. Our method employs a state-labeling-function-driven, algorithm-agnostic framework that automatically synthesizes dense, interpretable, quantitative reward signals. Evaluated across diverse long-horizon decision-making benchmarks, it achieves substantial improvements in task success rates and accelerates convergence by an average of 37%, consistently outperforming Boolean-reward baselines.
📝 Abstract
Specifying informative and dense reward functions remains a pivotal challenge in Reinforcement Learning, as it directly affects the efficiency of agent training. In this work, we harness the expressive power of quantitative Linear Temporal Logic on finite traces ($\text{LTL}_f[\mathcal{F}]$) to synthesize reward monitors that generate a dense stream of rewards for runtime-observable state trajectories. By providing nuanced feedback during training, these monitors guide agents toward optimal behaviour and help mitigate the well-known issue of sparse rewards in long-horizon decision making, which arises under the Boolean semantics that dominates the current literature. Our framework is algorithm-agnostic, relies only on a state labelling function, and naturally accommodates the specification of non-Markovian properties. Empirical results show that our quantitative monitors consistently subsume and, depending on the environment, outperform Boolean monitors in maximizing a quantitative measure of task completion and in reducing convergence time.
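To make the idea concrete, here is a minimal, hypothetical sketch (not the paper's implementation) of how a quantitative monitor for the single formula F(goal) could emit dense rewards. It assumes a labelling function that maps each observed state to a value in [0, 1] measuring closeness to satisfying the atomic proposition `goal`; under a quantitative "eventually" semantics, the monitor tracks the best value seen along the finite trace and rewards the agent for each improvement, instead of a single Boolean payoff at the end.

```python
def label(state, goal):
    """Assumed labelling function: closeness of `state` to `goal` in [0, 1].
    Here states and goals are scalars and 10.0 is an arbitrary scale."""
    return max(0.0, 1.0 - abs(state - goal) / 10.0)

class EventuallyMonitor:
    """Runtime reward monitor for F(goal) under a quantitative semantics:
    the value of the formula on a trace prefix is the maximum labelling
    value seen so far; each step's reward is the increase in that value."""

    def __init__(self, goal):
        self.goal = goal
        self.best = 0.0  # quantitative value of F(goal) on the prefix so far

    def step(self, state):
        v = label(state, self.goal)
        reward = max(0.0, v - self.best)  # dense reward: measurable progress
        self.best = max(self.best, v)
        return reward

# Example trace approaching goal = 5: rewards telescope to the final
# quantitative value of F(goal), so total reward equals self.best.
monitor = EventuallyMonitor(goal=5)
rewards = [monitor.step(s) for s in [0, 2, 4, 5, 3]]
```

Because the per-step rewards telescope, their sum equals the quantitative value of F(goal) on the whole trace, so maximizing return still aligns with the temporal specification while the agent receives feedback at every step rather than only on task completion.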