🤖 AI Summary
This work addresses a key limitation of linear reinforcement learning, namely its reliance on handcrafted features or predefined kernels, by proposing an automatic feature-learning approach based on the Koopman operator. The authors reformulate Least-Squares Policy Iteration (LSPI) within the framework of Extended Dynamic Mode Decomposition (EDMD) and integrate it into a Koopman Autoencoder (KAE) architecture, achieving the first end-to-end coupling of LSPI with the Koopman operator that requires no pre-specified features or kernels. Experiments on stochastic chain-walk and inverted-pendulum control tasks show that the method automatically learns a reasonable number of features and converges to policies comparable to those of classical LSPI and kernel-based methods, confirming its effectiveness and practicality.
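To make the summary concrete, the following is a minimal, hypothetical sketch of the Koopman-autoencoder idea: an encoder lifts states into a latent space where a linear Koopman matrix advances the dynamics one step, and a decoder maps back. All weights, dimensions, and loss terms here are toy illustration choices, not the paper's actual architecture or training procedure.

```python
import numpy as np

rng = np.random.default_rng(1)
state_dim, latent_dim = 2, 8  # toy dimensions, chosen for illustration

# Randomly initialized toy weights (a real KAE would learn these).
W_enc = rng.standard_normal((latent_dim, state_dim)) * 0.1
K = rng.standard_normal((latent_dim, latent_dim)) * 0.1      # Koopman matrix
W_dec = rng.standard_normal((state_dim, latent_dim)) * 0.1

def encode(s):
    # Nonlinear lifting of the state into Koopman feature space.
    return np.tanh(W_enc @ s)

def decode(z):
    # Linear readout from feature space back to the state space.
    return W_dec @ z

s = rng.standard_normal(state_dim)
s_next = rng.standard_normal(state_dim)

# In the lifted space, the dynamics are advanced linearly by K.
z_pred = K @ encode(s)

# A KAE-style objective combines reconstruction and latent prediction error;
# training would minimize their sum over observed trajectories.
recon_loss = np.sum((decode(encode(s)) - s) ** 2)
pred_loss = np.sum((z_pred - encode(s_next)) ** 2)
```

The point of the sketch is only the structure: nonlinear encoder, strictly linear latent dynamics, and a decoder, which is what lets linear methods such as LSPI operate on the learned features.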
📝 Abstract
In this paper, we present a Koopman autoencoder-based least-squares policy iteration (KAE-LSPI) algorithm in reinforcement learning (RL). The KAE-LSPI algorithm is based on reformulating the so-called least-squares fixed-point approximation method in terms of extended dynamic mode decomposition (EDMD), thereby enabling automatic feature learning via the Koopman autoencoder (KAE) framework. The approach is motivated by the lack of a systematic way to choose features or kernels in linear RL techniques. We compare the KAE-LSPI algorithm with two previous methods, classical least-squares policy iteration (LSPI) and kernel-based least-squares policy iteration (KLSPI), on stochastic chain-walk and inverted-pendulum control problems. Unlike previous works, our approach requires no features or kernels to be fixed a priori. Empirical results show that the number of features learned by the KAE technique remains reasonable compared to those fixed in the classical LSPI algorithm, and that convergence to an optimal or near-optimal policy is comparable to that of the other two methods.
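The least-squares fixed-point approximation that the abstract refers to can be sketched in a few lines. Given feature vectors of sampled state-action pairs, their successors under the current policy, and the observed rewards, LSPI solves a linear system for the Q-function weights. The random data, feature dimensions, and variable names below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 200, 4
gamma = 0.9  # discount factor

# Toy stand-ins for evaluated features:
# phi[i]      = features of sampled pair (s_i, a_i)
# phi_next[i] = features of (s'_i, pi(s'_i)) under the current policy
# r[i]        = observed reward
phi = rng.standard_normal((n_samples, n_features))
phi_next = rng.standard_normal((n_samples, n_features))
r = rng.standard_normal(n_samples)

# Least-squares fixed-point approximation: solve A w = b with
#   A = Phi^T (Phi - gamma * Phi'),  b = Phi^T r
A = phi.T @ (phi - gamma * phi_next)
b = phi.T @ r
w, *_ = np.linalg.lstsq(A, b, rcond=None)

# The Q-function is then the linear model Q(s, a) ~ phi(s, a) @ w,
# and the greedy policy w.r.t. Q defines the next iteration.
```

In KLSPI the columns of `phi` come from a kernel, and in the KAE-LSPI approach described above they would instead be produced by the learned encoder.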