🤖 AI Summary
This work proposes a model-free output feedback stabilization algorithm for discrete-time linear systems with unknown dynamics and access only to output measurements rather than full state observations. The approach introduces, for the first time, zeroth-order policy gradient updates into the output feedback control setting, enabling direct optimization of the controller policy using system trajectories and thereby circumventing the reliance on full-state feedback inherent in conventional policy gradient methods. Theoretical analysis establishes that the algorithm converges to a stationary point corresponding to a stabilizing policy and provides an explicit bound on its sample complexity. Numerical experiments further corroborate the efficacy of the proposed method.
📝 Abstract
Stabilizing a dynamical system is a fundamental problem that serves as a cornerstone for many complex tasks in the field of control systems. The problem becomes challenging when the system model is unknown. Among the Reinforcement Learning (RL) algorithms that have been successfully applied to problems involving unknown linear dynamical systems, the policy gradient (PG) method stands out due to its ease of implementation and its ability to solve the problem in a model-free manner. However, most existing works on PG methods for unknown linear dynamical systems assume full-state feedback. In this paper, we take a step towards model-free learning for partially observable linear dynamical systems with output feedback and focus on the fundamental problem of stabilizing the system. We propose an algorithmic framework that stretches the reach of PG methods to this problem, for which global convergence guarantees are not available. We show that by leveraging zeroth-order PG updates based on system trajectories, together with their convergence to stationary points, the proposed algorithm returns a stabilizing output feedback policy for discrete-time linear dynamical systems. We also explicitly characterize the sample complexity of our algorithm and verify its effectiveness using numerical examples.
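To make the core idea concrete, the following is a minimal sketch (not the paper's actual algorithm) of a two-point zeroth-order PG update for a static output-feedback gain: the learner only queries trajectory costs of a perturbed policy u = -K y, differences them to estimate a gradient, and descends. The system matrices, step size, perturbation radius, and horizon below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative system (unknown to the learner): x_{t+1} = A x_t + B u_t, y_t = C x_t.
A = np.array([[1.05]])   # open-loop unstable: spectral radius 1.05 > 1
B = np.array([[1.0]])
C = np.array([[1.0]])

def rollout_cost(K, T=10):
    """Finite-horizon quadratic cost of the static output-feedback law u = -K y,
    evaluated from a single trajectory (the only access the learner has)."""
    x = np.array([[1.0]])
    cost = 0.0
    for _ in range(T):
        u = -K @ (C @ x)
        cost += float((x.T @ x + u.T @ u).item())
        x = A @ x + B @ u
    return cost

def zeroth_order_step(K, r=0.05, eta=1e-3):
    """Two-point zeroth-order gradient estimate: perturb K along a random unit
    direction U and difference the two trajectory costs."""
    U = rng.standard_normal(K.shape)
    U /= np.linalg.norm(U)                       # uniform random direction
    d = K.size
    g = d * (rollout_cost(K + r * U) - rollout_cost(K - r * U)) / (2 * r) * U
    return K - eta * g

K = np.zeros((1, 1))                             # start from the unstable open loop
for _ in range(400):
    K = zeroth_order_step(K)

rho = float(np.max(np.abs(np.linalg.eigvals(A - B @ K @ C))))
print(rho)  # closed-loop spectral radius; below 1 means the policy stabilizes
```

The estimator never uses A, B, or C directly, which is the model-free ingredient; the analysis in the paper is what turns convergence to a stationary point of such updates into a stabilization guarantee with explicit sample complexity.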