4 Oct 2024 | Feiran Zhao, Florian Dörfler, Alessandro Chiuso, Keyou You
This paper proposes a direct adaptive method for learning the Linear Quadratic Regulator (LQR) from online closed-loop data. The method, called Data-Enabled Policy Optimization (DeePO), directly updates the control policy using a batch of persistently exciting (PE) data. Its key contribution is a new policy parameterization based on the sample covariance of the data, which is shown to be equivalent to certainty-equivalence LQR and to inherit its optimal non-asymptotic guarantees. Thanks to a projected gradient dominance property, the method converges globally, enabling efficient online adaptation. The average regret of the LQR cost is upper-bounded by two terms: a term decreasing sublinearly in time as $ \mathcal{O}(1/\sqrt{T}) $ and a bias scaling inversely with the signal-to-noise ratio (SNR). Because the approach is direct and online, with an explicit recursive update of the policy, it is also suitable for time-varying systems. The theoretical results show that DeePO achieves non-asymptotic guarantees independent of the noise statistics, with a sublinear convergence rate matching that of first-order methods in online convex optimization. Simulations validate the method and demonstrate its computational and sample efficiency: DeePO outperforms single-batch methods in sample efficiency, making it well suited to adaptive control from online closed-loop data.
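To make the covariance parameterization and the projected gradient update concrete, here is a minimal Python sketch. Everything numeric in it is an illustrative assumption, not from the paper: the system matrices `A` and `B`, the weights `Q` and `R`, the noise levels, the data length `T`, and the backtracking step rule (used here in lieu of the paper's step-size analysis). For brevity the sketch also optimizes over a single fixed data batch, whereas DeePO recursively updates the policy online as new closed-loop samples arrive.

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

rng = np.random.default_rng(0)

# Hypothetical system and LQR weights (illustrative, not from the paper).
A = np.array([[0.9, 0.2], [0.0, 0.9]])
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)
n, m = 2, 1

# Collect a batch of persistently exciting (PE) data: a white-noise
# probing input drives the plant, which is perturbed by process noise.
T = 500
x = np.zeros((n, 1))
U, X, Xp = [], [], []
for _ in range(T):
    u = rng.normal(size=(m, 1))
    xp = A @ x + B @ u + 0.05 * rng.normal(size=(n, 1))
    U.append(u); X.append(x); Xp.append(xp)
    x = xp
U0, X0, X1 = np.hstack(U), np.hstack(X), np.hstack(Xp)

# Sample-covariance parameterization: with D = [U0; X0] and
# Phi = D D^T / T = [Ubar; Xbar], the gain is K = Ubar @ V
# under the affine constraint Xbar @ V = I.
D = np.vstack([U0, X0])
Phi = D @ D.T / T
Ubar, Xbar = Phi[:m, :], Phi[m:, :]
Xbar_p = X1 @ D.T / T  # then Xbar_p @ V ~= A + B K (certainty equivalence)

def lqr_cost(V):
    """Data-based LQR cost; infinite if the closed loop is unstable."""
    K, Acl = Ubar @ V, Xbar_p @ V
    if max(abs(np.linalg.eigvals(Acl))) >= 1:
        return np.inf
    Sigma = solve_discrete_lyapunov(Acl, np.eye(n))  # state covariance
    return float(np.trace((Q + K.T @ R @ K) @ Sigma))

# Feasible, stabilizing initialization: solve Phi @ V = [K0; I] with
# K0 = 0 (the example A is stable, so the zero gain stabilizes it).
V = np.linalg.solve(Phi, np.vstack([np.zeros((m, n)), np.eye(n)]))

# Projector onto null(Xbar): keeps the constraint Xbar @ V = I invariant.
Pi = np.eye(m + n) - Xbar.T @ np.linalg.solve(Xbar @ Xbar.T, Xbar)

for _ in range(100):
    K, Acl = Ubar @ V, Xbar_p @ V
    Sigma = solve_discrete_lyapunov(Acl, np.eye(n))
    P = solve_discrete_lyapunov(Acl.T, Q + K.T @ R @ K)  # cost-to-go matrix
    # Projected policy gradient of the data-based LQR cost.
    G = Pi @ (2 * (Ubar.T @ R @ K + Xbar_p.T @ P @ Acl) @ Sigma)
    eta = 1e-2  # shrink until the step actually decreases the cost
    while eta > 1e-12 and lqr_cost(V - eta * G) >= lqr_cost(V):
        eta /= 2
    V = V - eta * G

print("learned gain K =", Ubar @ V)
```

With PE data and enough iterations, the learned gain `Ubar @ V` should approach the certainty-equivalence LQR gain up to a noise-induced bias, consistent with the SNR term in the regret bound; the projection `Pi` is what makes the gradient dominance property (and hence global convergence) applicable on the constraint set.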