Data-Enabled Policy Optimization for Direct Adaptive Learning of the LQR


4 Oct 2024 | Feiran Zhao, Florian Dörfler, Alessandro Chiuso, Keyou You
This paper proposes a direct adaptive method to learn the Linear Quadratic Regulator (LQR) from online closed-loop data. The key contributions are:

1. **New Policy Parameterization**: A novel policy parameterization based on sample covariance is introduced, which is equivalent to the certainty-equivalence LQR and has a constant dimension independent of the data length. This parameterization enables efficient use of data and implicit regularization.
2. **DeePO Method**: A data-enabled policy optimization (DeePO) method is designed to update the policy directly from online closed-loop data. The gradient is computed using only a batch of persistently exciting (PE) data, and the method is shown to converge globally via a projected gradient dominance property (a minimal sketch of one such update is given after this list).
3. **Adaptive Learning**: DeePO is used to adaptively learn the optimal LQR gain from online closed-loop data. The approach is direct, online, and has an explicit recursive update of the policy. It extends to time-varying systems by adding a forgetting factor to the covariance parameterization.
4. **Non-Asymptotic Guarantees**: Non-asymptotic convergence guarantees are provided for DeePO, showing that the average regret of the LQR cost is upper-bounded by two terms: a sublinear decrease in time \(\mathcal{O}(1/\sqrt{T})\) and a bias scaling inversely with the signal-to-noise ratio (SNR). These results improve over single-batch methods and align with the convergence rates of first-order methods in online convex optimization.
5. **Computational and Sample Efficiency**: Simulations validate the global convergence of DeePO and demonstrate its computational and sample efficiency compared with indirect adaptive approaches and zeroth-order policy optimization (PO).

Overall, the proposed method offers a promising solution to the open problem of direct, online adaptive control of the LQR, achieving efficient and recursive policy updates from online closed-loop data.
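To make the covariance parameterization and the projected-gradient update concrete, here is a minimal Python/NumPy sketch of one plausible iteration. It is not the authors' code: the variable names, the noiseless certainty-equivalence treatment of the closed-loop matrix, the identity initial-state covariance, and the specific forgetting-factor weighting are all simplifying assumptions made here for illustration.

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

# Hypothetical problem sizes and LQR weights (illustrative, not from the paper).
n, m = 3, 2                        # state and input dimensions
Q, R = np.eye(n), np.eye(m)        # LQR cost weights

def covariance_data(U0, X0, X1):
    """Sample-covariance representation of a length-t data batch.

    U0: m x t inputs, X0: n x t states, X1: n x t successor states.
    The returned matrices have fixed size (m or n) x (m + n),
    independent of the data length t.
    """
    t = U0.shape[1]
    D = np.vstack([U0, X0])                     # (m+n) x t regressors
    return U0 @ D.T / t, X0 @ D.T / t, X1 @ D.T / t

def projected_gradient(V, Ubar, Xbar, X1bar):
    """LQR cost of the gain K = Ubar @ V and its gradient projected onto
    the affine constraint set {V : Xbar @ V = I_n}.

    X1bar @ V is treated as the (certainty-equivalence) closed-loop matrix;
    with noisy data this identification is only approximate.
    """
    K = Ubar @ V                                # m x n feedback gain
    Acl = X1bar @ V                             # n x n closed-loop matrix
    # Value and state-covariance matrices from discrete Lyapunov equations
    # (identity initial-state covariance assumed).
    P = solve_discrete_lyapunov(Acl.T, Q + K.T @ R @ K)
    S = solve_discrete_lyapunov(Acl, np.eye(n))
    cost = np.trace(P)
    grad = 2.0 * (Ubar.T @ R @ K + X1bar.T @ P @ Acl) @ S
    # Project onto the nullspace of Xbar so the constraint Xbar @ V = I is kept.
    Pi = np.eye(m + n) - Xbar.T @ np.linalg.solve(Xbar @ Xbar.T, Xbar)
    return cost, Pi @ grad

def deepo_step(V, Ubar, Xbar, X1bar, eta=1e-2):
    """One projected-gradient policy update from the current data covariances."""
    _, g = projected_gradient(V, Ubar, Xbar, X1bar)
    return V - eta * g

def recursive_covariance(Ubar, Xbar, X1bar, u, x, x_next, t, gamma=1.0):
    """Rank-one recursive update with a new closed-loop sample (u, x, x_next).

    gamma < 1 discounts old data (a forgetting factor for time-varying
    systems); the weighting below is one plausible choice, not necessarily
    the paper's.
    """
    d = np.concatenate([u, x])                  # (m+n,) regressor
    a, b = gamma * t / (t + 1.0), 1.0 / (t + 1.0)
    return (a * Ubar + b * np.outer(u, d),
            a * Xbar + b * np.outer(x, d),
            a * X1bar + b * np.outer(x_next, d))
```

Under these assumptions, a natural initialization is to lift a known stabilizing gain `K0` into the lifted parameter, e.g. `V0 = np.linalg.pinv(np.vstack([Ubar, Xbar])) @ np.vstack([K0, np.eye(n)])`, and then alternate `recursive_covariance` and `deepo_step` as new closed-loop samples arrive. Note that every matrix in the loop has dimensions set by m and n rather than by the data length, which is what keeps the per-step computation constant as data accumulates, the property the summary highlights.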