Off-Policy Evaluation in Markov Decision Processes under Weak Distributional Overlap

February 14, 2024 | Mohammad Mehrabi, Stefan Wager
This paper studies off-policy evaluation in Markov decision processes (MDPs) under a weakened distributional overlap assumption. The authors introduce a class of truncated doubly robust (TDR) estimators that perform well in this setting. When the ratio of the state-action occupancy distributions induced by the target and data-collection policies is square-integrable, the TDR estimator matches the guarantees of the standard doubly robust (DR) estimator under strong distributional overlap; when this ratio is not square-integrable, TDR remains consistent, albeit at a slower convergence rate. The authors further show that TDR attains the minimax rate of convergence over a class of MDPs defined only through mixing conditions. Numerical experiments confirm the effectiveness of TDR and show that appropriate truncation is crucial when the distribution ratio takes on large values. The paper also surveys related work and provides theoretical guarantees for TDR in both the discounted and long-run average reward settings.
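To make the role of truncation concrete, below is a minimal sketch of a doubly robust estimator with truncated marginalized importance weights for the discounted setting. This is not the paper's exact construction, only one common DR form with the weight-truncation idea added; all names here (w_hat, q_hat, v_hat, v0_hat, trunc) are hypothetical placeholders for estimated nuisance components.

```python
import numpy as np

def tdr_estimate(transitions, w_hat, q_hat, v_hat, v0_hat, gamma, trunc):
    """Sketch of a truncated doubly robust (TDR) estimate of discounted value.

    transitions : list of (s, a, r, s_next) sampled under the behavior policy,
                  with (s, a) roughly drawn from its discounted occupancy.
    w_hat       : estimated occupancy ratio d_target(s, a) / d_behavior(s, a).
    q_hat       : fitted Q-function for the target policy.
    v_hat       : fitted state-value function, e.g. E_{a ~ target}[q_hat(s, a)].
    v0_hat      : plug-in estimate of E_{s0 ~ initial dist}[v_hat(s0)].
    gamma       : discount factor in (0, 1).
    trunc       : truncation level applied to the estimated occupancy ratio.
    """
    corrections = []
    for s, a, r, s_next in transitions:
        # Truncation keeps the variance bounded under weak overlap, where the
        # occupancy ratio can take very large (even non-square-integrable)
        # values; it trades a small bias for this variance reduction.
        w = min(w_hat(s, a), trunc)
        # Bellman residual of the fitted Q-function on the observed transition;
        # it has mean zero when q_hat equals the target policy's Q-function.
        corrections.append(w * (r + gamma * v_hat(s_next) - q_hat(s, a)))
    # Plug-in value plus the truncated, occupancy-weighted correction term.
    return v0_hat + np.mean(corrections) / (1.0 - gamma)
```

In this sketch the truncation level is a user-supplied parameter; the paper's theory presumably dictates how it should scale with the sample size and the degree of overlap to balance the bias introduced by clipping against the variance of the untruncated weights.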