Off-Policy Evaluation in Markov Decision Processes under Weak Distributional Overlap


February 14, 2024 | Mohammad Mehrabi, Stefan Wager
This paper revisits off-policy evaluation in Markov Decision Processes (MDPs) under a weaker distributional overlap assumption, which is more realistic in many practical scenarios. The authors introduce a class of truncated doubly robust (TDR) estimators that perform well when the distribution ratio of the target and data-collection policies is square-integrable. When this ratio is not square-integrable, TDR remains consistent but with a slower convergence rate. The paper also shows that TDR achieves the minimax rate of convergence for a class of off-policy evaluation problems characterized by mixing conditions. Numerical experiments validate the effectiveness of TDR, highlighting the importance of appropriate truncation in enabling accurate off-policy evaluation when strong distributional overlap does not hold.
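To make the idea of truncation concrete, below is a minimal sketch of a generic doubly robust off-policy value estimate in which the estimated density ratios are clipped at a truncation level. This is an illustration of the general technique, not the paper's exact TDR estimator: the function name `tdr_estimate`, its arguments, and the `(1 - gamma)` normalization of the value are assumptions made for this sketch.

```python
import numpy as np

def tdr_estimate(w_hat, q_sa, v_next, v_init, rewards, gamma, trunc_level):
    """Illustrative truncated doubly robust value estimate (sketch only).

    w_hat       : estimated density ratios of target vs. behavior policy at (S_i, A_i)
    q_sa        : estimated Q-values of the target policy at (S_i, A_i)
    v_next      : estimated values of the target policy at next states S'_i
    v_init      : estimated values of the target policy at sampled initial states
    rewards     : observed rewards R_i
    gamma       : discount factor
    trunc_level : truncation threshold applied to the estimated density ratios
    """
    # Truncate the estimated density ratios to control variance when
    # overlap between the target and data-collection policies is weak.
    w_trunc = np.minimum(w_hat, trunc_level)

    # Doubly robust correction term: reweighted Bellman residuals of the
    # fitted Q-function, using the truncated ratios.
    correction = w_trunc * (rewards + gamma * v_next - q_sa)

    # Plug-in term from the fitted value function at initial states
    # (normalized discounted-value convention assumed here).
    plug_in = (1.0 - gamma) * np.mean(v_init)

    return plug_in + np.mean(correction)
```

The truncation level trades bias for variance: clipping large ratios biases the correction term but keeps its variance bounded, which is what makes the estimator usable when the density ratio is not square-integrable, at the cost of the slower convergence rate noted in the abstract.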