Discrete Probabilistic Inference as Control in Multi-path Environments

27 May 2024 | Tristan Deleu, Padideh Nouri, Nikolay Malkin, Doina Precup, Yoshua Bengio
This paper frames sampling from discrete and structured distributions as a sequential decision-making problem. The authors address the bias in the distribution induced by the optimal policy in Maximum Entropy Reinforcement Learning (MaxEnt RL) when the same object can be generated in multiple ways. They build on Generative Flow Networks (GFlowNets), which learn a stochastic policy that samples objects in proportion to their rewards by enforcing conservation of flows across the Markov Decision Process (MDP). The paper extends recent methods that correct the reward function so that the marginal distribution induced by the optimal MaxEnt RL policy is proportional to the original reward, regardless of the structure of the MDP. The authors also prove that certain flow-matching objectives from the GFlowNet literature are equivalent to well-established MaxEnt RL algorithms applied to the corrected reward. Empirical studies on several problems involving sampling from discrete distributions compare the performance of multiple MaxEnt RL and GFlowNet algorithms. Overall, the paper highlights the connections between GFlowNets and MaxEnt RL, providing a unified perspective on probabilistic inference over large-scale discrete and structured spaces.
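
To make the path-multiplicity bias concrete, here is a minimal sketch in our own notation (not taken verbatim from the paper): in a deterministic, DAG-structured MDP where an object $x$ can be reached by $n(x)$ distinct trajectories, a MaxEnt RL policy trained with only a terminal reward $\log R(x)$ samples $x$ with probability proportional to $n(x)\,R(x)$ rather than $R(x)$. One standard form of the reward correction alluded to above fixes a backward policy $P_B$ and adds an intermediate reward $\log P_B(s \mid s')$ on every transition $s \to s'$, so that a full trajectory $\tau$ terminating in $x$ receives

$$ r(\tau) = \log R(x) + \sum_{(s \to s') \in \tau} \log P_B(s \mid s'). $$

The optimal MaxEnt RL policy then samples $\tau$ with probability proportional to $\exp r(\tau) = R(x) \prod_{(s \to s') \in \tau} P_B(s \mid s')$, and because the backward probabilities of all trajectories leading to a given $x$ sum to one, the induced marginal over terminal objects is proportional to $R(x)$ for any DAG structure.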