27 May 2024 | Tristan Deleu, Padideh Nouri, Nikolay Malkin, Doina Precup, Yoshua Bengio
This paper studies discrete probabilistic inference as control in multi-path environments, focusing on the equivalence between Generative Flow Networks (GFlowNets) and Maximum Entropy Reinforcement Learning (MaxEnt RL). The authors show that GFlowNets can be viewed as a form of MaxEnt RL with a corrected reward function, and that several GFlowNet objectives are equivalent to well-known MaxEnt RL algorithms. In particular, they demonstrate that once the reward is corrected, the terminating state distribution induced by the optimal MaxEnt RL policy becomes proportional to the original reward, regardless of the structure of the underlying Markov Decision Process (MDP), that is, regardless of how many trajectories lead to the same terminating state.

This correction underpins the equivalence between GFlowNet objectives and MaxEnt RL algorithms: the Trajectory Balance objective corresponds to Path Consistency Learning (PCL), and the Modified Detailed Balance loss corresponds to Soft Q-Learning. Empirical results on multiple tasks, including probabilistic inference over discrete factor graphs, Bayesian structure learning, and phylogenetic tree generation, validate these equivalences.

The study highlights the importance of reward correction in ensuring that the terminating state distribution matches the desired Gibbs distribution, shows that the equivalence between GFlowNets and MaxEnt RL extends to more general settings, and discusses the implications of these findings for future research in probabilistic inference and reinforcement learning.
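To make the reward correction concrete, here is a minimal sketch, assuming a fixed backward policy $P_B$ over the GFlowNet's DAG, an entropy coefficient of 1, and no discounting; the symbols $\tilde{r}$, $s_f$, and $\pi^\star_\top$ are notational choices for this sketch rather than necessarily the paper's exact notation. The per-transition reward seen by the MaxEnt RL agent is

\begin{align*}
  \tilde{r}(s \to s') &= \log P_B(s \mid s') && \text{for an intermediate transition } s \to s',\\
  \tilde{r}(s \to s_f) &= \log R(s) && \text{for the terminating transition into the sink } s_f,
\end{align*}

so that the return of a complete trajectory $\tau = (s_0 \to \dots \to s_T \to s_f)$ telescopes to $\log R(s_T) + \log P_B(\tau \mid s_T)$. Since the optimal MaxEnt RL policy samples each trajectory with probability proportional to the exponentiated return, summing over all trajectories that end in the same terminating state $x$ gives

\[
  \pi^\star_\top(x) \;=\; \frac{R(x)}{\sum_{x'} R(x')},
\]

which is the target distribution a GFlowNet is trained to sample from, independently of how many paths of the MDP reach $x$. Without the $\log P_B$ terms, states reachable by many paths would be oversampled.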