21 May 2018 | Mohammadreza Nazari, Afshin Oroojlooy, Martin Takáč, Lawrence V. Snyder
This paper presents an end-to-end framework for solving the Vehicle Routing Problem (VRP) using reinforcement learning (RL). The approach trains a single model to find near-optimal solutions for problem instances sampled from a given distribution, using only reward signals and feasibility rules. The model represents a parameterized stochastic policy, and a policy gradient algorithm optimizes its parameters to produce solutions as a sequence of consecutive actions in real-time, without retraining for new problem instances. The method outperforms classical heuristics and Google’s OR-Tools on medium-sized capacitated VRP instances in terms of solution quality with comparable computation time. The framework can handle problems with split deliveries and is applicable to other VRP variants, such as the stochastic VRP, and potentially to broader combinatorial optimization problems. The paper includes background on sequence-to-sequence models and neural combinatorial optimization, and discusses the limitations of Pointer Networks. The proposed model consists of an RNN decoder and an attention mechanism, which allows the model to handle both static and dynamic elements of the system. Numerical experiments demonstrate the framework's superior performance compared to classical heuristics and OR-Tools, and its ability to handle stochastic VRP instances with changing customer demands.This paper presents an end-to-end framework for solving the Vehicle Routing Problem (VRP) using reinforcement learning (RL). The approach trains a single model to find near-optimal solutions for problem instances sampled from a given distribution, using only reward signals and feasibility rules. The model represents a parameterized stochastic policy, and a policy gradient algorithm optimizes its parameters to produce solutions as a sequence of consecutive actions in real-time, without retraining for new problem instances. The method outperforms classical heuristics and Google’s OR-Tools on medium-sized capacitated VRP instances in terms of solution quality with comparable computation time. The framework can handle problems with split deliveries and is applicable to other VRP variants, such as the stochastic VRP, and potentially to broader combinatorial optimization problems. The paper includes background on sequence-to-sequence models and neural combinatorial optimization, and discusses the limitations of Pointer Networks. The proposed model consists of an RNN decoder and an attention mechanism, which allows the model to handle both static and dynamic elements of the system. Numerical experiments demonstrate the framework's superior performance compared to classical heuristics and OR-Tools, and its ability to handle stochastic VRP instances with changing customer demands.