The paper introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions. Adam is designed to be computationally efficient, to have low memory requirements, and to be invariant to diagonal rescaling of the gradients. It combines the advantages of AdaGrad (which works well with sparse gradients) and RMSProp (which works well with non-stationary objectives), making it well suited to large datasets and high-dimensional parameter spaces. The method maintains individual adaptive learning rates for different parameters, computed from estimates of the first and second moments of the gradients. The paper discusses the intuition behind the hyperparameters, provides a theoretical convergence analysis, and presents empirical results showing that Adam performs well in practice, outperforming other stochastic optimization methods across a variety of models and datasets. Additionally, the paper introduces AdaMax, a variant of Adam based on the infinity norm, which yields a simpler bound on the magnitude of parameter updates.
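As a concrete illustration of the moment-based update described above, here is a minimal NumPy sketch of a single Adam step. The default hyperparameters (alpha = 0.001, beta1 = 0.9, beta2 = 0.999, eps = 1e-8) follow the paper's recommended settings; the function name `adam_update` and the toy quadratic objective in the usage loop are illustrative choices, not part of the paper.

```python
import numpy as np

def adam_update(theta, grad, m, v, t,
                alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step: update the exponential moving averages of the gradient
    and its square, correct their initialization bias, and scale the step
    element-wise by the second-moment estimate."""
    m = beta1 * m + (1 - beta1) * grad        # biased first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2   # biased second raw-moment estimate
    m_hat = m / (1 - beta1 ** t)              # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)              # bias-corrected second moment
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy usage (hypothetical example): minimize f(theta) = ||theta||^2, gradient 2*theta.
theta = np.array([1.0, -2.0, 3.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for t in range(1, 1001):
    grad = 2.0 * theta
    theta, m, v = adam_update(theta, grad, m, v, t)
print(theta)  # approaches the minimum at the origin
```

The bias-correction terms (dividing by 1 - beta^t) compensate for the zero initialization of the moment estimates, which is the detail that distinguishes Adam's early steps from a plain RMSProp-with-momentum update.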