16 Nov 2017 | Steven J. Rennie, Etienne Marcheret, Youssef Mroueh, Jerret Ross and Vaibhava Goel
This paper introduces self-critical sequence training (SCST), a new approach to image captioning that improves performance by directly optimizing non-differentiable test metrics such as CIDEr. SCST is a variant of the REINFORCE algorithm that uses the output of its own test-time inference procedure to normalize the rewards it experiences, avoiding the need to estimate a separate baseline and reducing gradient variance. The method is applied to captioning systems built on recurrent neural networks (RNNs) with attention mechanisms. Because training directly optimizes the same decoding procedure used at test time, the resulting models generalize better.

The method is evaluated on the MSCOCO dataset, where SCST significantly improves performance and achieves a new state-of-the-art CIDEr score. The approach is most effective when combined with greedy decoding at test time, and it outperforms alternatives such as MIXER and REINFORCE with learned baselines. SCST trains efficiently with mini-batches and stochastic gradient descent, and it produces more accurate and detailed captions, especially for images with objects in uncommon contexts. The paper also discusses the benefits of attention models and of ensembling multiple models to further improve performance.
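The self-critical training signal described above can be sketched in miniature. The toy below is an illustrative assumption, not the paper's implementation: the "captioner" is a single categorical policy over three made-up words, and `reward` is a hypothetical stand-in for a sentence-level metric such as CIDEr. It shows the core SCST idea: sample from the policy, greedily decode as the baseline, and weight the REINFORCE gradient by the difference `reward(sample) - reward(greedy)`.

```python
import math
import random

# Toy "captioner": a categorical policy over a tiny vocabulary,
# parameterized by per-token logits (hypothetical stand-in for an RNN decoder).
logits = {"cat": 0.0, "dog": 0.0, "car": 0.0}

def probs(logits):
    # Softmax over the logits.
    z = sum(math.exp(v) for v in logits.values())
    return {w: math.exp(v) / z for w, v in logits.items()}

def reward(word, reference="cat"):
    # Hypothetical stand-in for a sentence-level metric such as CIDEr.
    return 1.0 if word == reference else 0.0

def scst_step(logits, lr=0.5, rng=random.Random(0)):
    p = probs(logits)
    # Greedy (test-time) decode serves as the baseline: b = reward(greedy).
    greedy = max(p, key=p.get)
    # Sample a "caption" from the current policy.
    words, weights = zip(*p.items())
    sampled = rng.choices(words, weights=weights)[0]
    # Self-critical advantage: reward of the sample minus reward of the
    # greedy decode. Samples that beat the test-time output are reinforced;
    # samples that fall short are suppressed.
    advantage = reward(sampled) - reward(greedy)
    # REINFORCE update: advantage times the gradient of log p(sampled)
    # with respect to the logits (the standard softmax gradient).
    for w in logits:
        grad = (1.0 if w == sampled else 0.0) - p[w]
        logits[w] += lr * advantage * grad
    return sampled, greedy, advantage

rng = random.Random(0)
for _ in range(200):
    scst_step(logits, rng=rng)
```

Note that no separate baseline network is trained: the model "critiques" itself by comparing sampled rollouts against its own greedy decode, which is exactly the quantity that matters at test time.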