15 Jun 2018 | Niki Parmar*1, Ashish Vaswani*1, Jakob Uszkoreit1, Łukasz Kaiser1, Noam Shazeer1, Alexander Ku2,3, Dustin Tran4
The paper introduces the Image Transformer, a model architecture based on self-attention that generalizes the Transformer model for sequence modeling to image generation. By restricting the self-attention mechanism to attend to local neighborhoods, the model can process larger images while maintaining a significantly larger receptive field than typical convolutional neural networks. The Image Transformer outperforms existing state-of-the-art models in image generation on the ImageNet dataset, improving the negative log-likelihood from 3.83 to 3.77. The model also demonstrates superior performance on image super-resolution tasks, as shown in a human evaluation study where it fooled human observers three times more often than previous methods. The paper discusses the architecture, training, and experimental results, highlighting the effectiveness of local self-attention in balancing receptive field size and computational efficiency.
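To illustrate the local-attention idea in rough terms, the sketch below restricts each query position to attend only within its own block of the flattened pixel sequence, so cost grows linearly in sequence length rather than quadratically. This is a minimal sketch with assumed names and details: the function name, non-overlapping block partitioning, and random projection matrices are illustrative; the paper's actual 1D and 2D local attention also uses separate query and memory blocks, causal masking, and multiple heads.

```python
import numpy as np

def local_self_attention(x, block_size=8):
    """Toy 1D local self-attention over a flattened pixel sequence.

    x: (seq_len, d_model) array of pixel embeddings.
    Each position attends only to positions inside its own
    non-overlapping block of length `block_size`, instead of
    attending to the full sequence.
    """
    seq_len, d_model = x.shape
    rng = np.random.default_rng(0)
    # Toy projection matrices; a real model would learn these.
    w_q = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
    w_k = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
    w_v = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)

    out = np.zeros_like(x)
    for start in range(0, seq_len, block_size):
        blk = x[start:start + block_size]            # local neighborhood
        q, k, v = blk @ w_q, blk @ w_k, blk @ w_v
        scores = q @ k.T / np.sqrt(d_model)          # (block, block)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        out[start:start + block_size] = weights @ v
    return out

# Example: a 32x32 image flattened to a sequence of 1024 pixel embeddings.
pixels = np.random.default_rng(1).standard_normal((1024, 16))
attended = local_self_attention(pixels, block_size=8)
print(attended.shape)  # (1024, 16)
```

Even in this simplified form, the trade-off the paper exploits is visible: attention within each block remains content-dependent, while the per-position cost depends on the block size rather than on the full image size.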