31 January 2024 | Wenfeng Zheng, Siyu Lu, Youshuai Yang, Zhengtong Yin and Lirong Yin
This study addresses the quadratic complexity issue of Transformer models in image feature extraction, which hinders their ability to process high-resolution images and increases computational costs. To tackle this, two approaches are proposed: a linear attention mechanism and a parameter-less lightweight pruning method. The linear attention mechanism reduces the complexity of the self-attention mechanism from quadratic to linear by approximating the Softmax operator with a combination function that ensures non-negativity and non-linear reweighting. The pruning method adaptively samples input images to filter out unimportant tokens, reducing irrelevant input. These methods are combined to create an efficient attention mechanism (e-attention). Experimental results on the ImageNet1k and COCO datasets demonstrate that the combined methods reduce computation by 30%–50% for the linear attention mechanism and 60%–70% for the e-attention mechanism, while maintaining or slightly improving model performance. The study highlights the effectiveness of these techniques in accelerating Transformer models for image feature extraction.
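To make the two ideas concrete, the sketch below illustrates the general recipe the abstract describes: replace the Softmax kernel with a non-negative feature map so the attention products can be reassociated (linear rather than quadratic cost in the number of tokens), and drop low-importance tokens before attention is applied. This is a minimal NumPy illustration under stated assumptions: the `phi` feature map and the norm-based token score used here are placeholders, not the paper's specific combination function or adaptive sampling scheme.

```python
import numpy as np

def softmax_attention(Q, K, V):
    # Standard attention: forms an (n, n) score matrix, so cost grows quadratically with n.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)
    return weights @ V

def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0.0) + 1e-6):
    # Kernelized approximation: replace softmax(Q K^T) with phi(Q) phi(K)^T,
    # where phi keeps values non-negative, then reassociate the products so
    # only (d, d) summaries are formed -- cost is O(n * d^2) instead of O(n^2 * d).
    Qp, Kp = phi(Q), phi(K)
    KV = Kp.T @ V                                   # (d, d) key-value summary
    Z = Qp @ Kp.sum(axis=0, keepdims=True).T        # (n, 1) normalizer
    return (Qp @ KV) / Z

def prune_tokens(X, keep_ratio=0.5):
    # Illustrative parameter-free pruning: score each token by a simple
    # saliency proxy (feature norm) and keep the top fraction, preserving
    # the original token order. The paper's adaptive sampling may differ.
    scores = np.linalg.norm(X, axis=-1)
    k = max(1, int(keep_ratio * X.shape[0]))
    keep = np.sort(np.argsort(-scores)[:k])
    return X[keep]

# Example pipeline in the spirit of "e-attention": prune tokens, then apply
# linear attention to the remaining ones (self-attention, so Q = K = V = tokens).
rng = np.random.default_rng(0)
tokens = rng.standard_normal((196, 64))             # e.g. 14x14 image patches, d = 64
kept = prune_tokens(tokens, keep_ratio=0.5)
out = linear_attention(kept, kept, kept)
```

Combining the two steps is where the larger savings come from: pruning shrinks the token count n before attention runs, and the kernelized form keeps the remaining cost linear in that reduced n.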