Perceiver: General Perception with Iterative Attention

23 Jun 2021 | Andrew Jaegle, Felix Gimeno, Andrew Brock, Andrew Zisserman, Oriol Vinyals, Joao Carreira
The Perceiver is a Transformer-based model designed to handle arbitrary configurations of different modalities. It uses an asymmetric attention mechanism to iteratively distill inputs into a tight latent bottleneck, which allows it to scale to very large inputs. The model is competitive with or outperforms strong, specialized models on classification tasks across images, point clouds, audio, video, and video+audio. It matches the performance of ResNet-50 and ViT on ImageNet without using 2D convolutions, by attending directly to 50,000 pixels, and it is competitive across all modalities on AudioSet.

The architecture uses a cross-attention module to project a high-dimensional input byte array onto a fixed-dimensional latent bottleneck, then processes the result with a deep stack of Transformer-style self-attention blocks in the latent space. By alternating cross-attention and latent self-attention blocks, the model iteratively attends to the input byte array, channeling its limited capacity toward the most relevant inputs, informed by previous steps. The model can also be viewed as performing a fully end-to-end clustering of the inputs, with the latent positions acting as cluster centres. Because the architecture handles a wide range of inputs out of the box, even when they come from very different modalities, including high-bandwidth ones such as images and audio, it can be applied to diverse input data with essentially no architectural changes.

The model uses Fourier features for position encoding, which represent the position structure of the input while preserving 1D temporal structure for audio, 2D spatial structure for images, or 3D spatiotemporal structure for video. The Perceiver is evaluated on ImageNet, AudioSet, and ModelNet40. On ImageNet, it achieves results competitive with models designed specifically for processing images.
On AudioSet, it achieves results comparable to the state of the art, and on ModelNet40 it is competitive with specialized models. In short, the Perceiver is a general perception model, flexible in its design: it can handle arbitrary sensor configurations and enables fusion of information at all levels.
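The alternating cross-attention / latent self-attention scheme described above can be sketched in a few lines of NumPy. Everything here is illustrative, not the paper's actual configuration: single-head attention, random untrained projection weights, arbitrary depth, and a byte array kept small (the real model attends to roughly 50,000 pixels):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q_in, kv_in, d, seed=0):
    # Single-head scaled dot-product attention with random (untrained)
    # projections -- a stand-in for a learned attention layer.
    rng = np.random.default_rng(seed)
    Wq = rng.normal(size=(q_in.shape[-1], d)) / np.sqrt(q_in.shape[-1])
    Wk = rng.normal(size=(kv_in.shape[-1], d)) / np.sqrt(kv_in.shape[-1])
    Wv = rng.normal(size=(kv_in.shape[-1], d)) / np.sqrt(kv_in.shape[-1])
    Q, K, V = q_in @ Wq, kv_in @ Wk, kv_in @ Wv
    return softmax(Q @ K.T / np.sqrt(d)) @ V

M, C = 4096, 3   # byte array: M inputs with C channels (kept small here)
N, D = 512, 64   # latent array: N << M is the bottleneck

rng = np.random.default_rng(1)
byte_array = rng.normal(size=(M, C))
latent = rng.normal(size=(N, D))

for _ in range(2):  # alternate blocks; depth is illustrative
    latent = attention(latent, byte_array, D)  # cross-attention: cost O(M*N), not O(M^2)
    latent = attention(latent, latent, D)      # latent self-attention: cost O(N^2)

print(latent.shape)  # (512, 64): latent size is independent of the input size M
```

The key point the sketch shows is the asymmetry: only the cross-attention step touches the full input, at cost linear in M, so the deep self-attention stack runs entirely in the small latent space regardless of how large the input is.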
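The Fourier-feature position encoding can likewise be sketched. The band count and maximum frequency below are arbitrary illustrative choices, not the paper's per-modality settings:

```python
import numpy as np

def fourier_features(pos, num_bands, max_freq):
    # Map positions in [-1, 1] to sin/cos features at linearly spaced
    # frequencies, and append the raw position as an extra feature.
    freqs = np.linspace(1.0, max_freq / 2.0, num_bands)
    angles = np.pi * pos[..., None] * freqs        # (..., num_bands)
    return np.concatenate([np.sin(angles), np.cos(angles), pos[..., None]], axis=-1)

# 1D case (e.g. audio): 8 sample positions -> 4 sin + 4 cos + 1 raw = 9 features each.
pos = np.linspace(-1.0, 1.0, 8)
feats = fourier_features(pos, num_bands=4, max_freq=16)
print(feats.shape)  # (8, 9)
```

For images or video, the same encoding is applied independently to each spatial (or spatiotemporal) coordinate and the per-dimension features are concatenated with the input channels, which is how 2D or 3D structure is exposed to the otherwise permutation-invariant attention.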