Perceiver: General Perception with Iterative Attention

23 Jun 2021 | Andrew Jaegle, Felix Gimeno, Andrew Brock, Andrew Zisserman, Oriol Vinyals, Joao Carreira
The Perceiver is a Transformer-based model designed to handle arbitrary configurations of different modalities. It uses an asymmetric attention mechanism to iteratively distill inputs into a tight latent bottleneck, which allows it to scale to very large inputs. The model is competitive with or outperforms strong, specialized models on classification tasks across images, point clouds, audio, video, and video+audio. It matches the performance of ResNet-50 and ViT on ImageNet without using 2D convolutions, by attending directly to 50,000 pixels, and it is competitive across all modalities on AudioSet.

The architecture uses a cross-attention module to project a high-dimensional input byte array onto a fixed-dimensional latent bottleneck, then processes the result with a deep stack of Transformer-style self-attention blocks in the latent space. By alternating cross-attention and latent self-attention blocks, the model iteratively attends to the input byte array, channeling its limited capacity toward the most relevant inputs, informed by previous steps. The model can also be viewed as performing a fully end-to-end clustering of the inputs, with the latent positions acting as cluster centres. Because the architecture handles a wide range of inputs out of the box, even when they come from very different modalities, including high-bandwidth ones such as images and audio, it can be applied to diverse input data with essentially no architectural changes.

The model uses Fourier features for position encoding, which represent the position structure of the input while preserving 1D temporal structure for audio, 2D spatial structure for images, or 3D spatiotemporal structure for video. The Perceiver is evaluated on ImageNet, AudioSet, and ModelNet40. On ImageNet, it achieves results competitive with models designed specifically for processing images.
On AudioSet, it achieves results comparable to the state of the art, and on ModelNet40 it is competitive with specialized models. In short, the Perceiver is a general perception model, flexible in its design: it can handle arbitrary sensor configurations and enables fusion of information at all levels.
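The alternating cross-attention / latent self-attention scheme described above can be sketched in a few lines of NumPy. Everything here is illustrative, not the paper's actual configuration: single-head attention, random untrained projection weights, arbitrary depth, and a byte array kept small (the real model attends to roughly 50,000 pixels):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q_in, kv_in, d, seed=0):
    # Single-head scaled dot-product attention with random (untrained)
    # projections -- a stand-in for a learned attention layer.
    rng = np.random.default_rng(seed)
    Wq = rng.normal(size=(q_in.shape[-1], d)) / np.sqrt(q_in.shape[-1])
    Wk = rng.normal(size=(kv_in.shape[-1], d)) / np.sqrt(kv_in.shape[-1])
    Wv = rng.normal(size=(kv_in.shape[-1], d)) / np.sqrt(kv_in.shape[-1])
    Q, K, V = q_in @ Wq, kv_in @ Wk, kv_in @ Wv
    return softmax(Q @ K.T / np.sqrt(d)) @ V

M, C = 4096, 3   # byte array: M inputs with C channels (kept small here)
N, D = 512, 64   # latent array: N << M is the bottleneck

rng = np.random.default_rng(1)
byte_array = rng.normal(size=(M, C))
latent = rng.normal(size=(N, D))

for _ in range(2):  # alternate blocks; depth is illustrative
    latent = attention(latent, byte_array, D)  # cross-attention: cost O(M*N), not O(M^2)
    latent = attention(latent, latent, D)      # latent self-attention: cost O(N^2)

print(latent.shape)  # (512, 64): latent size is independent of the input size M
```

The key point the sketch shows is the asymmetry: only the cross-attention step touches the full input, at cost linear in M, so the deep self-attention stack runs entirely in the small latent space regardless of how large the input is.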
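The Fourier-feature position encoding can likewise be sketched. The band count and maximum frequency below are arbitrary illustrative choices, not the paper's per-modality settings:

```python
import numpy as np

def fourier_features(pos, num_bands, max_freq):
    # Map positions in [-1, 1] to sin/cos features at linearly spaced
    # frequencies, and append the raw position as an extra feature.
    freqs = np.linspace(1.0, max_freq / 2.0, num_bands)
    angles = np.pi * pos[..., None] * freqs        # (..., num_bands)
    return np.concatenate([np.sin(angles), np.cos(angles), pos[..., None]], axis=-1)

# 1D case (e.g. audio): 8 sample positions -> 4 sin + 4 cos + 1 raw = 9 features each.
pos = np.linspace(-1.0, 1.0, 8)
feats = fourier_features(pos, num_bands=4, max_freq=16)
print(feats.shape)  # (8, 9)
```

For images or video, the same encoding is applied independently to each spatial (or spatiotemporal) coordinate and the per-dimension features are concatenated with the input channels, which is how 2D or 3D structure is exposed to the otherwise permutation-invariant attention.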