14 Oct 2021 | NICO ENGEL, VASILEIOS BELAGIANNIS, KLAUS DIETMAYER
Point Transformer is a deep neural network that directly processes unordered and unstructured point sets. It extracts local and global features and relates them using a local-global attention mechanism to capture spatial relations and shape information. To achieve permutation invariance, SortNet is introduced, which selects points based on a learned score to generate a sorted and permutation invariant feature list. This feature list can be directly used in computer vision tasks. The network is evaluated on standard benchmarks for classification and part segmentation, showing competitive results compared to prior work. The code is publicly available.
The network is designed to handle 3D point sets by using a multi-head attention mechanism, which is adapted for point processing. SortNet is a key component that induces permutation invariance by selecting points based on a learned score. The output of SortNet is used to generate local features, which are then related to global features using local-global attention. This allows the network to capture geometric relations and shape information, resulting in a permutation invariant and ordered feature representation.
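To make the attention mechanism concrete, here is a minimal NumPy sketch of scaled dot-product multi-head attention over point features. It omits the learned query/key/value projection matrices and the output projection of a full transformer layer, so it is an illustration of the core computation, not the paper's exact implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(Q, K, V, num_heads):
    """Scaled dot-product attention split across heads.

    Q: (n_q, d) queries, K/V: (n_kv, d) keys and values.
    d must be divisible by num_heads. Learned projections omitted.
    """
    n_q, d = Q.shape
    d_h = d // num_heads
    out = np.empty((n_q, d))
    for h in range(num_heads):
        s = slice(h * d_h, (h + 1) * d_h)
        scores = Q[:, s] @ K[:, s].T / np.sqrt(d_h)   # (n_q, n_kv)
        out[:, s] = softmax(scores) @ V[:, s]         # weighted sum of values
    return out
```

Used as self-attention (Q = K = V = the point features), the operation is permutation *equivariant*: shuffling the input rows shuffles the output rows the same way, which is why an extra mechanism such as SortNet is needed to obtain a permutation *invariant* output.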
The Point Transformer architecture consists of two branches: a local feature generation module (SortNet) and a global feature extraction network. The local branch generates ordered local features, while the global branch extracts global features. These features are then combined using local-global attention to produce a permutation invariant and ordered representation. This representation can be used for various visual tasks such as shape classification and part segmentation.
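The two-branch dataflow can be sketched end to end with toy stand-ins: a norm-based score replaces SortNet's learned score, mean pooling replaces the global feature network, and concatenation replaces local-global attention. The point is the structure, not the layers:

```python
import numpy as np

def toy_two_branch(points, k=4):
    """Toy two-branch pipeline (illustrative stand-ins, not the paper's layers).

    Local branch: keep the k points with the largest norm, sorted by score,
    as a stand-in for SortNet's learned selection.
    Global branch: mean over all points as a stand-in global feature.
    Fusion: concatenate each local feature with the global one.
    """
    scores = np.linalg.norm(points, axis=1)              # stand-in score per point
    order = np.argsort(-scores)[:k]                      # top-k, in score order
    local = points[order]                                # (k, d) ordered local feats
    global_feat = points.mean(axis=0)                    # (d,) global summary
    fused = np.concatenate(
        [local, np.tile(global_feat, (k, 1))], axis=1)   # (k, 2d)
    return fused
```

Because selection sorts by a score computed per point and the global summary is symmetric, shuffling the input rows leaves the fused output unchanged, which is the permutation-invariance property the architecture is built around.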
The local feature generation module, SortNet, is responsible for extracting ordered local features from different subspaces. It uses a self multi-head attention layer to capture spatial and higher-order relations between points. A row-wise feed-forward network then reduces the feature dimension to one, producing a learnable scalar score for each input point. The K points with the highest scores are selected, and neighboring points are grouped around each of them based on proximity. Finally, the selected points' coordinates, their scores, and the aggregated neighborhood features are concatenated to form the ordered local feature vector.
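The selection step above can be sketched as follows. Here a fixed weight vector `w` stands in for the learned row-wise feed-forward network, and mean-pooled nearest neighbors stand in for the paper's grouping; both are assumptions for illustration:

```python
import numpy as np

def sortnet_sketch(feats, points, w, k=3, n_neighbors=2):
    """Sketch of SortNet's score-and-select step.

    feats: (n, f) per-point features, points: (n, 3) coordinates.
    w: (f,) stand-in for the learned row-wise feed-forward scorer.
    """
    scores = feats @ w                          # (n,) scalar score per point
    top = np.argsort(-scores)[:k]               # indices of top-k scores, sorted
    sel_xyz = points[top]                       # (k, 3) selected coordinates
    # group by proximity: nearest neighbors of each selected point
    d = np.linalg.norm(points[None] - sel_xyz[:, None], axis=-1)  # (k, n)
    nn = np.argsort(d, axis=1)[:, :n_neighbors]
    grouped = feats[nn].mean(axis=1)            # (k, f) aggregated neighborhoods
    # concatenate coordinates, score, and aggregated neighborhood features
    return np.concatenate([sel_xyz, scores[top][:, None], grouped], axis=1)
```

Since the top-k indices are determined by score values rather than input order, the returned list is the same regardless of how the input points are shuffled, which is exactly how SortNet induces permutation invariance.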
The global feature generation branch uses multi-scale grouping to reduce the number of points while aggregating spatial information. The global features are then related to the local features using local-global attention, which allows the network to capture shape and context information. The output of the network is a permutation invariant and ordered feature representation that can be directly used in computer vision tasks.
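Local-global attention is a cross-attention: the ordered local features act as queries and the global features as keys and values. A minimal NumPy sketch, again omitting the learned projections of a full attention layer:

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def local_global_attention(local, global_):
    """Cross-attention sketch: each local feature (query) attends to
    all global features (keys/values). Learned projections omitted."""
    d = local.shape[1]
    scores = local @ global_.T / np.sqrt(d)   # (n_local, n_global)
    return softmax(scores) @ global_          # (n_local, d)
```

Note the output inherits its row order from the (already ordered) local queries, while the softmax-weighted sum over global features is unchanged if the global features are permuted, so the fused representation stays permutation invariant and ordered.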
The Point Transformer is evaluated on the ModelNet40 dataset for classification and the ShapeNet dataset for part segmentation. It outperforms attention-based methods and achieves competitive results compared to state-of-the-art methods. The network is also compared to other approaches in terms of computational complexity and inference time. The results show that Point Transformer is efficient and effective for processing 3D point sets.