2021-10-09 | Ali Hatamizadeh, Yucheng Tang, Vishwesh Nath, Dong Yang, Andriy Myronenko, Bennett Landman, Holger R. Roth, Daguang Xu
The paper introduces UNETR, a novel transformer-based architecture for 3D medical image segmentation. Inspired by the success of transformers in natural language processing (NLP) for long-range sequence learning, UNETR reformulates the task of 3D medical image segmentation as a sequence-to-sequence prediction problem. The key contributions of UNETR include:
1. **Architecture**: UNETR uses a transformer encoder to learn sequence representations of the 3D input volume, capturing global multi-scale information. The encoder is connected to a CNN-based decoder via skip connections at different resolutions to compute the final semantic segmentation output (see the decoder sketch after this list).
2. **Performance**: UNETR is validated on the Multi Atlas Labeling Beyond The Cranial Vault (BTCV) dataset for multi-organ segmentation and the Medical Segmentation Decathlon (MSD) dataset for brain tumor and spleen segmentation tasks. It achieves new state-of-the-art performance on the BTCV leaderboard and outperforms competing approaches on the MSD dataset.
3. **Methodology**: The transformer encoder divides the 3D input volume into uniform, non-overlapping patches, projects them into an embedding space, and applies self-attention to learn contextual information across the whole volume. The extracted representations are merged into the decoder via skip connections to predict the segmentation output (see the encoder sketch after this list).
4. **Loss Function**: Training uses a combination of soft Dice loss and cross-entropy loss on the predicted segmentation (see the loss sketch after this list).
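
As a rough, non-authoritative illustration of the sequence reformulation in item 3, the following PyTorch-style sketch cuts a volume into non-overlapping patches, projects them into an embedding space, and runs standard self-attention blocks while keeping intermediate outputs for later skip connections. The class names (`PatchEmbed3D`, `TransformerEncoder3D`) and hyper-parameters are assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn


class PatchEmbed3D(nn.Module):
    """Split a (B, C, H, W, D) volume into non-overlapping P^3 patches and embed them."""

    def __init__(self, in_channels=1, patch_size=16, embed_dim=768):
        super().__init__()
        # A strided 3D convolution is equivalent to flattening each patch
        # and applying a shared linear projection.
        self.proj = nn.Conv3d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        x = self.proj(x)                     # (B, embed_dim, H/P, W/P, D/P)
        return x.flatten(2).transpose(1, 2)  # (B, N, embed_dim), N = HWD / P^3


class TransformerEncoder3D(nn.Module):
    """Self-attention blocks over the patch sequence; intermediate states are kept."""

    def __init__(self, num_patches, embed_dim=768, depth=12, num_heads=12):
        super().__init__()
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim))
        self.blocks = nn.ModuleList([
            nn.TransformerEncoderLayer(d_model=embed_dim, nhead=num_heads,
                                       dim_feedforward=4 * embed_dim,
                                       batch_first=True)
            for _ in range(depth)
        ])

    def forward(self, tokens):
        x = tokens + self.pos_embed
        hidden_states = []
        for block in self.blocks:
            x = block(x)
            hidden_states.append(x)          # reused by the decoder as skip inputs
        return hidden_states


# Example shapes for a single-channel 96^3 crop (illustrative only):
volume = torch.randn(2, 1, 96, 96, 96)
tokens = PatchEmbed3D()(volume)                            # (2, 216, 768)
features = TransformerEncoder3D(num_patches=tokens.shape[1])(tokens)
```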
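For the encoder-decoder connection in item 1, intermediate transformer outputs can be folded back into 3D feature maps and fused with a convolutional decoder. Below is a generic, hedged sketch of that fusion pattern; `tokens_to_volume` and `DecoderStage` are illustrative names, and the particular upsampling and normalization choices are assumptions rather than the paper's exact decoder blocks.

```python
import torch
import torch.nn as nn


def tokens_to_volume(tokens, grid_size):
    """Fold a (B, N, C) token sequence back into a (B, C, g, g, g) feature volume."""
    b, n, c = tokens.shape
    assert n == grid_size ** 3, "token count must match the patch grid"
    return tokens.transpose(1, 2).reshape(b, c, grid_size, grid_size, grid_size)


class DecoderStage(nn.Module):
    """Upsample deeper features, concatenate an encoder skip, and refine with convolutions."""

    def __init__(self, in_channels, skip_channels, out_channels):
        super().__init__()
        self.up = nn.ConvTranspose3d(in_channels, out_channels, kernel_size=2, stride=2)
        self.refine = nn.Sequential(
            nn.Conv3d(out_channels + skip_channels, out_channels, kernel_size=3, padding=1),
            nn.InstanceNorm3d(out_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x, skip):
        x = self.up(x)                    # double the spatial resolution
        x = torch.cat([x, skip], dim=1)   # skip connection from the encoder path
        return self.refine(x)


# Illustrative use: the deepest tokens (2, 216, 768) become a 6^3 feature map,
# which is upsampled and fused with an assumed higher-resolution skip feature map.
deep = tokens_to_volume(torch.randn(2, 216, 768), grid_size=6)   # (2, 768, 6, 6, 6)
skip = torch.randn(2, 256, 12, 12, 12)                           # assumed skip features
out = DecoderStage(in_channels=768, skip_channels=256, out_channels=256)(deep, skip)
```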
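For the combined objective in item 4, a minimal sketch of a soft Dice + cross-entropy loss is shown below; the smoothing constant and the equal weighting of the two terms are assumptions for illustration, not necessarily the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DiceCELoss(nn.Module):
    """Soft Dice loss plus voxel-wise cross-entropy, averaged over classes."""

    def __init__(self, smooth=1e-5):
        super().__init__()
        self.smooth = smooth

    def forward(self, logits, target):
        # logits: (B, C, H, W, D); target: (B, H, W, D) with integer class indices
        ce = F.cross_entropy(logits, target)

        probs = torch.softmax(logits, dim=1)
        one_hot = F.one_hot(target, num_classes=logits.shape[1])  # (B, H, W, D, C)
        one_hot = one_hot.permute(0, 4, 1, 2, 3).float()          # (B, C, H, W, D)

        dims = (0, 2, 3, 4)                                       # sum over batch and voxels
        intersection = torch.sum(probs * one_hot, dims)
        cardinality = torch.sum(probs + one_hot, dims)
        dice = (2.0 * intersection + self.smooth) / (cardinality + self.smooth)
        soft_dice_loss = 1.0 - dice.mean()

        return ce + soft_dice_loss


# Illustrative use with 4 classes on an 8^3 crop:
logits = torch.randn(2, 4, 8, 8, 8)
target = torch.randint(0, 4, (2, 8, 8, 8))
loss = DiceCELoss()(logits, target)
```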
The paper also discusses related work, including CNN-based segmentation networks and vision transformers, and provides experimental results demonstrating the effectiveness of UNETR across these segmentation tasks.