2021-10-09 | Ali Hatamizadeh, Yucheng Tang, Vishwesh Nath, Dong Yang, Andriy Myronenko, Bennett Landman, Holger R. Roth, Daguang Xu
The paper introduces UNETR, a novel transformer-based architecture for 3D medical image segmentation. Inspired by the success of transformers in natural language processing (NLP) for long-range sequence learning, UNETR reformulates the task of 3D medical image segmentation as a sequence-to-sequence prediction problem. The key contributions of UNETR include:
1. **Architecture**: UNETR uses a transformer encoder to learn sequence representations of the 3D input volume, capturing global multi-scale information. The encoder is connected to a CNN-based decoder via skip connections at different resolutions to compute the final semantic segmentation output (see the decoder sketch after this list).
2. **Performance**: UNETR is validated on the Multi Atlas Labeling Beyond The Cranial Vault (BTCV) dataset for multi-organ segmentation and the Medical Segmentation Decathlon (MSD) dataset for brain tumor and spleen segmentation tasks. It achieves new state-of-the-art performance on the BTCV leaderboard and outperforms competing approaches on the MSD dataset.
3. **Methodology**: The transformer encoder divides the 3D input volume into uniform, non-overlapping patches, projects them into an embedding space, and applies self-attention to learn contextual information across the whole volume. The extracted representations are merged into the decoder via skip connections to predict the segmentation output (see the encoder sketch after this list).
4. **Loss Function**: Training uses a combination of soft Dice loss and cross-entropy loss on the predicted segmentation (see the loss sketch after this list).
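
As a rough, non-authoritative illustration of the sequence reformulation in item 3, the following PyTorch-style sketch cuts a volume into non-overlapping patches, projects them into an embedding space, and runs standard self-attention blocks while keeping intermediate outputs for later skip connections. The class names (`PatchEmbed3D`, `TransformerEncoder3D`) and hyper-parameters are assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn


class PatchEmbed3D(nn.Module):
    """Split a (B, C, H, W, D) volume into non-overlapping P^3 patches and embed them."""

    def __init__(self, in_channels=1, patch_size=16, embed_dim=768):
        super().__init__()
        # A strided 3D convolution is equivalent to flattening each patch
        # and applying a shared linear projection.
        self.proj = nn.Conv3d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        x = self.proj(x)                     # (B, embed_dim, H/P, W/P, D/P)
        return x.flatten(2).transpose(1, 2)  # (B, N, embed_dim), N = HWD / P^3


class TransformerEncoder3D(nn.Module):
    """Self-attention blocks over the patch sequence; intermediate states are kept."""

    def __init__(self, num_patches, embed_dim=768, depth=12, num_heads=12):
        super().__init__()
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim))
        self.blocks = nn.ModuleList([
            nn.TransformerEncoderLayer(d_model=embed_dim, nhead=num_heads,
                                       dim_feedforward=4 * embed_dim,
                                       batch_first=True)
            for _ in range(depth)
        ])

    def forward(self, tokens):
        x = tokens + self.pos_embed
        hidden_states = []
        for block in self.blocks:
            x = block(x)
            hidden_states.append(x)          # reused by the decoder as skip inputs
        return hidden_states


# Example shapes for a single-channel 96^3 crop (illustrative only):
volume = torch.randn(2, 1, 96, 96, 96)
tokens = PatchEmbed3D()(volume)                            # (2, 216, 768)
features = TransformerEncoder3D(num_patches=tokens.shape[1])(tokens)
```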
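For the encoder-decoder connection in item 1, intermediate transformer outputs can be folded back into 3D feature maps and fused with a convolutional decoder. Below is a generic, hedged sketch of that fusion pattern; `tokens_to_volume` and `DecoderStage` are illustrative names, and the particular upsampling and normalization choices are assumptions rather than the paper's exact decoder blocks.

```python
import torch
import torch.nn as nn


def tokens_to_volume(tokens, grid_size):
    """Fold a (B, N, C) token sequence back into a (B, C, g, g, g) feature volume."""
    b, n, c = tokens.shape
    assert n == grid_size ** 3, "token count must match the patch grid"
    return tokens.transpose(1, 2).reshape(b, c, grid_size, grid_size, grid_size)


class DecoderStage(nn.Module):
    """Upsample deeper features, concatenate an encoder skip, and refine with convolutions."""

    def __init__(self, in_channels, skip_channels, out_channels):
        super().__init__()
        self.up = nn.ConvTranspose3d(in_channels, out_channels, kernel_size=2, stride=2)
        self.refine = nn.Sequential(
            nn.Conv3d(out_channels + skip_channels, out_channels, kernel_size=3, padding=1),
            nn.InstanceNorm3d(out_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x, skip):
        x = self.up(x)                    # double the spatial resolution
        x = torch.cat([x, skip], dim=1)   # skip connection from the encoder path
        return self.refine(x)


# Illustrative use: the deepest tokens (2, 216, 768) become a 6^3 feature map,
# which is upsampled and fused with an assumed higher-resolution skip feature map.
deep = tokens_to_volume(torch.randn(2, 216, 768), grid_size=6)   # (2, 768, 6, 6, 6)
skip = torch.randn(2, 256, 12, 12, 12)                           # assumed skip features
out = DecoderStage(in_channels=768, skip_channels=256, out_channels=256)(deep, skip)
```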
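For the combined objective in item 4, a minimal sketch of a soft Dice + cross-entropy loss is shown below; the smoothing constant and the equal weighting of the two terms are assumptions for illustration, not necessarily the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DiceCELoss(nn.Module):
    """Soft Dice loss plus voxel-wise cross-entropy, averaged over classes."""

    def __init__(self, smooth=1e-5):
        super().__init__()
        self.smooth = smooth

    def forward(self, logits, target):
        # logits: (B, C, H, W, D); target: (B, H, W, D) with integer class indices
        ce = F.cross_entropy(logits, target)

        probs = torch.softmax(logits, dim=1)
        one_hot = F.one_hot(target, num_classes=logits.shape[1])  # (B, H, W, D, C)
        one_hot = one_hot.permute(0, 4, 1, 2, 3).float()          # (B, C, H, W, D)

        dims = (0, 2, 3, 4)                                       # sum over batch and voxels
        intersection = torch.sum(probs * one_hot, dims)
        cardinality = torch.sum(probs + one_hot, dims)
        dice = (2.0 * intersection + self.smooth) / (cardinality + self.smooth)
        soft_dice_loss = 1.0 - dice.mean()

        return ce + soft_dice_loss


# Illustrative use with 4 classes on an 8^3 crop:
logits = torch.randn(2, 4, 8, 8, 8)
target = torch.randint(0, 4, (2, 8, 8, 8))
loss = DiceCELoss()(logits, target)
```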
The paper also discusses related work, including CNN-based segmentation networks and vision transformers, and provides experimental results demonstrating the effectiveness of UNETR across these segmentation tasks.