Unleashing the Potential of SAM for Medical Adaptation via Hierarchical Decoding

27 Mar 2024 | Zhiheng Cheng, Qingyue Wei, Hongru Zhu, Yan Wang, Liangqiong Qu, Wei Shao, Yuyin Zhou
This paper introduces H-SAM, a prompt-free adaptation of the Segment Anything Model (SAM) designed for efficient fine-tuning on medical images via a two-stage hierarchical decoding procedure. H-SAM addresses the difficulty of applying SAM to medical imaging, where existing methods require either extensive training data or high-quality prompts. The proposed method injects medical knowledge through a streamlined two-stage hierarchical mask decoder while keeping the image encoder frozen. In the first stage, H-SAM uses SAM's original lightweight mask decoder to generate a prior probabilistic mask, which then guides a more intricate decoding process in the second stage.

Two key designs are introduced: 1) a class-balanced, mask-guided self-attention mechanism (CMAttn) that counteracts the unbalanced label distribution and enhances the image embedding; and 2) a learnable mask cross-attention mechanism that spatially modulates the interaction among different image regions based on the prior mask. A hierarchical pixel decoder is also incorporated to strengthen the model's ability to capture fine-grained, localized details.

H-SAM achieves a 4.78% improvement in average Dice over existing prompt-free SAM variants for multi-organ segmentation using only 10% of 2D slices. Notably, without using any unlabeled data, H-SAM even outperforms state-of-the-art semi-supervised models that rely on extensive unlabeled training data across various medical datasets, and it reaches promising performance on prostate and left atrium segmentation using only 3 and 4 labeled cases, respectively. The paper also presents ablation studies showing that the learnable mask cross-attention and CMAttn mechanisms each contribute significant gains, and that the hierarchical pixel decoder improves the capture of detailed structures. An efficiency analysis shows that H-SAM delivers better performance with fewer parameters than other prompt-free SAM variants.
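To make the second key design more concrete, here is a minimal NumPy sketch of cross-attention whose logits are spatially biased by a stage-1 prior mask. This is a hypothetical simplification for illustration only, not the authors' implementation: in H-SAM the modulation is learnable per layer, whereas here a fixed scalar `alpha` and a log-mask bias stand in for it.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mask_cross_attention(queries, keys, values, prior_mask, alpha=1.0):
    """Cross-attention with logits spatially modulated by a prior
    probability mask (illustrative stand-in for H-SAM's learnable
    mask cross-attention; `alpha` is a fixed scalar here, not learned).

    queries: (Q, d); keys, values: (N, d); prior_mask: (N,) in [0, 1].
    """
    d = queries.shape[-1]
    logits = queries @ keys.T / np.sqrt(d)            # (Q, N)
    # Bias attention toward regions the stage-1 decoder marked as likely
    # foreground; near-zero mask values push logits strongly negative.
    logits = logits + alpha * np.log(prior_mask + 1e-6)
    weights = softmax(logits, axis=-1)                # rows sum to 1
    return weights @ values                           # (Q, d)
```

With a near-one-hot prior mask, the output collapses toward the value at the unmasked position, which is the intended spatial-gating behavior.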
Qualitative results further demonstrate that H-SAM produces precise mask predictions with less noise and correctly attributes each organ to its category, even in challenging cases. Overall, H-SAM offers a robust, efficient, and data-economical solution for medical image segmentation, with significant potential to advance the field.
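The average Dice figures cited above use the standard Dice coefficient; a minimal reference implementation (the textbook formulation, not the authors' evaluation code) looks like this:

```python
import numpy as np

def dice_score(pred, target, eps=1e-6):
    """Dice coefficient between two binary masks: 2|A∩B| / (|A|+|B|).
    `eps` avoids division by zero when both masks are empty."""
    pred = np.asarray(pred).astype(bool)
    target = np.asarray(target).astype(bool)
    inter = np.logical_and(pred, target).sum()
    return (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)
```

Per-organ Dice scores are typically averaged across classes and cases to produce the single "average Dice" number used for model comparison.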