The paper addresses the challenge of image resolution variation in the Segment Anything Model (SAM), which is known for its zero-shot generalizability but exhibits performance degradation when faced with datasets of varying image sizes. Previous approaches often resize images or change patch sizes, which can hinder the preservation of SAM's rich prior knowledge and require complete retraining, making them costly and impractical for downstream tasks. To tackle this issue, the authors propose the Scalable Bias-Mode Attention Mask (BA-SAM), which enhances SAM's adaptability to varying image resolutions without requiring structural modifications.
BA-SAM introduces a new scaling factor that keeps the magnitude of the attention layer's dot-product values consistent when the token sequence length changes. It also presents a bias-mode attention mask that lets each token prioritize neighboring information, mitigating the impact of untrained distant information. Extensive evaluations on diverse datasets, including DIS5K, DUTS, ISIC, COD10K, and COCO, demonstrate that BA-SAM significantly mitigates performance degradation in the zero-shot setting and achieves state-of-the-art performance with minimal fine-tuning.
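To make the two ideas concrete, the sketch below shows a self-attention layer with (a) a length-aware logit scale and (b) an additive, distance-based bias mask. It is a minimal illustration, not the authors' implementation: the log-ratio scale and the per-head linear distance penalty (and the 1-D token distance) are assumptions chosen for brevity; the exact scaling factor and mask shape in BA-SAM may differ.

```python
# Minimal sketch of length-aware scaling plus a bias-mode attention mask.
# Illustrative only: the log-ratio scale and per-head linear distance penalty
# are assumed forms, not necessarily the paper's exact formulas.
import math
import torch
import torch.nn as nn


class BiasedAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int, train_len: int):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.train_len = train_len                 # sequence length seen during (pre)training
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # one slope per head for the distance penalty (assumed, learnable here)
        self.slopes = nn.Parameter(torch.full((num_heads,), 0.1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)       # each: (B, heads, N, head_dim)

        # (a) length-aware scale: keep logit magnitude comparable when N
        #     deviates from the training length (entropy-style log ratio; an
        #     assumption standing in for the paper's derived factor).
        scale = math.log(N) / (math.log(self.train_len) * math.sqrt(self.head_dim))
        logits = (q @ k.transpose(-2, -1)) * scale  # (B, heads, N, N)

        # (b) bias-mode mask: penalize distant tokens so each token focuses on
        #     its neighborhood. 1-D index distance is used for simplicity; a
        #     2-D patch-grid distance would be the natural choice for a ViT.
        idx = torch.arange(N, device=x.device)
        dist = (idx[None, :] - idx[:, None]).abs().float()       # (N, N)
        bias = -self.slopes.view(1, -1, 1, 1) * dist             # (1, heads, N, N)
        attn = (logits + bias).softmax(dim=-1)

        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)
```

Because both changes act only on the attention logits (a scalar multiplier and an additive bias), they can in principle be dropped into a pretrained attention block without altering its weights or structure, which is the property the paper relies on to avoid retraining.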
The authors also propose a generalized model and benchmark, showcasing BA-SAM's generalizability across four of these downstream datasets simultaneously. The method is evaluated in both zero-shot and fine-tuning scenarios, and its effectiveness is demonstrated through comprehensive experiments and ablation studies. The results show that BA-SAM consistently outperforms both the SAM and MobileSAM baselines across a range of object segmentation tasks.