Matching Anything by Segmenting Anything

6 Jun 2024 | Siyuan Li, Lei Ke, Martin Danelljan, Luigi Piccinelli, Mattia Segu, Luc Van Gool, Fisher Yu
The paper introduces MASA (Matching Anything by Segmenting Anything), a novel method for robust instance association learning in videos across diverse domains without the need for tracking labels. MASA leverages the rich object segmentation capabilities of the Segment Anything Model (SAM) to learn instance-level correspondence through exhaustive data transformations. The method treats SAM outputs as dense object region proposals and learns to match these regions from a vast collection of unlabeled images. Additionally, a universal MASA adapter is designed to work with foundational segmentation or detection models, enabling them to track any detected objects with strong zero-shot tracking ability in complex domains. Extensive experiments on multiple challenging benchmarks, including TAO MOT, Open-vocabulary MOT, BDD100K MOTS, and UVO, demonstrate that MASA achieves superior performance compared to state-of-the-art methods trained with fully annotated in-domain video sequences, even in zero-shot association settings. The code for MASA is available at github.com/siyuanliii/masa.
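To make the core training idea concrete, below is a minimal, hypothetical sketch of self-supervised instance matching in the spirit described above: object proposals (standing in for SAM's dense mask outputs, converted to boxes) are embedded from two augmented views of the same unlabeled image, and a contrastive loss pulls matching instances together. The module and function names (SimpleEmbedHead, contrastive_match_loss) and the architecture are illustrative assumptions, not the authors' implementation; see the official repository for the real code.

```python
# Illustrative sketch only (NOT the authors' code): learn instance embeddings
# by matching the same proposal across two augmented views of one image.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.ops import roi_align


class SimpleEmbedHead(nn.Module):
    """Toy backbone + projection head producing per-instance embeddings."""

    def __init__(self, feat_dim=64, embed_dim=128):
        super().__init__()
        # Two conv layers give an overall feature stride of 8.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, feat_dim, 3, stride=4, padding=1), nn.ReLU(),
            nn.Conv2d(feat_dim, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.proj = nn.Linear(feat_dim * 7 * 7, embed_dim)

    def forward(self, images, boxes):
        # images: (B, 3, H, W); boxes: list of (K_i, 4) tensors in image coords
        feats = self.backbone(images)
        rois = roi_align(feats, boxes, output_size=7, spatial_scale=1.0 / 8)
        return F.normalize(self.proj(rois.flatten(1)), dim=-1)


def contrastive_match_loss(emb_a, emb_b, temperature=0.07):
    """InfoNCE-style loss: proposal i in view A should match proposal i in view B."""
    logits = emb_a @ emb_b.t() / temperature
    targets = torch.arange(emb_a.size(0), device=emb_a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


# Usage sketch: `proposal_boxes` stands in for SAM's dense proposals on an
# unlabeled image. The "augmentation" here is a horizontal flip, whose effect
# on box coordinates is known, so instance identity comes for free.
image = torch.rand(1, 3, 256, 256)
proposal_boxes = torch.tensor([[30.0, 40.0, 120.0, 180.0],
                               [140.0, 60.0, 220.0, 200.0]])

flipped = torch.flip(image, dims=[3])
w = image.shape[3]
flipped_boxes = proposal_boxes.clone()
flipped_boxes[:, [0, 2]] = w - proposal_boxes[:, [2, 0]]  # flip x1, x2

model = SimpleEmbedHead()
emb_a = model(image, [proposal_boxes])
emb_b = model(flipped, [flipped_boxes])
loss = contrastive_match_loss(emb_a, emb_b)
loss.backward()
```

In practice the paper uses far stronger geometric and photometric transformations than a single flip, and plugs the learned association head into frozen detection or segmentation foundation models via the MASA adapter; the snippet above only conveys the self-supervised matching objective.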