6 Jun 2024 | Siyuan Li¹, Lei Ke¹, Martin Danelljan¹, Luigi Piccinelli¹, Mattia Segu¹, Luc Van Gool¹,², Fisher Yu¹
MASA (Matching Anything by Segmenting Anything) is a method for robust instance association learning that matches any object across video frames in diverse domains without tracking labels. It leverages the rich object segmentation of the Segment Anything Model (SAM): SAM outputs are treated as dense object region proposals, and instance-level correspondence is learned by applying exhaustive data transformations to a vast collection of unlabeled static images, so that matching regions across two augmented views of the same image stands in for matching across video frames.

On top of this self-supervised objective, MASA introduces a universal adapter that transforms features from frozen detection or segmentation backbones into generalizable instance appearance embeddings. The adapter plugs into foundational segmentation or detection models, enabling them to track any object they detect, and the pipeline can also be built into a unified model that jointly detects/segments and tracks anything.

Extensive tests on multiple challenging MOT and MOTS benchmarks, including TAO MOT, open-vocabulary MOT, BDD100K MOTS, and UVO, show that MASA, trained only on unlabeled static images, surpasses state-of-the-art methods trained on fully annotated in-domain video sequences in zero-shot association, demonstrating strong generalization across domains.
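To make the self-supervised objective concrete, here is a minimal sketch of contrastive instance matching between two augmented views of one static image. It is not the released implementation: the symmetric InfoNCE form, the temperature value, and the assumption that proposal embeddings arrive pre-aligned row-by-row are illustrative choices.

```python
import torch
import torch.nn.functional as F

def contrastive_instance_loss(emb_a: torch.Tensor, emb_b: torch.Tensor,
                              temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over region embeddings from two augmented views.

    emb_a, emb_b: (N, D) embeddings of the same N SAM region proposals,
    extracted from two different transformations of one static image.
    Row i of emb_a and row i of emb_b come from the same instance.
    """
    emb_a = F.normalize(emb_a, dim=1)
    emb_b = F.normalize(emb_b, dim=1)
    logits = emb_a @ emb_b.t() / temperature          # (N, N) cosine similarities
    targets = torch.arange(emb_a.size(0), device=emb_a.device)
    # Each proposal must identify its own counterpart among all other proposals
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

# Toy usage: 8 proposals with 256-d embeddings from each view
loss = contrastive_instance_loss(torch.randn(8, 256), torch.randn(8, 256))
```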
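Likewise, a rough sketch of how an adapter over a frozen backbone could produce per-detection embeddings, and how those embeddings could drive frame-to-frame association. `MASAAdapterSketch`, the layer sizes, the stride-16 feature assumption, and the Hungarian matching step are all hypothetical stand-ins, not the paper's exact architecture or tracker.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.ops as ops
from scipy.optimize import linear_sum_assignment

class MASAAdapterSketch(nn.Module):
    """Hypothetical adapter: frozen backbone features -> instance embeddings."""

    def __init__(self, in_channels: int, embed_dim: int = 256):
        super().__init__()
        self.transform = nn.Sequential(                 # refine frozen features
            nn.Conv2d(in_channels, embed_dim, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(embed_dim, embed_dim, 3, padding=1),
        )
        self.head = nn.Sequential(                      # per-instance projection
            nn.Linear(embed_dim, embed_dim), nn.ReLU(inplace=True),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, feats: torch.Tensor, boxes: list) -> torch.Tensor:
        # feats: (B, C, H, W) from a frozen detector/SAM backbone
        # boxes: per-image (N_i, 4) detections in image coordinates (xyxy)
        x = self.transform(feats)
        pooled = ops.roi_align(x, boxes, output_size=7,
                               spatial_scale=1 / 16)    # assumes stride-16 features
        return self.head(pooled.mean(dim=(2, 3)))       # (sum N_i, embed_dim)

def associate(prev_emb: torch.Tensor, cur_emb: torch.Tensor,
              sim_thresh: float = 0.5) -> list:
    """Match detections across frames by cosine similarity (Hungarian)."""
    sim = F.normalize(prev_emb, dim=1) @ F.normalize(cur_emb, dim=1).t()
    rows, cols = linear_sum_assignment(-sim.detach().cpu().numpy())
    return [(r, c) for r, c in zip(rows, cols) if sim[r, c] >= sim_thresh]
```

Because only the adapter and projection head carry trainable parameters, the frozen detector or segmenter keeps its original detection behavior while gaining track embeddings, which is what lets the same recipe attach to different foundation models.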