Fast Online Object Tracking and Segmentation: A Unifying Approach

5 May 2019 | Qiang Wang*, Li Zhang*, Luca Bertinetto*, Weiming Hu, Philip H.S. Torr
This paper introduces SiamMask, a simple approach that enables fully-convolutional Siamese trackers to produce class-agnostic binary segmentation masks of the target object. The method combines visual object tracking and semi-supervised video object segmentation into a single framework, achieving real-time performance with high accuracy. SiamMask improves the offline training procedure of popular fully-convolutional Siamese approaches for object tracking by augmenting their loss with a binary segmentation task. Once trained, SiamMask relies solely on a single bounding box initialisation and operates online, producing class-agnostic object segmentation masks and rotated bounding boxes at 55 frames per second. Despite its simplicity, versatility and speed, the strategy establishes a new state of the art among real-time trackers on VOT-2018, while also demonstrating competitive performance and the best speed for the semi-supervised video object segmentation task on DAVIS-2016 and DAVIS-2017. The project website is http://www.robots.ox.ac.uk/~qwang/SiamMask.
SiamMask is a multi-task learning approach that can be used to address both visual object tracking and semi-supervised video object segmentation. The method is motivated by the success of fast tracking approaches based on fully-convolutional Siamese networks trained offline on millions of pairs of video frames and by the very recent availability of YouTube-VOS, a large video dataset with pixel-wise annotations. We aim to retain the offline trainability and online speed of these methods while significantly refining their representation of the target object, which is limited to a simple axis-aligned bounding box. To achieve this goal, we simultaneously train a Siamese network on three tasks, each corresponding to a different strategy for establishing correspondences between the target object and candidate regions in the new frames.
As in the fully-convolutional approach of Bertinetto et al., one task is to learn a measure of similarity between the target object and multiple candidates in a sliding window fashion. The output is a dense response map which only indicates the location of the object, without providing any information about its spatial extent. To refine this information, we simultaneously learn two further tasks: bounding box regression using a Region Proposal Network and class-agnostic binary segmentation. Notably, binary labels are only required during offline training to compute the segmentation loss and not online during segmentation/tracking. In our proposed architecture, each task is represented by a different branch departing from a shared CNN and contributes towards a final loss, which sums the three outputs together. Once trained, SiamMask solely relies on a single bounding box initialisation, operates online without updates and produces object segmentation masks and rotated bounding boxes at 55 frames per second. Despite its simplicity and fast speed, SiamMask establishes a new state-of-the-art on VOT-2018 for the problem of real-time object tracking.
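To make the sliding-window similarity and the summed three-branch loss more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function names, the choice of a logistic loss for the similarity and mask branches, the smooth L1 loss for the box branch, and the per-branch weights (all defaulting to 1.0) are assumptions made for illustration. Real trackers would run the correlation on CNN feature maps rather than on raw 2-D lists.

```python
import math


def cross_correlate(exemplar, search):
    """Slide `exemplar` over `search` (2-D lists of numbers) and return the
    dense response map of dot-product similarities at every valid offset."""
    eh, ew = len(exemplar), len(exemplar[0])
    sh, sw = len(search), len(search[0])
    response = []
    for top in range(sh - eh + 1):
        row = []
        for left in range(sw - ew + 1):
            score = sum(
                exemplar[i][j] * search[top + i][left + j]
                for i in range(eh)
                for j in range(ew)
            )
            row.append(score)
        response.append(row)
    return response


def logistic_loss(logit, label):
    """Numerically stable binary logistic loss for a label in {-1, +1}."""
    z = label * logit
    return max(0.0, -z) + math.log1p(math.exp(-abs(z)))


def smooth_l1(pred, target):
    """Smooth L1 loss on one coordinate (quadratic near zero, linear beyond)."""
    d = abs(pred - target)
    return 0.5 * d * d if d < 1.0 else d - 0.5


def mean(values):
    return sum(values) / len(values)


def multi_task_loss(score_logits, score_labels,
                    box_pred, box_target,
                    mask_logits, mask_labels,
                    w_score=1.0, w_box=1.0, w_mask=1.0):
    """Weighted sum of the three branch losses.
    The weights are hypothetical knobs, not values taken from the source."""
    l_score = mean([logistic_loss(s, y) for s, y in zip(score_logits, score_labels)])
    l_box = mean([smooth_l1(p, t) for p, t in zip(box_pred, box_target)])
    l_mask = mean([logistic_loss(m, y) for m, y in zip(mask_logits, mask_labels)])
    return w_score * l_score + w_box * l_box + w_mask * l_mask


# Toy example: the 2x2 exemplar matches the top-left corner of the search area.
exemplar = [[1, 0], [0, 1]]
search = [[1, 0, 0], [0, 1, 0], [0, 0, 0]]
response = cross_correlate(exemplar, search)  # peak at offset (0, 0)
```

The response map only says where the best match is (here, the top-left offset), which is why the text adds the box and mask branches to recover the object's extent. Mask labels appear only in `multi_task_loss`, mirroring the point that pixel-wise labels are needed during offline training but not online.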