[slides and audio] Learning Multi-domain Convolutional Neural Networks for Visual Tracking

This paper proposes a novel visual tracking algorithm based on a multi-domain Convolutional Neural Network (CNN), referred to as MDNet. The algorithm pretrains a CNN using a large set of videos with tracking ground-truths to obtain a generic target representation. The network consists of shared layers and multiple domain-specific branches, where each branch is responsible for binary classification to identify the target in its respective domain. The network is trained iteratively with respect to each domain to obtain generic target representations in the shared layers. When tracking a target in a new sequence, a new network is constructed by combining the shared layers of the pretrained CNN with a new binary classification layer, which is updated online. Online tracking is performed by evaluating candidate windows randomly sampled around the previous target state. The proposed algorithm demonstrates outstanding performance compared to state-of-the-art methods in existing tracking benchmarks. The algorithm is designed to learn domain-independent information from domain-specific information, enabling it to capture shared representations effectively. The network architecture is smaller than typical CNNs used for image classification, making it more suitable for visual tracking. The algorithm also incorporates an effective online tracking framework based on the representations learned by MDNet. When a test sequence is given, existing binary classification branches are removed and a new single branch is constructed to compute target scores in the test sequence. The new classification layer and fully connected layers within the shared layers are fine-tuned online during tracking to adapt to the new domain. The online update is conducted to model long-term and short-term appearance variations of a target for robustness and adaptiveness, respectively. An effective and efficient hard negative mining technique is incorporated in the learning procedure. The algorithm consists of multi-domain representation learning and online visual tracking. The main contributions of the work include proposing a multi-domain learning framework based on CNNs that separates domain-independent information from domain-specific information to capture shared representations effectively. The framework is successfully applied to visual tracking, where the CNN pretrained by multi-domain learning is updated online in the context of a new sequence to learn domain-specific information adaptively. Extensive experiments demonstrate the outstanding performance of the tracking algorithm compared to state-of-the-art techniques in two public benchmarks: Object Tracking Benchmark (OTB) and VOT2014.This paper proposes a novel visual tracking algorithm based on a multi-domain Convolutional Neural Network (CNN), referred to as MDNet. The algorithm pretrains a CNN using a large set of videos with tracking ground-truths to obtain a generic target representation. The network consists of shared layers and multiple domain-specific branches, where each branch is responsible for binary classification to identify the target in its respective domain. The network is trained iteratively with respect to each domain to obtain generic target representations in the shared layers. When tracking a target in a new sequence, a new network is constructed by combining the shared layers of the pretrained CNN with a new binary classification layer, which is updated online. Online tracking is performed by evaluating candidate windows randomly sampled around the previous target state. The proposed algorithm demonstrates outstanding performance compared to state-of-the-art methods in existing tracking benchmarks. The algorithm is designed to learn domain-independent information from domain-specific information, enabling it to capture shared representations effectively. The network architecture is smaller than typical CNNs used for image classification, making it more suitable for visual tracking. The algorithm also incorporates an effective online tracking framework based on the representations learned by MDNet. When a test sequence is given, existing binary classification branches are removed and a new single branch is constructed to compute target scores in the test sequence. The new classification layer and fully connected layers within the shared layers are fine-tuned online during tracking to adapt to the new domain. The online update is conducted to model long-term and short-term appearance variations of a target for robustness and adaptiveness, respectively. An effective and efficient hard negative mining technique is incorporated in the learning procedure. The algorithm consists of multi-domain representation learning and online visual tracking. The main contributions of the work include proposing a multi-domain learning framework based on CNNs that separates domain-independent information from domain-specific information to capture shared representations effectively. The framework is successfully applied to visual tracking, where the CNN pretrained by multi-domain learning is updated online in the context of a new sequence to learn domain-specific information adaptively. Extensive experiments demonstrate the outstanding performance of the tracking algorithm compared to state-of-the-art techniques in two public benchmarks: Object Tracking Benchmark (OTB) and VOT2014.

Learning Multi-Domain Convolutional Neural Networks for Visual Tracking

6 Jan 2016 | Hyeonseob Nam, Bohyung Han