8 Mar 2015 | Jonathan Long*, Evan Shelhamer*, Trevor Darrell
This paper presents a fully convolutional network (FCN) for semantic segmentation, which achieves state-of-the-art results on several benchmark datasets. The key idea is to build a fully convolutional network that can take input of arbitrary size and produce correspondingly-sized output with efficient inference and learning. The authors adapt existing classification networks (AlexNet, VGG net, GoogLeNet) into fully convolutional networks and transfer their learned representations by fine-tuning to the segmentation task. They then define a novel architecture that combines semantic information from a deep, coarse layer with appearance information from a shallow, fine layer to produce accurate and detailed segmentations.
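To make the adaptation concrete, a fully connected classifier layer can be reinterpreted as a convolution whose kernel covers its entire input region, so the same weights produce a spatial grid of scores on larger inputs. The following NumPy sketch (sizes chosen arbitrarily for illustration, not taken from the paper) checks this equivalence:

```python
import numpy as np

rng = np.random.default_rng(0)

# A "fully connected" classifier head: flattens a C x H x W feature map.
C, H, W, num_classes = 3, 4, 4, 5
fc_weight = rng.standard_normal((num_classes, C * H * W))

def fc_forward(feat):
    # feat: (C, H, W) -> class scores of shape (num_classes,)
    return fc_weight @ feat.reshape(-1)

# The same weights viewed as an H x W convolution with num_classes filters.
conv_weight = fc_weight.reshape(num_classes, C, H, W)

def conv_forward(feat):
    # Valid cross-correlation over a feature map of arbitrary spatial size.
    c, h, w = feat.shape
    out_h, out_w = h - H + 1, w - W + 1
    out = np.empty((num_classes, out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = feat[:, i:i + H, j:j + W]
            out[:, i, j] = np.tensordot(conv_weight, patch, axes=3)
    return out

# On an input of exactly H x W, the convolution reproduces the fc scores.
x = rng.standard_normal((C, H, W))
assert np.allclose(conv_forward(x)[:, 0, 0], fc_forward(x))

# On a larger input, the same weights yield a spatial map of scores.
big = rng.standard_normal((C, 10, 10))
print(conv_forward(big).shape)  # (5, 7, 7)
```

This is why a convolutionalized classifier can take input of arbitrary size: nothing in the converted network depends on a fixed spatial extent, and a single forward pass scores every receptive-field location at once.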
The FCN is trained end-to-end, pixels-to-pixels, and achieves a 20% relative improvement in mean IU on the PASCAL VOC 2012 dataset. It also performs well on the NYUDv2 and SIFT Flow datasets, with inference taking less than one fifth of a second for a typical image. The multi-layer refinement is realized as a "skip" architecture that fuses predictions from deep and shallow layers. The authors also discuss other techniques for dense prediction, including upsampling, patchwise training, and loss sampling.
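Among the dense-prediction techniques, in-network upsampling is central: the paper initializes learnable upsampling (deconvolution) layers with bilinear interpolation weights. Below is a minimal NumPy sketch of factor-f bilinear upsampling implemented as a transposed convolution, followed by a skip-style fusion of a coarse score map with a finer one. The kernel follows the standard bilinear formula; the symmetric cropping convention and the element-wise fusion are simplifying assumptions for illustration, not the paper's exact layer surgery:

```python
import numpy as np

def bilinear_kernel(factor):
    # 2D bilinear interpolation kernel for upsampling by `factor`.
    size = 2 * factor - factor % 2
    center = factor - 1 if size % 2 == 1 else factor - 0.5
    og = np.ogrid[:size, :size]
    return ((1 - abs(og[0] - center) / factor) *
            (1 - abs(og[1] - center) / factor))

def upsample(x, factor):
    # Transposed convolution of x (H, W) with a fixed bilinear kernel,
    # stride = factor; symmetric crop back to (factor*H, factor*W).
    k = bilinear_kernel(factor)
    size = k.shape[0]
    H, W = x.shape
    out = np.zeros((factor * H + size - factor, factor * W + size - factor))
    for i in range(H):
        for j in range(W):
            out[i * factor:i * factor + size,
                j * factor:j * factor + size] += x[i, j] * k
    crop = (size - factor) // 2
    return out[crop:crop + factor * H, crop:crop + factor * W]

# A constant map stays constant in the interior after 2x upsampling.
up = upsample(np.ones((4, 4)), 2)
print(up.shape)  # (8, 8)
assert np.allclose(up[1:-1, 1:-1], 1.0)

# Skip-style fusion: upsample coarse scores, add finer-layer scores.
rng = np.random.default_rng(0)
coarse = rng.standard_normal((4, 4))   # deep, coarse, semantic
fine = rng.standard_normal((8, 8))     # shallow, fine, appearance
fused = fine + upsample(coarse, 2)
```

Because the bilinear weights are an initialization rather than a fixed operation, the network can refine them during end-to-end training, which is what makes the upsampling layers learnable.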
The paper also compares the FCN with previous state-of-the-art methods such as SDS and R-CNN, showing that it achieves better performance on multiple metrics. The authors conclude that fully convolutional networks are a rich class of models for semantic segmentation, and that extending classification networks to segmentation and improving the architecture with multi-resolution layer combinations significantly improves the state of the art while simplifying and speeding up learning and inference.