Synthetic Data for Text Localisation in Natural Images


22 Apr 2016 | Ankush Gupta, Andrea Vedaldi, Andrew Zisserman
This paper introduces a new method for text detection in natural images, built on two main contributions. The first is a fast and scalable engine for generating synthetic images of text in clutter: it overlays synthetic text onto existing background images in a natural way, accounting for local 3D scene geometry. The resulting dataset, called SynthText in the Wild, contains 800,000 scene-text images with multiple instances of words rendered in different styles, and is suitable for training high-performance scene text detectors.

The second contribution is a text detection deep architecture that is both accurate and efficient: a Fully-Convolutional Regression Network (FCRN), trained on the synthetic images, which performs text detection and bounding-box regression densely, at every image location and at multiple scales. At each location it predicts the parameters of a bounding box enclosing the word centered there, similar to the You Only Look Once (YOLO) technique but with convolutional regressors that significantly improve performance; the paper also discusses the relation of FCRN to YOLO and other end-to-end object detection systems based on deep learning. The new data and detector achieve state-of-the-art text detection performance on standard benchmarks, with an F-measure of 84.2% on the standard ICDAR 2013 benchmark, while processing 15 images per second on a GPU, an order of magnitude faster than traditional text detectors at test time.
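To place text so that it respects local 3D scene geometry, the engine must estimate the orientation of candidate surfaces before rendering. The snippet below is a minimal illustrative sketch of one plausible sub-step, fitting a plane normal to 3D points sampled from a region by least squares (SVD); the paper's actual pipeline (depth estimation, region segmentation, robust plane fitting) is more involved, and this simplified version is not its exact method.

```python
import numpy as np

def fit_plane_normal(points):
    """Least-squares plane normal for a set of 3D points.

    points: (N, 3) array of 3D points sampled from a candidate
    text region (e.g. back-projected from a depth map).
    Returns a unit normal oriented towards the camera (-z).

    Illustrative sketch only: the paper's engine uses a more
    robust fitting procedure on estimated depth.
    """
    centroid = points.mean(axis=0)
    # The right-singular vector with the smallest singular value
    # is the direction of least variance, i.e. the plane normal.
    _, _, vt = np.linalg.svd(points - centroid)
    normal = vt[-1]
    if normal[2] > 0:          # flip so the normal faces the camera
        normal = -normal
    return normal / np.linalg.norm(normal)
```

Given such a normal, the synthetic text can be warped (e.g. by a homography) so that it appears painted onto the surface rather than pasted flat over the image.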
The paper also demonstrates the importance of verisimilitude in the dataset: if the detector is trained on images with words inserted synthetically without taking account of the scene layout, detection performance is substantially inferior. Finally, because of the more accurate detection step, end-to-end word recognition also improves when the new detector is swapped in for existing ones in state-of-the-art pipelines.
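The dense-regression idea behind FCRN can be illustrated with a small decoding routine: the network outputs a coarse grid in which every cell carries a word-presence confidence and bounding-box parameters, and cells above a confidence threshold are converted into image-space boxes. The per-cell field layout `(dx, dy, w, h, theta, conf)` and the stride of 16 used below are illustrative assumptions; the paper's exact parameterisation differs in detail.

```python
import numpy as np

def decode_dense_predictions(pred, stride=16, conf_thresh=0.5):
    """Decode a dense grid of box regressions into image-space boxes.

    pred: (H, W, 6) array; each cell holds (dx, dy, w, h, theta, conf),
    where (dx, dy) is the offset of the box centre from the cell centre
    in pixels, (w, h) the box size, theta its rotation, conf in [0, 1].
    Returns a list of (cx, cy, w, h, theta, conf) tuples for cells
    whose confidence exceeds conf_thresh.

    Hypothetical field layout for illustration; not the paper's
    exact parameterisation.
    """
    boxes = []
    H, W, _ = pred.shape
    for i in range(H):
        for j in range(W):
            dx, dy, w, h, theta, conf = pred[i, j]
            if conf < conf_thresh:
                continue
            cx = (j + 0.5) * stride + dx   # cell centre + predicted offset
            cy = (i + 0.5) * stride + dy
            boxes.append((cx, cy, w, h, theta, conf))
    return boxes
```

Because every grid cell regresses a box in a single forward pass, there is no sliding-window or proposal stage at test time, which is what makes this family of detectors an order of magnitude faster than traditional pipelines (the surviving boxes would typically then be filtered by non-maximum suppression).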