22 Apr 2016 | Ankush Gupta, Andrea Vedaldi, Andrew Zisserman
This paper introduces a novel method for text detection in natural images, comprising two main contributions. First, it presents a fast and scalable engine to generate synthetic images of text in cluttered scenes, blending synthetic text naturally with existing background images while accounting for local 3D scene geometry. Second, it employs these synthetic images to train a Fully-Convolutional Regression Network (FCRN), which efficiently performs text detection and bounding-box regression at various scales and locations in an image. The FCRN is compared to the YOLO detector and other end-to-end deep learning-based object detection systems. The resulting detection network significantly outperforms current methods, achieving an F-measure of 84.2% on the ICDAR 2013 benchmark and processing 15 images per second on a GPU.
The paper also discusses the importance of verisimilitude in the synthetic dataset, demonstrating that performance degrades when synthetic text is not aligned with the scene layout. Additionally, the improved detection step enhances end-to-end word recognition in existing pipelines.
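For readers unfamiliar with the F-measure quoted above, the following minimal sketch shows how it combines detection precision and recall as their harmonic mean. The function names and the counts in the usage lines are hypothetical and chosen only for illustration; they are not results from the paper, and the benchmark's own protocol for matching detections to ground-truth boxes is not reproduced here.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Return the F-beta score; beta=1 gives the harmonic mean of precision and recall."""
    if precision < 0 or recall < 0:
        raise ValueError("precision and recall must be non-negative")
    denom = beta ** 2 * precision + recall
    if denom == 0:
        # No correct detections at all: define the score as zero.
        return 0.0
    return (1 + beta ** 2) * precision * recall / denom


def detection_scores(true_positives: int, num_detections: int, num_ground_truth: int):
    """Compute (precision, recall, F-measure) from raw counts.

    Assumes the matching of detections to ground-truth boxes has already
    been done, so that true_positives is simply given.
    """
    precision = true_positives / num_detections if num_detections else 0.0
    recall = true_positives / num_ground_truth if num_ground_truth else 0.0
    return precision, recall, f_measure(precision, recall)


# Hypothetical counts: 90 correct boxes out of 100 predicted, 120 annotated words.
p, r, f = detection_scores(90, 100, 120)
print(f"precision={p:.3f} recall={r:.3f} F={f:.3f}")
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a detector cannot reach a high F-measure by maximizing precision or recall alone.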