2016 | Zhi Tian, Weilin Huang*, Tong He, Pan He, and Yu Qiao
This paper proposes the Connectionist Text Proposal Network (CTPN) for accurate text line localization in natural images. The CTPN detects text lines directly in convolutional feature maps by generating a sequence of fine-scale text proposals. A vertical anchor mechanism jointly predicts the location and text/non-text score of each proposal, which significantly improves localization accuracy. The sequential proposals are then connected by a recurrent neural network (RNN) that is seamlessly integrated into the convolutional network, so the whole model is trainable end-to-end. This lets the CTPN exploit rich contextual information along a text line, making it effective on ambiguous text.

For each proposal the model outputs three predictions: a text/non-text score, vertical coordinates, and side-refinement offsets; the side-refinement step further improves localization at the horizontal ends of a text line. Training is end-to-end with multi-task learning, combining three loss terms for classification, vertical coordinate regression, and side-refinement.

The CTPN works reliably on multi-scale and multi-language text without further post-processing, unlike previous methods that require multiple post-filtering steps. It achieves F-measures of 0.88 and 0.61 on the ICDAR 2013 and ICDAR 2015 benchmarks respectively, outperforming recent results, and it is computationally efficient, running at 0.14 seconds per image with the VGG16 backbone.
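The vertical anchor mechanism fixes each proposal's width (16 pixels in the paper) and regresses only two vertical values per anchor: a center offset v_c and a height scale v_h, relative to a reference anchor. Below is a minimal sketch of decoding those predictions back to absolute coordinates, assuming the paper's parameterization v_c = (c_y − c_y^a)/h^a and v_h = log(h/h^a); the exact anchor-height list and the function name are illustrative assumptions, not the released implementation.

```python
import numpy as np

# Illustrative anchor heights: the paper uses k = 10 anchors of fixed
# 16-px width with heights spanning roughly 11 to 273 px; this exact
# list is an assumption for the sketch.
ANCHOR_HEIGHTS = [11, 16, 23, 33, 48, 68, 97, 139, 198, 273]

def decode_vertical(v_c, v_h, anchor_cy, anchor_h):
    """Invert the relative vertical parameterization:
    v_c = (cy - anchor_cy) / anchor_h  and  v_h = log(h / anchor_h)."""
    cy = v_c * anchor_h + anchor_cy   # absolute center-y of the proposal
    h = np.exp(v_h) * anchor_h        # absolute proposal height
    return cy, h

# Example: a prediction relative to an anchor centered at y=100 with height 23.
cy, h = decode_vertical(v_c=0.1, v_h=0.2, anchor_cy=100.0, anchor_h=23.0)
```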
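To connect the sequential proposals, the paper sweeps a recurrent layer along the width of the last convolutional feature map, so each fine-scale proposal sees context from its neighbors. A minimal PyTorch sketch of that idea follows, assuming a bidirectional LSTM applied to each feature-map row; the module name and channel sizes are assumptions for illustration.

```python
import torch
import torch.nn as nn

class RecurrentConnector(nn.Module):
    """Run a bidirectional LSTM over the width dimension of a conv
    feature map, treating each row as a left-to-right sequence."""
    def __init__(self, in_channels=512, hidden=128):
        super().__init__()
        self.rnn = nn.LSTM(in_channels, hidden,
                           bidirectional=True, batch_first=True)

    def forward(self, feat):                      # feat: (N, C, H, W)
        n, c, h, w = feat.shape
        seq = feat.permute(0, 2, 3, 1).reshape(n * h, w, c)   # one sequence per row
        out, _ = self.rnn(seq)                                # (N*H, W, 2*hidden)
        return out.reshape(n, h, w, -1).permute(0, 3, 1, 2)   # (N, 2*hidden, H, W)
```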
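The multi-task objective combines the three predictions above. Here is a hedged sketch, assuming softmax classification and smooth-L1 regression terms in the style of Faster R-CNN detectors; the function name, masks, and the loss weights lambda1/lambda2 are illustrative placeholders rather than values confirmed from the paper's code.

```python
import torch.nn.functional as F

def ctpn_loss(cls_logits, cls_targets,
              v_pred, v_targets, v_mask,
              o_pred, o_targets, o_mask,
              lambda1=1.0, lambda2=1.0):
    # Text/non-text classification over all sampled anchors.
    l_cls = F.cross_entropy(cls_logits, cls_targets)
    # Vertical coordinate regression, computed only on anchors
    # matched to ground-truth text (v_mask is a boolean mask).
    l_v = F.smooth_l1_loss(v_pred[v_mask], v_targets[v_mask])
    # Side-refinement offsets, computed only on anchors near the
    # left/right ends of a text line (o_mask is a boolean mask).
    l_o = F.smooth_l1_loss(o_pred[o_mask], o_targets[o_mask])
    return l_cls + lambda1 * l_v + lambda2 * l_o
```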