TextBoxes: A Fast Text Detector with a Single Deep Neural Network

2017 | Minghui Liao, Baoguang Shi, Xiang Bai, Xinggang Wang, Wenyu Liu
TextBoxes is an end-to-end trainable, fast scene-text detector that achieves high accuracy and efficiency in a single network forward pass, with no post-processing other than standard non-maximum suppression (NMS). It outperforms competing methods in text-localization accuracy and is significantly faster, taking only 0.09 seconds per image in its fast implementation. When combined with a text recognizer, TextBoxes significantly outperforms state-of-the-art approaches on word spotting and end-to-end text recognition.

Architecturally, TextBoxes is a fully convolutional network that directly outputs word bounding-box coordinates at multiple network layers by jointly predicting text presence and coordinate offsets relative to default boxes. The final output aggregates the boxes from all layers and applies standard NMS. To handle the large variation in the aspect ratios of words, TextBoxes uses novel, inception-style output layers that combine irregular (elongated) convolutional kernels with matching default boxes. The detector delivers both high accuracy and high efficiency with a single forward pass on single-scale input, and even higher accuracy with multiple passes on multi-scale inputs.

TextBoxes is inspired by SSD, a recent development in object detection. SSD detects general objects well but fails on words with extreme aspect ratios; TextBoxes addresses this with its text-box layers, which significantly improve performance. For word spotting and end-to-end recognition, TextBoxes is paired with the CRNN text recognizer, which directly outputs character sequences from input images and is itself end-to-end trainable. The confidence scores of CRNN are used to regularize the detection outputs of TextBoxes, further boosting word-spotting accuracy.

TextBoxes is evaluated on the ICDAR 2011 and ICDAR 2013 datasets, achieving high performance in text localization.
It outperforms competing methods in F-measure and ranks first in testing speed. When combined with the recognition model, TextBoxes achieves state-of-the-art performance on end-to-end recognition benchmarks. TextBoxes performs well in most situations but still fails on some difficult cases, such as overexposure and large character spacing.

The paper concludes that TextBoxes is a stable, efficient, end-to-end fully convolutional network for text detection, with comprehensive evaluations and comparisons on benchmark datasets validating its advantages in text detection, word spotting, and end-to-end recognition. Future work includes extending TextBoxes to multi-oriented text and unifying the detection and recognition networks into a single framework.
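As a concrete illustration of the detection pipeline described above, the sketch below decodes predicted coordinate offsets against a default box and then applies standard greedy non-maximum suppression. This is a minimal sketch in plain Python: the center/size offset parameterization follows the common SSD convention, and the IoU threshold is an illustrative assumption, not a value taken from the paper.

```python
import math

def decode(default_box, offsets):
    """Apply predicted (dx, dy, dw, dh) offsets to a default box.

    default_box: (cx, cy, w, h) in normalized image coordinates.
    Returns the decoded box as (xmin, ymin, xmax, ymax).
    SSD-style convention (an assumption here): centers shift by a
    fraction of the box size; width/height scale exponentially.
    """
    cx0, cy0, w0, h0 = default_box
    dx, dy, dw, dh = offsets
    cx = cx0 + dx * w0
    cy = cy0 + dy * h0
    w = w0 * math.exp(dw)
    h = h0 * math.exp(dh)
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

def iou(a, b):
    """Intersection-over-union of two (xmin, ymin, xmax, ymax) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0

def nms(boxes, scores, iou_thresh=0.45):
    """Standard greedy NMS: keep the highest-scoring box, discard any
    remaining box overlapping it by more than iou_thresh, repeat.
    Returns the indices of the kept boxes."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        i = order.pop(0)
        keep.append(i)
        order = [j for j in order if iou(boxes[i], boxes[j]) <= iou_thresh]
    return keep
```

In the full detector, boxes decoded from all output layers are pooled together and `nms` is run once over the pooled set, which matches the "aggregate, then suppress" description above.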