Synthetic Data and Artificial Neural Networks for Natural Scene Text Recognition

9 Dec 2014 | Max Jaderberg, Karen Simonyan, Andrea Vedaldi, Andrew Zisserman
This paper presents a framework for natural scene text recognition that requires no human-labelled data and performs word recognition holistically, departing from character-based systems. Deep convolutional neural networks are trained purely on synthetic data produced by a text generation engine; the rendered data is realistic enough to replace real data entirely, providing an effectively unlimited supply of training samples at no data-acquisition cost. Three output encodings are considered: dictionary encoding, character sequence encoding, and bag-of-N-grams encoding. All three substantially improve the state of the art on standard benchmarks, particularly in lexicon-based and unconstrained recognition settings, using fast and simple machinery.

The synthetic data engine creates realistic scene-text images by rendering words in a wide range of fonts, applying geometric distortions and shading, and blending the result with natural images. Because no human labelling is involved, training sets can be generated for very large vocabularies, or for other languages, simply by changing the word source. The models are evaluated on standard datasets such as ICDAR 2003, Street View Text, and IIIT5k, showing significant improvements in accuracy.
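The generation pipeline can be approximated in a few lines. The following is a minimal, hypothetical sketch using Pillow and NumPy, not the authors' actual engine: the font paths, the simple rotation standing in for the paper's projective distortions, and the blending and noise parameters are all illustrative assumptions.

```python
import random

import numpy as np
from PIL import Image, ImageDraw, ImageFont

def render_synthetic_word(word, font_paths, background_paths):
    """Render one synthetic training image for `word` (illustrative sketch)."""
    # 1. Draw the word in a randomly chosen font on a blank grayscale canvas.
    font = ImageFont.truetype(random.choice(font_paths), size=random.randint(24, 48))
    canvas = Image.new("L", (256, 64), color=0)
    ImageDraw.Draw(canvas).text((10, 10), word, fill=255, font=font)

    # 2. Apply a small random rotation as a stand-in for the paper's
    #    projective and elastic distortions.
    canvas = canvas.rotate(random.uniform(-10, 10), resample=Image.BILINEAR)

    # 3. Composite dark text onto a random natural-image crop so the sample
    #    inherits realistic background texture.
    bg = Image.open(random.choice(background_paths)).convert("L").resize(canvas.size)
    mask = np.asarray(canvas, dtype=np.float32) / 255.0   # 1.0 where ink is
    back = np.asarray(bg, dtype=np.float32) / 255.0
    ink = random.uniform(0.0, 0.3)                        # assumed text darkness
    blended = back * (1.0 - mask) + ink * mask

    # 4. Add mild Gaussian noise to mimic sensor noise and compression.
    blended += np.random.normal(0.0, 0.02, blended.shape)
    return Image.fromarray((np.clip(blended, 0.0, 1.0) * 255).astype(np.uint8))
```

Since every sample is rendered from a known word, the label comes for free, which is what lets the framework scale to 90k-word vocabularies without any annotation effort.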
The three models differ in how a word is encoded at the network output: the dictionary model (DICT) classifies the input image directly into one of the words of a large dictionary, up to 90k classes; the character sequence model (CHAR) predicts the character at each of up to 23 positions independently, using a null class for unused positions; and the bag-of-N-grams model (NGRAM) represents a word as an unordered set of the N-grams it contains. All three are trained solely on synthetic data and outperform prior methods on the standard benchmarks, with the dictionary model remaining accurate even at the full 90k-word vocabulary. The largest model, DICT+2-90k, combines high accuracy with fast inference on a single GPU, making the framework practical for real-world applications. The paper concludes that the proposed framework sets a new benchmark for scene text recognition.
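To make the three encodings concrete, the sketch below maps a ground-truth word to each target representation. The 23-position limit, the 37-class character set (26 letters, 10 digits, plus a null), and N-gram lengths up to 4 follow the paper; the function names and the toy N-gram vocabulary in the usage example are illustrative assumptions.

```python
import numpy as np

CHARS = "abcdefghijklmnopqrstuvwxyz0123456789"  # 36 characters + 1 null class
MAX_LEN = 23  # maximum word length handled by the character sequence model

def dict_target(word, lexicon):
    """Dictionary encoding: one class index per lexicon word (90k-way softmax)."""
    return lexicon.index(word)

def char_seq_target(word):
    """Character sequence encoding: one 37-way label per position, padded
    with the null class (index 36). Assumes len(word) <= MAX_LEN."""
    null = len(CHARS)
    labels = [CHARS.index(c) for c in word.lower()]
    return labels + [null] * (MAX_LEN - len(labels))

def ngram_target(word, ngram_vocab):
    """Bag-of-N-grams encoding: a binary vector marking which N-grams
    (lengths 1 to 4) occur anywhere in the word."""
    y = np.zeros(len(ngram_vocab), dtype=np.float32)
    w = word.lower()
    for n in range(1, 5):
        for i in range(len(w) - n + 1):
            gram = w[i:i + n]
            if gram in ngram_vocab:
                y[ngram_vocab[gram]] = 1.0
    return y

# Toy usage with a hypothetical 4-entry N-gram vocabulary:
vocab = {"h": 0, "he": 1, "ell": 2, "llo": 3}
print(ngram_target("hello", vocab))  # -> [1. 1. 1. 1.]
```

The dictionary encoding turns recognition into plain image classification; the character and N-gram encodings trade that simplicity for the ability to recognise words outside any fixed lexicon.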