17 Apr 2024 | George Retsinas, Giorgos Sfikas, Basilis Gatos, and Christophoros Nikou
This paper presents best practices for building effective handwritten text recognition (HTR) systems. The authors propose three key modifications to improve performance: 1) retain the aspect ratio of images during preprocessing by padding, 2) use column-wise max-pooling over the feature-map height to reduce parameters and improve performance, and 3) add a CTC shortcut branch to assist training. Applied to a basic convolutional-recurrent (CNN+LSTM) architecture, these modifications achieve state-of-the-art results on the IAM and RIMES datasets.
The proposed system uses a simple preprocessing pipeline (image resizing, padding, and augmentation) and employs a CTC loss for sequence alignment and recognition. The CTC shortcut branch, consisting of a single 1D convolutional layer, is trained alongside the main network to improve training efficiency and performance. Evaluated on the two widely used datasets, the system shows significant improvements in character and word error rates, demonstrating that the proposed practices are effective and applicable to a wide range of HTR systems. The paper also reviews related work, highlighting how each modification improves performance and generalization. The authors conclude that their approach provides a simple yet effective set of best practices for building HTR systems.
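The first modification, aspect-ratio-preserving preprocessing, can be sketched as follows: rescale each line image to a fixed height and pad (rather than stretch) to a fixed width. This is a minimal numpy-only illustration; the function name and the canvas dimensions (`target_h`, `target_w`) are hypothetical placeholders, not the paper's exact configuration.

```python
import numpy as np

def resize_pad(img, target_h=128, target_w=1024, pad_value=0):
    """Rescale a grayscale line image to target_h while keeping its
    aspect ratio, then right-pad with pad_value to a fixed target_w
    (instead of stretching, which would distort character shapes)."""
    h, w = img.shape
    scale = target_h / h
    new_w = min(int(round(w * scale)), target_w)
    # nearest-neighbour resize via index mapping (no external deps)
    rows = (np.arange(target_h) / scale).astype(int).clip(0, h - 1)
    cols = (np.arange(new_w) / scale).astype(int).clip(0, w - 1)
    resized = img[rows][:, cols]
    canvas = np.full((target_h, target_w), pad_value, dtype=img.dtype)
    canvas[:, :new_w] = resized
    return canvas
```

Because every image in a batch ends up with the same canvas size, batching stays trivial while the text itself keeps its original proportions.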
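The column-wise max-pooling idea can be shown in a few lines: instead of flattening the CNN feature map's height into the channel dimension before the recurrent head, take the maximum over the vertical axis. A hedged numpy sketch (the function name is illustrative):

```python
import numpy as np

def column_max_pool(fmap):
    """Column-wise max-pooling: collapse the vertical axis of a CNN
    feature map of shape (C, H, W) by taking the max over H, yielding
    a (W, C) sequence for the recurrent layers. Flattening would
    instead give (W, C*H), multiplying the LSTM input size by H."""
    return fmap.max(axis=1).T  # (C, W) -> (W, C)
```

This keeps the strongest vertical response per column, shrinking the recurrent head's input (and hence its parameter count) by a factor of H.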
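The CTC shortcut branch can be sketched as a single 1D convolution that maps the pooled column sequence directly to per-frame class scores, bypassing the recurrent layers. The sketch below is numpy-only and illustrative: the kernel size, the 'same' padding, and the remark about the auxiliary loss weight are assumptions about typical usage, not the paper's exact settings.

```python
import numpy as np

def ctc_shortcut(seq, weight, bias):
    """CTC shortcut branch: one kernel-size-3 1D convolution mapping the
    column feature sequence (W, C) straight to per-frame class scores
    (W, n_classes). During training its CTC loss is added (with a small
    weight) to the main branch's loss to ease optimization; at inference
    the shortcut is simply discarded.
    weight: (n_classes, C, 3), bias: (n_classes,)."""
    W, C = seq.shape
    padded = np.pad(seq, ((1, 1), (0, 0)))                   # 'same' padding
    windows = np.stack([padded[t:t + 3] for t in range(W)])  # (W, 3, C)
    # per frame t: dot product over kernel position p and channel c
    return np.einsum('tpc,kcp->tk', windows, weight) + bias
```

Because the shortcut is removed after training, it adds no inference cost; it only provides an easier gradient path while the deeper CNN+LSTM branch converges.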