26 Nov 2016 | Sebastien C. Wong, Adam Gatt, Victor Stamatescu and Mark D. McDonnell
This paper investigates the benefits of data augmentation for improving the performance of machine learning classifiers, focusing on two approaches: data warping in data-space and synthetic over-sampling in feature-space. The authors experimentally evaluate these techniques using a convolutional backpropagation-trained neural network (CNN), a convolutional support vector machine (CSVM), and a convolutional extreme learning machine (CELM) classifier on the MNIST handwritten digit dataset. They find that while both methods can enhance performance, data warping in data-space is more effective, especially when plausible, label-preserving transforms are known. Synthetic over-sampling in feature-space, such as SMOTE, also provides some benefit but is less effective, and DBSMOTE can even increase overfitting. The study concludes that the improvement in error rate from data augmentation is bounded by what an equivalent amount of real data would provide, and that more real data is generally better than more synthetic data.
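To make the contrast concrete, here is a minimal NumPy sketch (not the authors' code) of the two strategies. The data-space warp is a simple random pixel shift standing in for the paper's more general warping transforms, and `smote_sample` implements the core SMOTE interpolation step (Chawla et al., 2002); all function names and parameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def warp_digit(img: np.ndarray, max_shift: int = 2) -> np.ndarray:
    """Data-space augmentation: a label-preserving random pixel shift.

    A plausible transform for handwritten digits, since a small
    translation does not change the digit's class.
    """
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    return np.roll(np.roll(img, dy, axis=0), dx, axis=1)

def smote_sample(x: np.ndarray, same_class: np.ndarray, k: int = 5) -> np.ndarray:
    """Feature-space augmentation: the SMOTE interpolation step.

    Picks one of the k nearest same-class neighbors of `x` and returns
    a random point on the line segment between them.
    """
    # Distances from x to every same-class sample (x itself included).
    d = np.linalg.norm(same_class - x, axis=1)
    nearest = same_class[np.argsort(d)[1:k + 1]]  # skip x itself at index 0
    nn = nearest[rng.integers(len(nearest))]
    gap = rng.random()  # interpolation factor in [0, 1)
    return x + gap * (nn - x)

# Usage: warp a stand-in 28x28 "digit" and synthesize one feature vector.
img = rng.random((28, 28))
augmented_img = warp_digit(img)            # data-space: new plausible image
feats = rng.random((100, 64))              # stand-in features for one class
synthetic = smote_sample(feats[0], feats)  # feature-space: interpolated point
```

The sketch mirrors the paper's distinction: the warp acts on raw inputs, where domain knowledge about plausible transforms applies, while SMOTE acts on learned feature vectors, where interpolation may or may not produce realistic samples.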