26 Nov 2016 | Sebastien C. Wong, Adam Gatt, Victor Stamatescu and Mark D. McDonnell
This paper investigates the benefits of data augmentation for improving the performance of machine learning classifiers, specifically convolutional neural networks (CNNs), convolutional support vector machines (CSVMs), and convolutional extreme learning machines (CELMs). Using the MNIST handwritten digit dataset, the study evaluates two approaches: data warping (data-space augmentation) and synthetic over-sampling (feature-space augmentation). The results show that data-space augmentation, particularly elastic deformations, yields larger performance gains and less overfitting than feature-space methods such as SMOTE and DBSMOTE.
Data warping applies label-preserving transformations to images to generate new training samples. Elastic deformations, created from a normalized random displacement field smoothed with a Gaussian of standard deviation σ and scaled to a magnitude of α pixels, proved effective at generating plausible transformed digits. However, large deformations can produce characters that are difficult even for humans to recognize, indicating a loss of label integrity. The study found that α = 1.2 pixels with σ = 20 struck a good balance between performance and label preservation.
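As a concrete illustration, here is a minimal sketch of this style of elastic deformation in Python. Only α and σ come from the study; the smoothing and normalization details are our assumptions based on the description above, so treat this as an approximation rather than the paper's exact procedure.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, map_coordinates

def elastic_deform(image, alpha=1.2, sigma=20, rng=None):
    """Warp a 2-D image with a smoothed random displacement field."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = image.shape
    # Uniform random fields in [-1, 1], smoothed with a Gaussian of std sigma
    # so that neighbouring pixels move coherently rather than independently.
    dx = gaussian_filter(rng.uniform(-1, 1, (h, w)), sigma)
    dy = gaussian_filter(rng.uniform(-1, 1, (h, w)), sigma)
    # Normalise so the mean per-pixel displacement magnitude is 1, then scale
    # to alpha pixels (this normalisation is our assumption; the paper only
    # states that the field is normalized and reports alpha in pixels).
    mag = np.sqrt(dx ** 2 + dy ** 2).mean()
    dx, dy = alpha * dx / mag, alpha * dy / mag
    # Resample the input at the displaced coordinates (bilinear interpolation).
    yy, xx = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    coords = np.array([yy + dy, xx + dx])
    return map_coordinates(image, coords, order=1, mode="reflect")
```

Because σ controls the spatial smoothness of the field and α its magnitude, small α with large σ (as in the paper's chosen setting) produces gentle, globally coherent warps that keep the digit recognizable.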
In contrast, synthetic over-sampling methods such as SMOTE and DBSMOTE generate new samples in feature space by interpolating between existing ones. SMOTE gave some improvement, but DBSMOTE increased overfitting because it creates samples close to existing cluster centers. The study also found that more real data generally beats synthetic data: the benefit of synthetic augmentation is bounded by what an equivalent amount of real data would provide.
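For contrast, here is a brute-force sketch of SMOTE-style interpolation in feature space. This is a toy version for illustration, not the paper's implementation; the function name smote and its parameters are ours, and in practice a maintained library such as imbalanced-learn would be used.

```python
import numpy as np

def smote(X, n_new, k=5, rng=None):
    """Generate n_new synthetic samples from one class by interpolation.

    X: (n, d) array of feature vectors from a single class (assumes n > k).
    Each synthetic point lies on the segment between a real sample and one
    of its k nearest same-class neighbours, as in standard SMOTE.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Pairwise squared distances within the class (brute force; fine at toy scale).
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)            # exclude each sample from its own neighbours
    nn = np.argsort(d2, axis=1)[:, :k]      # k nearest neighbours per sample
    base = rng.integers(0, len(X), n_new)   # seed sample for each synthetic point
    neigh = nn[base, rng.integers(0, k, n_new)]  # one random neighbour each
    gap = rng.uniform(0, 1, (n_new, 1))     # interpolation coefficient
    return X[base] + gap * (X[neigh] - X[base])
```

Because every synthetic point lies on a segment between two real samples, SMOTE can only fill in regions the real data already spans, which is consistent with the study's finding that synthetic augmentation is bounded by the equivalent amount of real data.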
The experiments showed that data-space augmentation using elastic deformations was the most effective approach for all three classifiers, with the largest gains for the CNN. For the CSVM, data-space augmentation likewise outperformed the feature-space methods. For the CELM, data-space augmentation provided some improvement, but synthetic samples did not always help. Overall, the study concludes that data-space augmentation is preferable whenever label-preserving transforms are known for the domain; feature-space methods such as SMOTE remain an option when no such transforms are available, while DBSMOTE should be avoided because it can increase overfitting.
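To make that recommendation concrete, here is a hedged sketch of how a data-space pipeline might expand a training set with label-preserving warps. The function augment and its parameters are illustrative names, not from the paper; elastic_deform is the sketch given earlier.

```python
import numpy as np

def augment(X_train, y_train, copies=3, alpha=1.2, sigma=20, seed=0):
    """Return the original data plus `copies` elastically warped versions."""
    rng = np.random.default_rng(seed)
    X_parts, y_parts = [X_train], [y_train]
    for _ in range(copies):
        # Warping is label-preserving, so each copy reuses y_train unchanged.
        warped = np.stack([elastic_deform(img, alpha, sigma, rng)
                           for img in X_train])
        X_parts.append(warped)
        y_parts.append(y_train)
    return np.concatenate(X_parts), np.concatenate(y_parts)
```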