Understanding "Synthesizing class labels for highly imbalanced credit card fraud detection data"

This paper presents a novel methodology for synthesizing class labels in highly imbalanced credit card fraud detection data. The approach uses an autoencoder to learn from the dataset features and generate an error metric, which is then used to create new binary class labels. The labels are produced automatically, with minimal expert input, and are used to train supervised classifiers for fraud detection. Because the process does not require intensive human intervention, it is suitable for large and highly imbalanced datasets.

The methodology is evaluated on the publicly available credit card fraud detection dataset, which consists of real-world transactions and is highly imbalanced. Six supervised classifiers are trained on the synthesized labels: decision trees, random forests, extra trees, logistic regression, Naïve Bayes, and multilayer perceptrons. Empirical results show that the synthesized labels are of high quality and significantly improve classifier performance, particularly as measured by the area under the precision-recall curve (AUPRC). The improvement is most visible when more instances are labeled as positive and belong to the minority class. The methodology is also compared with other approaches, including unsupervised anomaly detection methods, and shows superior performance in classification tasks.
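The pipeline summarized above (autoencoder error, then binary labels, then a supervised classifier scored by AUPRC) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes scikit-learn, uses an `MLPRegressor` trained to reproduce its own input as a stand-in autoencoder, runs on synthetic data rather than the credit card dataset, and labels a fixed top fraction of instances by reconstruction error as positive. The function name `synthesize_labels` and the parameter `positive_fraction` are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler


def synthesize_labels(X, positive_fraction=0.01, random_state=0):
    """Return (labels, errors) derived from autoencoder reconstruction error.

    Assumption: the top `positive_fraction` of instances by error are
    labeled positive (1); all others are labeled negative (0).
    """
    X_scaled = StandardScaler().fit_transform(X)
    # Stand-in autoencoder: a small MLP with a narrow hidden layer,
    # trained to reproduce its own input.
    autoencoder = MLPRegressor(
        hidden_layer_sizes=(4,), max_iter=500, random_state=random_state
    )
    autoencoder.fit(X_scaled, X_scaled)
    # Per-instance mean squared reconstruction error.
    errors = np.mean((X_scaled - autoencoder.predict(X_scaled)) ** 2, axis=1)
    n_positive = max(1, int(round(positive_fraction * len(X))))
    labels = np.zeros(len(X), dtype=int)
    labels[np.argsort(errors)[-n_positive:]] = 1
    return labels, errors


# Synthetic, highly imbalanced data (illustrative only): 2000 points near a
# 3-dimensional subspace of a 10-dimensional space, plus 20 scattered points.
rng = np.random.default_rng(0)
latent = rng.normal(size=(2000, 3))
mixing = rng.normal(size=(3, 10))
regular = latent @ mixing + 0.1 * rng.normal(size=(2000, 10))
unusual = rng.normal(scale=4.0, size=(20, 10))
X = np.vstack([regular, unusual])

labels, errors = synthesize_labels(X, positive_fraction=0.01)

# Train a supervised classifier on the synthesized labels and score it by AUPRC.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.3, stratify=labels, random_state=0
)
clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)
scores = clf.predict_proba(X_test)[:, 1]
auprc = average_precision_score(y_test, scores)
print(f"synthesized positives: {labels.sum()}, AUPRC: {auprc:.3f}")
```

In this sketch `positive_fraction` controls how many instances receive the positive label, which corresponds loosely to the observation above that results depend on how many instances are labeled as positive. AUPRC is computed here with `average_precision_score`, one common estimate of the area under the precision-recall curve.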

Synthesizing class labels for highly imbalanced credit card fraud detection data

2024 | Robert K. L. Kennedy¹, Flavio Villanustre², Taghi M. Khoshgoftaar¹ and Zahra Salekshahrezaee¹