(2024) 11:38 | Robert K. L. Kennedy, Flavio Villanustre, Taghi M. Khoshgoftaar, Zahra Salekshahrezaee
The paper addresses the difficulty of acquiring labeled datasets for machine learning, particularly for highly imbalanced credit card fraud detection data. The authors propose a methodology that uses an autoencoder to synthesize class labels for unlabeled data with minimal expert intervention. The autoencoder learns from the dataset's features and produces an error metric, which is then used to create new binary class labels. These synthesized labels are subsequently used to train supervised classifiers for fraud detection. Empirical results show that the synthesized labels significantly improve classifier performance, as measured by the area under the precision-recall curve (AUPRC). The study also examines how varying the number of instances labeled positive affects classifier performance, finding that AUPRC improves as more instances are labeled positive. By handling both high class imbalance and the absence of labels, the methodology is suited to real-world fraud detection applications.
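The labeling step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the error metric is the per-instance reconstruction error and that the instances with the highest errors are the ones labeled positive. The small NumPy autoencoder, the synthetic data, and the names `train_autoencoder`, `reconstruction_error`, `synthesize_labels`, and `n_positive` are all hypothetical.

```python
import numpy as np


def train_autoencoder(X, hidden=2, epochs=500, lr=0.01, seed=0):
    """Fit a one-hidden-layer autoencoder with full-batch gradient descent.

    Hypothetical helper: the layer size, tanh activation, and optimizer are
    assumptions made for illustration, not the paper's architecture.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W1 = rng.normal(0.0, 0.1, size=(d, hidden))
    b1 = np.zeros(hidden)
    W2 = rng.normal(0.0, 0.1, size=(hidden, d))
    b2 = np.zeros(d)
    for _ in range(epochs):
        H = np.tanh(X @ W1 + b1)           # encode
        X_hat = H @ W2 + b2                # decode
        G = 2.0 * (X_hat - X) / n          # gradient of squared error w.r.t. X_hat
        GH = (G @ W2.T) * (1.0 - H ** 2)   # backpropagate through tanh
        W2 -= lr * (H.T @ G)
        b2 -= lr * G.sum(axis=0)
        W1 -= lr * (X.T @ GH)
        b1 -= lr * GH.sum(axis=0)
    return W1, b1, W2, b2


def reconstruction_error(X, params):
    """Return the mean squared reconstruction error of each row of X."""
    W1, b1, W2, b2 = params
    X_hat = np.tanh(X @ W1 + b1) @ W2 + b2
    return np.mean((X_hat - X) ** 2, axis=1)


def synthesize_labels(errors, n_positive):
    """Label the n_positive rows with the highest error as 1, the rest as 0.

    Assumption: high reconstruction error marks the rare (fraud) class.
    """
    labels = np.zeros(errors.shape[0], dtype=int)
    if n_positive > 0:
        top = np.argsort(errors)[-n_positive:]
        labels[top] = 1
    return labels


# Synthetic stand-in for unlabeled transaction features (not real data).
rng = np.random.default_rng(42)
latent = rng.normal(size=(500, 2))
mixing = rng.normal(size=(2, 6))
X = latent @ mixing + 0.05 * rng.normal(size=(500, 6))
X[:5] = rng.normal(scale=4.0, size=(5, 6))    # a few off-pattern rows
X = (X - X.mean(axis=0)) / X.std(axis=0)      # standardize features

params = train_autoencoder(X)
errors = reconstruction_error(X, params)
labels = synthesize_labels(errors, n_positive=5)
print(labels.sum(), "instances labeled positive")
```

In this sketch, `n_positive` plays the role of the number of positive-labeled instances that the study varies. The resulting `labels` would then serve as training targets for a supervised classifier.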