Modeling Tabular Data using Conditional GAN

28 Oct 2019 | Lei Xu, Maria Skoularidou, Alfredo Cuesta-Infante, Kalyan Veeramachaneni
This paper introduces CTGAN, a conditional generative adversarial network (GAN) designed to generate synthetic tabular data.

The paper evaluates CTGAN against Bayesian network baselines and other deep learning methods on 7 simulated and 8 real datasets, using two measures: likelihood fitness and machine learning efficacy. CTGAN outperforms the Bayesian methods on most real datasets and performs better than the other deep learning methods on both measures. CTGAN is also compared with TVAE, a variational autoencoder adapted for tabular data generation: it achieves competitive performance across many datasets and outperforms TVAE on 3 datasets.

The paper also presents a benchmarking system for synthetic data generation algorithms, comprising 5 deep learning methods, 2 Bayesian network methods, 15 datasets, and 2 evaluation mechanisms. The system is open source and can be extended with other methods and additional datasets.

Tabular data is difficult for GANs to model because it mixes continuous and discrete columns, its continuous columns often follow non-Gaussian and multimodal distributions, and its categorical columns are often imbalanced. CTGAN addresses these challenges with three techniques: mode-specific normalization, which handles non-Gaussian and multimodal continuous columns; a conditional generator, which addresses imbalanced discrete columns; and training-by-sampling, which ensures that all categories, including rare ones, are explored evenly during training. The generator and critic are fully-connected networks, trained with several recent GAN training techniques.
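
The mode-specific normalization mentioned above can be illustrated with a minimal sketch. In the paper, each continuous column is described by a Gaussian mixture, and a value is encoded as a one-hot indicator of one sampled mode plus a scalar offset within that mode, scaled by four standard deviations. The sketch below assumes the mixture parameters (`weights`, `means`, `stds`) have already been fitted; the function names and the fallback for values far from every mode are illustrative assumptions, not the authors' implementation.

```python
import math
import random


def gaussian_pdf(x, mean, std):
    """Density of a univariate normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def mode_specific_normalize(value, weights, means, stds, rng):
    """Encode one continuous value as (alpha, beta).

    beta is a one-hot list naming the sampled mixture mode;
    alpha is the value's offset within that mode, divided by 4 * std.
    """
    # Weighted density of the value under each mode.
    densities = [w * gaussian_pdf(value, m, s)
                 for w, m, s in zip(weights, means, stds)]
    if sum(densities) == 0.0:
        # Assumption for this sketch: if the value is too far from every
        # mode for the densities to be representable, use the nearest mean.
        k = min(range(len(means)), key=lambda i: abs(value - means[i]))
    else:
        # Sample one mode; choices() accepts unnormalized weights.
        k = rng.choices(range(len(densities)), weights=densities, k=1)[0]

    alpha = (value - means[k]) / (4.0 * stds[k])
    beta = [1 if i == k else 0 for i in range(len(means))]
    return alpha, beta


def mode_specific_denormalize(alpha, beta, means, stds):
    """Invert the encoding: recover the original value from (alpha, beta)."""
    k = beta.index(1)
    return alpha * 4.0 * stds[k] + means[k]


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical two-mode column: one mode near 0, one near 10.
    weights, means, stds = [0.6, 0.4], [0.0, 10.0], [1.0, 2.0]
    alpha, beta = mode_specific_normalize(9.0, weights, means, stds, rng)
    print(alpha, beta, mode_specific_denormalize(alpha, beta, means, stds))
```

Because the mode is sampled rather than chosen deterministically, a value lying between two modes can be assigned to either one; the decoding step recovers the original value in both cases.
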
The paper concludes that CTGAN is a flexible and robust model for learning the distribution of columns with complex distributions.
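
The conditional generator and training-by-sampling can be sketched in a similar way. In the paper, a discrete column is chosen with equal probability, a category in that column is sampled with mass proportional to the logarithm of its frequency, the choice is encoded as a one-hot condition vector, and a real row matching the condition is drawn for the critic. The sketch below is a simplified illustration under those assumptions: rows are plain dictionaries, all names are hypothetical, and `log(1 + count)` is used so that a category seen only once keeps nonzero mass.

```python
import math
import random
from collections import Counter


def sample_condition(rows, discrete_columns, rng):
    """Pick a discrete column uniformly, then a category by log-frequency."""
    column = rng.choice(discrete_columns)
    counts = Counter(row[column] for row in rows)
    categories = sorted(counts)
    # Assumption: log(1 + count) keeps the mass of singleton categories
    # above zero while still flattening the imbalance.
    masses = [math.log(1 + counts[c]) for c in categories]
    category = rng.choices(categories, weights=masses, k=1)[0]
    return column, category


def build_cond_vector(rows, discrete_columns, column, category):
    """Concatenate one-hot masks over all discrete columns.

    Only the entry for the chosen (column, category) pair is set to 1.
    """
    cond = []
    for col in discrete_columns:
        for value in sorted({row[col] for row in rows}):
            cond.append(1 if (col == column and value == category) else 0)
    return cond


def sample_matching_row(rows, column, category, rng):
    """Draw a real training row that satisfies the sampled condition."""
    matches = [row for row in rows if row[column] == category]
    return rng.choice(matches)


if __name__ == "__main__":
    # Hypothetical imbalanced table: 98 rows of one kind, 2 of another.
    rows = ([{"color": "red", "size": "S"}] * 98
            + [{"color": "blue", "size": "L"}] * 2)
    rng = random.Random(0)
    column, category = sample_condition(rows, ["color", "size"], rng)
    cond = build_cond_vector(rows, ["color", "size"], column, category)
    print(column, category, cond, sample_matching_row(rows, column, category, rng))
```

With plain frequency sampling, the rare category in this toy table would be drawn about 2% of the time; under the log-frequency weighting used here it is drawn far more often, which is the effect training-by-sampling relies on.
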