TabPFGen – Tabular Data Generation with TabPFN

7 Jun 2024 | Junwei Ma, Apoorv Dankar, George Stein, Guangwei Yu, Anthony Caterini
TabPFGen is a novel energy-based generative model that leverages the pre-trained TabPFN, a transformer-based model originally designed for in-context discriminative tabular tasks. By converting TabPFN into an energy-based model, TabPFGen inherits its in-context learning capability without requiring additional training or hyperparameter tuning. This approach allows for efficient generation of synthetic tabular data, enabling tasks such as data augmentation, class balancing, and imputation. The model uses the stochastic gradient Langevin dynamics (SGLD) algorithm for sample generation, and it demonstrates strong performance on standard generative modelling tasks.

Experiments on 18 datasets from OpenML-CC18 show that TabPFGen significantly improves downstream model performance, outperforming competitive baselines in data augmentation, class balancing, and imputation. TabPFGen also generates samples that closely align with the training data distribution, showcasing its potential for practical tabular data generation. The method is efficient and effective, with the potential for further improvements as transformer architectures advance. However, current limitations such as input size constraints and a focus on numerical datasets restrict its applicability to large-scale data.
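To give a sense of the SGLD sampling step mentioned above, the following is a minimal, self-contained sketch of stochastic gradient Langevin dynamics over a generic energy function. It is not TabPFGen's actual implementation: the quadratic toy energy, step size, and step count are illustrative assumptions, and in TabPFGen the gradient would come from an energy derived from TabPFN's in-context predictions rather than a closed-form function.

```python
import numpy as np

def sgld_sample(grad_energy, x_init, n_steps=2000, step_size=1e-2, rng=None):
    """Draw approximate samples from p(x) ∝ exp(-E(x)) via SGLD.

    At each step, move along the negative energy gradient and inject
    Gaussian noise whose scale is tied to the step size.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    x = np.array(x_init, dtype=float)
    for _ in range(n_steps):
        x -= 0.5 * step_size * grad_energy(x)          # gradient descent term
        x += np.sqrt(step_size) * rng.standard_normal(x.shape)  # Langevin noise
    return x

# Toy example: E(x) = ||x||^2 / 2, so grad E(x) = x and the stationary
# distribution is approximately a standard normal regardless of x_init.
samples = sgld_sample(lambda x: x, x_init=np.full((1000, 2), 5.0))
```

In the generative setting described in the abstract, the same loop would start from noisy copies of real rows and iteratively refine them into synthetic tabular samples, with TabPFN supplying the energy gradient in-context instead of a hand-written function.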