Large Scale Transfer Learning for Tabular Data via Language Modeling

17 Jun 2024 | Josh Gardner, Juan C. Perdomo, Ludwig Schmidt
This paper introduces TABULA-8B, a large language model for tabular prediction, and T4, a high-quality training dataset for tabular data. TABULA-8B is trained on T4, which contains 3.1M unique tables with over 1.6B rows. The model is fine-tuned for tabular prediction tasks, including classification and binned regression, using a novel packing and attention scheme.

Evaluation across 329 datasets shows that TABULA-8B achieves zero-shot accuracy more than 15 percentage points (pp) higher than random guessing, outperforming state-of-the-art tabular prediction models such as XGBoost and TabPFN. In few-shot settings, TABULA-8B is 5-15 pp more accurate than these baselines. An ablation study of row-causal tabular masking (RCTM) shows that allowing tokens to attend only to samples from the same table improves few-shot learning. The study also evaluates robustness to the removal of informative column headers, feature dropout, and changes in column order, finding that TABULA-8B maintains consistent performance across these scenarios, and it examines potential data contamination in the training dataset, finding no significant effect on model performance.

Finally, the paper discusses limitations of TABULA-8B, including its limited context window and computational costs, and suggests future research directions for improving and extending the model.
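To make the few-shot setup concrete, the sketch below shows one way table rows could be serialized into a text prompt for an LLM: k labeled rows are packed before an unlabeled query row, and the model is asked to complete the target value. The "column is value" format, prompt template, and helper names are illustrative assumptions, not the paper's exact serialization scheme.

```python
# Minimal sketch of few-shot tabular prediction with an LLM, in the spirit of
# TABULA-8B. Serialization format and prompt wording are assumptions.
from typing import Dict, List


def serialize_row(row: Dict[str, object], target_col: str, include_target: bool) -> str:
    """Render one table row as 'column is value' pairs, optionally with the label."""
    parts = [f"{col} is {val}" for col, val in row.items() if col != target_col]
    text = ", ".join(parts)
    if include_target:
        text += f". {target_col}: {row[target_col]}"
    return text


def build_few_shot_prompt(shots: List[Dict[str, object]],
                          query: Dict[str, object],
                          target_col: str,
                          classes: List[str]) -> str:
    """Pack k labeled rows (the 'shots') followed by the unlabeled query row."""
    lines = [f"Predict {target_col}. Possible values: {', '.join(classes)}."]
    for shot in shots:
        lines.append(serialize_row(shot, target_col, include_target=True))
    lines.append(serialize_row(query, target_col, include_target=False) + f". {target_col}:")
    return "\n".join(lines)


# Example usage with a toy table.
shots = [
    {"age": 34, "income": 52000, "defaulted": "no"},
    {"age": 61, "income": 18000, "defaulted": "yes"},
]
query = {"age": 45, "income": 30000, "defaulted": None}
print(build_few_shot_prompt(shots, query, "defaulted", ["yes", "no"]))
```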
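The binned-regression task mentioned above treats a continuous target as classification over discrete bins. The sketch below illustrates one plausible quantile-binning scheme; the number of bins and the binning strategy are assumptions, not details taken from the paper.

```python
# A minimal sketch of binned regression: a continuous target is discretized so
# the LLM can predict a bin label instead of a raw number. Bin count and the
# quantile strategy are illustrative assumptions.
import numpy as np


def bin_target(values: np.ndarray, n_bins: int = 4):
    """Return per-row bin labels (0..n_bins-1) and the quantile edges used."""
    edges = np.quantile(values, np.linspace(0, 1, n_bins + 1))
    # np.digitize maps each value to a bin index using the interior edges.
    labels = np.clip(np.digitize(values, edges[1:-1], right=True), 0, n_bins - 1)
    return labels, edges


prices = np.array([120.0, 340.0, 90.0, 560.0, 210.0])
labels, edges = bin_target(prices, n_bins=4)
print(labels, edges)
```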
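The row-causal tabular masking ablation can also be pictured with a small attention-mask sketch: within a packed training sequence, a token attends only to earlier tokens that come from rows of the same table. How packed sequences are tagged with table identifiers here is an assumption for illustration.

```python
# A minimal sketch of a row-causal tabular mask (RCTM): attention is causal and
# restricted to tokens from the same source table within a packed sequence.
import numpy as np


def row_causal_tabular_mask(table_ids: np.ndarray) -> np.ndarray:
    """Boolean (seq_len, seq_len) mask; True where attention is allowed.

    table_ids[i] identifies which source table token i belongs to.
    """
    seq_len = table_ids.shape[0]
    causal = np.tril(np.ones((seq_len, seq_len), dtype=bool))   # j <= i
    same_table = table_ids[:, None] == table_ids[None, :]       # same table only
    return causal & same_table


# Example: 6 packed tokens, the first 3 from table 0, the last 3 from table 1.
ids = np.array([0, 0, 0, 1, 1, 1])
print(row_causal_tabular_mask(ids).astype(int))
```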