17 Jun 2024 | Josh Gardner, Juan C. Perdomo, Ludwig Schmidt
This paper introduces TABULA-8B, a language model designed for tabular data prediction, addressing the lack of large-scale transfer learning for tabular data. TABULA-8B is fine-tuned on a large, high-quality dataset, T4, which is derived from the TabLib corpus and includes over 1.6 billion rows from 3.1 million unique tables. The model is evaluated on 329 datasets across five benchmarks, demonstrating zero-shot accuracy that is 17 percentage points higher than random guessing and outperforming state-of-the-art methods (XGBoost, TabPFN) by 5-15 percentage points in the few-shot setting. The paper also details the construction of T4, the filtering and quality-control methods used, and the open-source release of the model, data, and code. Key contributions include the TABULA-8B model, the T4 dataset, and the row-causal tabular masking (RCTM) attention scheme, which enhances few-shot learning capabilities. The paper discusses limitations and suggests future research directions, including improvements in data filtering, model scaling, and addressing potential biases.
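To make the RCTM idea concrete, here is a minimal sketch of a row-causal style attention mask. It assumes the paper's described intent (tokens may attend causally to earlier rows of the same table packed into a training sequence, but not across table boundaries); the function name `row_causal_table_mask` and the `table_ids` encoding are illustrative choices, not the authors' implementation.

```python
import torch

def row_causal_table_mask(table_ids: torch.Tensor) -> torch.Tensor:
    """Illustrative sketch of a row-causal tabular masking (RCTM) style mask.

    `table_ids` assigns each token in a packed sequence to its source table
    (shape: [seq_len]). A token may attend to an earlier token only if both
    come from the same table, so prior rows act as in-context examples
    without leaking information across tables packed in one sequence.
    """
    seq_len = table_ids.shape[0]
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    same_table = table_ids.unsqueeze(0) == table_ids.unsqueeze(1)
    return causal & same_table  # True = attention allowed

# Example: two tables packed into one sequence of six tokens.
mask = row_causal_table_mask(torch.tensor([0, 0, 0, 1, 1, 1]))
print(mask.int())
```

Under this sketch, tokens from the second table cannot attend back into the first table, which is how such a scheme would encourage few-shot behavior from earlier rows of the same table during fine-tuning.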