2024 | Myung Jun Kim, Léo Grinsztajn, Gaël Varoquaux
CARTE (Context Aware Representation of Table Entries) is a neural architecture for tabular learning that removes the need for schema or string matching across tables. Each table row is turned into a small graph: cell entries become nodes, column names become edge features, and both are initialized with language-model string embeddings, so discrete entries and open vocabularies are handled without manual categorical encoding. A graph-attentional network then contextualizes each entry with its column name and with the other entries of the row.

The model is pretrained on YAGO3, a large knowledge base, using self-supervised learning with a contrastive loss. Because rows are graphs rather than fixed-width feature vectors, the pretrained model transfers to downstream tables whose columns do not match the pretraining schema.

CARTE is then fine-tuned for downstream tasks, covering single-table learning, transfer learning between tables, and joint learning across tables. It is robust to missing values and handles both numerical and string entries. Validated across multiple datasets, it shows consistent improvements over the baselines, outperforming tree-based models as well as pipelines with dedicated feature engineering.

The study highlights how much signal the strings in tabular data carry, and argues that an architecture supporting pretraining on large-scale data and fine-tuning on varied tasks opens the door to large pretrained models, i.e. foundation models, for tables. Minimal sketches of the graph construction, the contextualizing attention step, the pretraining loss, and a fine-tuning setup follow below.
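The graph construction can be sketched as follows. This is a minimal illustration rather than the paper's code: `embed_string` is a hash-based stand-in for the FastText embeddings the paper uses, the star layout (a center node connected to one node per cell, with the column-name embedding on the edge) follows the paper's description, and encoding a numerical entry as the value times the column embedding is my reading of it.

```python
import hashlib
import numpy as np

def embed_string(text: str, dim: int = 32) -> np.ndarray:
    """Stand-in for the paper's FastText string embeddings: a fixed
    random vector derived from a hash, just to keep the sketch runnable."""
    seed = int.from_bytes(hashlib.md5(text.encode()).digest()[:4], "little")
    return np.random.default_rng(seed).standard_normal(dim)

def row_to_graphlet(row: dict, dim: int = 32):
    """One table row -> a star graphlet: a center node plus one node per
    non-missing cell, each edge carrying the column-name embedding."""
    nodes = [np.zeros(dim)]          # index 0: center node for the row
    edges, edge_feats = [], []
    for col, val in row.items():
        if val is None:              # robustness to missing values: skip the cell
            continue
        if isinstance(val, str):     # string entry: embed the value itself
            node = embed_string(val, dim)
        else:                        # numerical entry: scale the column embedding
            node = float(val) * embed_string(col, dim)
        nodes.append(node)
        edges.append((0, len(nodes) - 1))
        edge_feats.append(embed_string(col, dim))
    return np.stack(nodes), edges, np.stack(edge_feats)

X, E, E_feat = row_to_graphlet({"name": "Ch. Margaux", "vintage": 2015, "region": None})
```

Because every row is encoded this way, two tables with disjoint column names still live in the same embedding space, which is what makes matching-free transfer possible.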
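The contextualization step can be pictured as attention in which the column name gates how an entry is read. A single-head sketch, assuming elementwise modulation of keys and values by the edge embedding (the paper itself uses a multi-head graph-attention variant):

```python
import torch
import torch.nn as nn

class EdgeGatedAttention(nn.Module):
    """Single-head sketch of column-aware attention: each neighbor's key
    and value are modulated by the edge (column-name) embedding before
    attending, so the same string is read differently under different columns."""
    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)

    def forward(self, center: torch.Tensor, neighbors: torch.Tensor,
                edges: torch.Tensor) -> torch.Tensor:
        # center: (dim,); neighbors, edges: (n_cells, dim)
        q = self.q(center)                               # query from the row node
        k = self.k(neighbors * edges)                    # keys gated by column names
        v = self.v(neighbors * edges)                    # values gated the same way
        attn = torch.softmax(k @ q / center.numel() ** 0.5, dim=0)
        return attn @ v                                  # contextualized row embedding

layer = EdgeGatedAttention(32)
out = layer(torch.zeros(32), torch.randn(4, 32), torch.randn(4, 32))
```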
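The pretraining objective is summarized above only as "self-supervised learning with contrastive loss"; an InfoNCE-style stand-in, assuming two augmented views per YAGO3 graphlet, would look like this:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """InfoNCE stand-in for the paper's contrastive objective: row i of z1
    and z2 are embeddings of two views of the same graphlet (positives);
    every other graphlet in the batch serves as a negative."""
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    logits = z1 @ z2.T / tau                    # (batch, batch) cosine similarities
    labels = torch.arange(z1.size(0))           # positives lie on the diagonal
    return F.cross_entropy(logits, labels)

loss = contrastive_loss(torch.randn(8, 32), torch.randn(8, 32))
```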
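Fine-tuning then swaps the pretraining head for a task head on top of the pretrained encoder. A hypothetical wrapper (the encoder interface and the head shape are assumptions, not the paper's code):

```python
import torch
import torch.nn as nn

class CarteFineTuner(nn.Module):
    """Hypothetical fine-tuning wrapper: reuse the pretrained graph encoder
    and train a small task head on top, covering both single-table learning
    and transfer to a new table with different columns."""
    def __init__(self, encoder: nn.Module, dim: int, n_outputs: int = 1):
        super().__init__()
        self.encoder = encoder                   # pretrained CARTE encoder (assumed API)
        self.head = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, n_outputs)
        )

    def forward(self, graphlet) -> torch.Tensor:
        return self.head(self.encoder(graphlet))  # per-row prediction from its graphlet
```

For practical use, the authors released an implementation on GitHub (soda-inria/carte), which wraps this pipeline end to end.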