07/2024 | Xi Fang, Weijie Xu, Fiona Anting Tan, Jiani Zhang, Ziqing Hu, Yanjun Qi, Scott Nickleach, Diego Socolinsky, Srinivasan Sengamedu, Christos Faloutsos
This survey provides a comprehensive review of recent advancements in applying large language models (LLMs) to tabular data for tasks such as prediction, data generation, and table understanding. It identifies key techniques, metrics, datasets, and methodologies used in this domain, while highlighting strengths, limitations, and gaps in existing literature. The paper also offers insights into future research directions and provides relevant code and datasets for further exploration.
Tabular data, characterized by heterogeneity, sparsity, and complex dependencies, poses unique challenges for modeling. Traditional methods such as gradient-boosted decision trees (GBDT) remain state-of-the-art for classification tasks, while deep learning approaches based on data transformation, differentiable trees, attention mechanisms, and regularization have shown promise on tabular data. More recently, LLMs have emerged as a powerful alternative, offering capabilities such as in-context learning, instruction following, and multi-step reasoning.
The paper discusses various applications of LLMs in tabular data modeling, including prediction, data synthesis, and table understanding. For prediction tasks, tabular records are serialized into text-based inputs, and the LLM is then prompted or fine-tuned for the specific task. Data synthesis involves generating synthetic data for augmentation, imputation, and class rebalancing. Table understanding covers tasks such as question answering, natural language inference, and Text2SQL, in which LLMs translate natural-language questions into structured queries.
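As a minimal sketch of the prediction workflow described above, the snippet below serializes a single table row into natural-language text and wraps it in a classification prompt. The column names, the example record, and the prompt template are illustrative assumptions, not a template prescribed by the survey.

```python
# Hedged sketch: turning one tabular record into a text prompt for an LLM-based
# classifier. The record, labels, and wording are hypothetical examples.

def serialize_row(row: dict) -> str:
    """Render a row as 'The <column> is <value>.' sentences (one common scheme)."""
    return " ".join(f"The {col} is {val}." for col, val in row.items())

def build_prediction_prompt(row: dict, task: str, labels: list[str]) -> str:
    """Combine a task instruction, the serialized row, and the allowed answers."""
    return (
        f"{task}\n"
        f"{serialize_row(row)}\n"
        f"Answer with one of: {', '.join(labels)}.\n"
        "Answer:"
    )

if __name__ == "__main__":
    record = {"age": 42, "occupation": "teacher", "hours_per_week": 38}
    prompt = build_prediction_prompt(
        record,
        task="Predict whether this person's income exceeds 50K per year.",
        labels=["yes", "no"],
    )
    print(prompt)  # text that would be sent to the LLM or used as a fine-tuning input
```

The same serialized text can serve either as an in-context prompt or as a training example when fine-tuning the model for the task.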
Key techniques for applying LLMs to tabular data include serialization, table manipulations, prompt engineering, and building end-to-end systems. Serialization converts tabular data into text formats, while table manipulations help compress and process large tables. Prompt engineering designs effective prompts to guide LLMs toward specific tasks, and end-to-end systems allow LLMs to interact with databases and execute commands.
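To make the serialization step concrete, here is a small sketch comparing three text formats commonly used in this literature: a markdown table, JSON records, and key-value sentences. The toy table and helper names are illustrative assumptions, not artifacts from the survey.

```python
# Hedged sketch: three common ways to serialize a small table into text.
# The example table is hypothetical.
import json

TABLE = [
    {"city": "Lisbon", "population": 545_000, "country": "Portugal"},
    {"city": "Porto", "population": 232_000, "country": "Portugal"},
]

def to_markdown(rows: list[dict]) -> str:
    """Markdown table: compact and familiar to most instruction-tuned LLMs."""
    headers = list(rows[0].keys())
    lines = [
        "| " + " | ".join(headers) + " |",
        "| " + " | ".join("---" for _ in headers) + " |",
    ]
    lines += ["| " + " | ".join(str(r[h]) for h in headers) + " |" for r in rows]
    return "\n".join(lines)

def to_json_records(rows: list[dict]) -> str:
    """JSON records: explicit structure, easy to parse back programmatically."""
    return json.dumps(rows, indent=2)

def to_key_value(rows: list[dict]) -> str:
    """Key-value lines: one row per line, 'column: value' pairs."""
    return "\n".join("; ".join(f"{k}: {v}" for k, v in row.items()) for row in rows)

if __name__ == "__main__":
    for name, fn in [("markdown", to_markdown), ("json", to_json_records), ("key-value", to_key_value)]:
        print(f"--- {name} ---")
        print(fn(TABLE))
```

Which format works best is task- and model-dependent; the survey treats the choice of serialization as one of the main design decisions when applying LLMs to tables.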
The paper also discusses the challenges and opportunities of using LLMs for tabular data modeling, including the need for robustness to table manipulations, the importance of context in task performance, and the potential for LLMs to solve complex tasks beyond traditional NLP applications. The survey concludes with a discussion of future research directions, emphasizing the need for improved performance, better representations, and standardized benchmarks in tabular data modeling.