07/2024 | Xi Fang, Weijie Xu, Fiona Anting Tan, Jiani Zhang, Ziqing Hu, Yanjun Qi, Scott Nickleach, Diego Socolinsky, Srinivasan Sengamedu, Christos Faloutsos
This survey provides a comprehensive review of recent advancements in applying large language models (LLMs) to tabular data for tasks such as prediction, data generation, and table understanding. It identifies key techniques, metrics, datasets, and methodologies used in this domain, while highlighting strengths, limitations, and gaps in existing literature. The paper also offers insights into future research directions and provides relevant code and datasets for further exploration.
Tabular data, characterized by heterogeneity, sparsity, and complex dependencies, poses unique challenges for modeling. Traditional methods such as gradient-boosted decision trees (GBDT) remain state-of-the-art for classification tasks, while deep learning approaches based on data transformation, differentiable trees, attention mechanisms, and regularization have shown promise on tabular data. More recently, LLMs have emerged as a powerful alternative, offering capabilities such as in-context learning, instruction following, and multi-step reasoning.
The paper discusses various applications of LLMs in tabular data modeling, including prediction, data synthesis, and table understanding. For prediction tasks, tabular records are serialized into text-based inputs, and the LLM is then prompted or fine-tuned for the specific task. Data synthesis involves generating synthetic data for augmentation, imputation, and class rebalancing. Table understanding covers tasks such as question answering, natural language inference, and Text2SQL, in which LLMs translate natural-language questions into structured queries.
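As a minimal sketch of the prediction workflow described above, the snippet below serializes a single table row into natural-language text and wraps it in a classification prompt. The column names, the example record, and the prompt template are illustrative assumptions, not a template prescribed by the survey.

```python
# Hedged sketch: turning one tabular record into a text prompt for an LLM-based
# classifier. The record, labels, and wording are hypothetical examples.

def serialize_row(row: dict) -> str:
    """Render a row as 'The <column> is <value>.' sentences (one common scheme)."""
    return " ".join(f"The {col} is {val}." for col, val in row.items())

def build_prediction_prompt(row: dict, task: str, labels: list[str]) -> str:
    """Combine a task instruction, the serialized row, and the allowed answers."""
    return (
        f"{task}\n"
        f"{serialize_row(row)}\n"
        f"Answer with one of: {', '.join(labels)}.\n"
        "Answer:"
    )

if __name__ == "__main__":
    record = {"age": 42, "occupation": "teacher", "hours_per_week": 38}
    prompt = build_prediction_prompt(
        record,
        task="Predict whether this person's income exceeds 50K per year.",
        labels=["yes", "no"],
    )
    print(prompt)  # text that would be sent to the LLM or used as a fine-tuning input
```

The same serialized text can serve either as an in-context prompt or as a training example when fine-tuning the model for the task.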
Key techniques for applying LLMs to tabular data include serialization, table manipulations, prompt engineering, and building end-to-end systems. Serialization converts tabular data into text formats, while table manipulations help compress and process large tables. Prompt engineering designs effective prompts to guide LLMs toward specific tasks, and end-to-end systems allow LLMs to interact with databases and execute commands.
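To make the serialization step concrete, here is a small sketch comparing three text formats commonly used in this literature: a markdown table, JSON records, and key-value sentences. The toy table and helper names are illustrative assumptions, not artifacts from the survey.

```python
# Hedged sketch: three common ways to serialize a small table into text.
# The example table is hypothetical.
import json

TABLE = [
    {"city": "Lisbon", "population": 545_000, "country": "Portugal"},
    {"city": "Porto", "population": 232_000, "country": "Portugal"},
]

def to_markdown(rows: list[dict]) -> str:
    """Markdown table: compact and familiar to most instruction-tuned LLMs."""
    headers = list(rows[0].keys())
    lines = [
        "| " + " | ".join(headers) + " |",
        "| " + " | ".join("---" for _ in headers) + " |",
    ]
    lines += ["| " + " | ".join(str(r[h]) for h in headers) + " |" for r in rows]
    return "\n".join(lines)

def to_json_records(rows: list[dict]) -> str:
    """JSON records: explicit structure, easy to parse back programmatically."""
    return json.dumps(rows, indent=2)

def to_key_value(rows: list[dict]) -> str:
    """Key-value lines: one row per line, 'column: value' pairs."""
    return "\n".join("; ".join(f"{k}: {v}" for k, v in row.items()) for row in rows)

if __name__ == "__main__":
    for name, fn in [("markdown", to_markdown), ("json", to_json_records), ("key-value", to_key_value)]:
        print(f"--- {name} ---")
        print(fn(TABLE))
```

Which format works best is task- and model-dependent; the survey treats the choice of serialization as one of the main design decisions when applying LLMs to tables.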
The paper also discusses the challenges and opportunities of using LLMs for tabular data modeling, including the need for robustness to table manipulations, the importance of context in task performance, and the potential for LLMs to solve complex tasks beyond traditional NLP applications. The survey concludes with a discussion of future research directions, emphasizing the need for improved performance, better representations, and standardized benchmarks in tabular data modeling.