Large Language Models Can Automatically Engineer Features for Few-Shot Tabular Learning

2024 | Sungwon Han, Jinsung Yoon, Sercan Ö. Arik, Tomas Pfister
The paper introduces FeatLLM, a novel in-context learning framework that leverages Large Language Models (LLMs) to engineer features for few-shot tabular learning. FeatLLM addresses two limitations of existing LLM-based approaches: it eliminates the need for multiple LLM queries per sample at inference time, and it requires only API-level access to the LLM. The framework prompts the LLM to generate rules that define feature conditions; these rules are applied to the raw inputs to create new binary features, which are then fed into a simple downstream machine learning model, such as a linear model, to infer class likelihoods. FeatLLM employs ensembling and bagging to improve robustness and to handle datasets with many features. The proposed method outperforms existing LLM-based approaches, such as TabLLM and STUNT, on various tabular datasets, achieving strong performance with fewer training samples. The paper also includes a detailed evaluation, ablation studies, and discussions of the impact of spurious correlations and hyperparameters.
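To make the rule-to-feature pipeline concrete, here is a minimal sketch of the idea under stated assumptions: the hand-written rule functions, the column names (age, education, hours_per_week), and the use of scikit-learn's LogisticRegression as the simple downstream model are all illustrative stand-ins, not the paper's exact implementation. In FeatLLM the rules themselves are generated by the LLM, and one rule set is produced per ensemble member.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical stand-ins for LLM-generated rules: each rule maps a raw sample
# (a dict of column -> value) to a binary condition. In FeatLLM, rules like
# these are returned by the LLM as executable feature conditions.
example_rules = [
    lambda row: row["age"] > 40,
    lambda row: row["education"] in {"Bachelors", "Masters"},
    lambda row: row["hours_per_week"] >= 45,
]

def featurize(rows, rules):
    """Apply every rule to every sample, yielding a binary feature matrix."""
    return np.array([[int(rule(row)) for rule in rules] for row in rows])

def fit_ensemble(train_rows, train_labels, rule_sets):
    """Fit one simple linear model per rule set (bagging-style ensemble)."""
    models = []
    for rules in rule_sets:
        X = featurize(train_rows, rules)
        models.append(LogisticRegression().fit(X, train_labels))
    return models

def predict_proba(test_rows, rule_sets, models):
    """Average class likelihoods across ensemble members."""
    probs = [model.predict_proba(featurize(test_rows, rules))
             for rules, model in zip(rule_sets, models)]
    return np.mean(probs, axis=0)
```

In this sketch, averaging the per-member class likelihoods plays the role of the paper's ensembling step; at inference time no LLM call is needed, since only the cheap rule functions and the fitted linear models are applied to each sample.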