4 Jan 2024 | Jing Wu*, Suiyao Chen*, Qi Zhao, Renat Sergazinov, Chen Li, Shengjie Liu, Chongchao Zhao, Tianpei Xie, Hanqing Guo, Cheng Ji, Daniel Cociorva, Hakan Brunzell
SwitchTab is a novel self-supervised method designed to capture latent dependencies in tabular data. It employs an asymmetric encoder-decoder framework to decouple mutual and salient features among data pairs, resulting in more representative embeddings that improve decision boundaries and downstream task performance. The method is validated through extensive experiments across various domains, demonstrating superior performance in end-to-end prediction tasks with fine-tuning. Pre-trained salient embeddings can be used as plug-and-play features to enhance traditional classification methods like Logistic Regression and XGBoost. SwitchTab also enables explainable representations through visualization of decoupled mutual and salient features in the latent space.
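The plug-and-play usage described above amounts to concatenating the pre-trained salient embedding of each row onto its raw features before fitting a traditional model. A minimal sketch of that pattern (the function name and the downstream-model choice are illustrative assumptions, not from the paper's code):

```python
def augment_with_salient(raw_features, salient_embeddings):
    """Append each row's pre-trained salient embedding to its raw features,
    producing an augmented design matrix for a traditional classifier
    such as Logistic Regression or XGBoost."""
    return [list(x) + list(s) for x, s in zip(raw_features, salient_embeddings)]

# Example: two raw features per row, three salient dimensions per row.
rows = augment_with_salient([[1.0, 2.0]], [[0.5, 0.6, 0.7]])
# The augmented row now carries both views: [1.0, 2.0, 0.5, 0.6, 0.7]
```

Any estimator that accepts a feature matrix can then be trained on the augmented rows unchanged, which is what makes the embeddings "plug-and-play."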
SwitchTab introduces a feature corruption strategy to learn robust embeddings for downstream tasks. Given a pair of samples, it corrupts each one, encodes the corrupted inputs with a shared encoder, and decouples each resulting feature vector into a mutual component and a salient component using two projectors. The decoder then reconstructs the original data: the salient features carry the sample-specific information that determines each sample's identity, while the mutual features carry information common to the pair and can therefore be switched between the two samples without harming reconstruction. The method can be trained in both self-supervised and labeled settings, showing strong performance across diverse training scenarios.
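The corrupt-encode-decouple-switch-reconstruct pipeline above can be sketched as follows. This is a conceptual sketch only: the tiny stand-in functions (`encode`, `project_mutual`, `project_salient`, `decode`) are assumptions replacing the paper's neural networks, and only the data flow, not the training objective, is shown.

```python
import random

def corrupt(x, rate=0.3):
    """Randomly overwrite a fraction of features with values drawn from
    other positions in the row (a simple tabular corruption scheme)."""
    x = list(x)
    for i in range(len(x)):
        if random.random() < rate:
            x[i] = random.choice(x)
    return x

# Toy stand-ins for the encoder, the two projectors, and the decoder.
def encode(x): return [v * 0.5 for v in x]            # shared encoder
def project_mutual(z): return z[: len(z) // 2]        # mutual projector
def project_salient(z): return z[len(z) // 2 :]       # salient projector
def decode(m, s): return [v * 2.0 for v in (m + s)]   # shared decoder

def switchtab_step(x1, x2):
    """One forward pass for a sample pair: corrupt and encode both,
    split each embedding into mutual and salient parts, then reconstruct
    twice -- once keeping each sample's own mutual part ("recovered") and
    once with the mutual parts switched across the pair ("switched").
    Training would minimize reconstruction loss on all four outputs."""
    z1, z2 = encode(corrupt(x1)), encode(corrupt(x2))
    m1, s1 = project_mutual(z1), project_salient(z1)
    m2, s2 = project_mutual(z2), project_salient(z2)
    recovered = (decode(m1, s1), decode(m2, s2))
    switched = (decode(m2, s1), decode(m1, s2))  # same targets x1, x2
    return recovered, switched
```

Because the switched reconstructions are still trained against the original samples, the only place sample-specific information can survive is in the salient component, which is what forces the decoupling.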
SwitchTab's effectiveness is demonstrated through experiments on various datasets, including standard benchmarks and additional public datasets. It outperforms traditional and deep learning methods in most classification tasks, and its pre-trained embeddings further enhance the performance of traditional models like XGBoost. The results show that SwitchTab achieves optimal or near-optimal performance in most classification tasks, while traditional methods like XGBoost and CatBoost still dominate in regression tasks. Its ability to decouple mutual and salient features, yielding structured and explainable representations, makes it a valuable tool for tabular representation learning.