Few-shot Adaptation of Multi-modal Foundation Models: A Survey

4 Jan 2024 | Fan Liu, Tianshu Zhang, Wenwen Dai, Wenwen Cai, Xiaocong Zhou, and Delong Chen
This survey provides a comprehensive overview of few-shot adaptation methods for multi-modal foundation models. Multi-modal models such as CLIP have become the new generation of visual foundation models, offering robust, aligned semantic representations learned from billions of internet image-text pairs. However, their performance is often limited in fine-grained domains such as medical imaging and remote sensing.

To address this, researchers have explored three main approaches, all aimed at improving the generalization of multi-modal models in few-shot scenarios:

- Prompt-based methods, such as CoOp and CoCoOp, replace hand-crafted text prompts with learnable context vectors to adapt the model to specific tasks.
- Adapter-based methods, such as CLIP-Adapter and Tip-Adapter, introduce small adapter modules to fine-tune the model with limited data.
- External knowledge-based methods, such as CuPL and SgVA-CLIP, leverage external knowledge to enhance model performance.

The survey also derives a generalization error bound for multi-modal models, showing that performance is constrained by the domain gap, model capacity, and sample size. It reviews 11 commonly used datasets and four experimental setups for evaluating these methods, and highlights the effectiveness of prompt-based, adapter-based, and external knowledge-based approaches in improving generalization. Existing methods nonetheless face challenges such as ineffective adaptation to domain distributions, limited model adaptability, and insufficient use of data and knowledge. To address these challenges, the survey proposes three directions: adaptive domain generalization, adaptive model selection, and adaptive knowledge utilization, all intended to enhance the performance of multi-modal foundation models in few-shot scenarios.
Finally, the survey discusses the role of these methods in improving the generalization ability of foundation models and their potential as important future research directions.
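To make the adapter-based, training-free flavor of adaptation more concrete, the sketch below follows the general idea behind Tip-Adapter: blend zero-shot CLIP logits with a key-value cache built from the few-shot examples. All names, dimensions, and the alpha/beta values here are illustrative assumptions, not the implementation from any of the surveyed papers.

```python
import numpy as np

def tip_adapter_logits(test_feat, text_weights, cache_keys, cache_values,
                       alpha=1.0, beta=5.5):
    """Training-free few-shot logits in the spirit of Tip-Adapter.

    test_feat:    (d,)   L2-normalized feature of the test image
    text_weights: (d, C) L2-normalized class text embeddings (zero-shot classifier)
    cache_keys:   (N, d) L2-normalized features of the N few-shot images
    cache_values: (N, C) one-hot labels of the few-shot images
    alpha, beta:  blending weight and affinity sharpness (illustrative values)
    """
    zero_shot = test_feat @ text_weights                         # (C,) cosine logits
    # Affinity of the test feature to each cached few-shot example.
    affinity = np.exp(-beta * (1.0 - test_feat @ cache_keys.T))  # (N,)
    few_shot = affinity @ cache_values                           # (C,) cache votes
    return zero_shot + alpha * few_shot

# Toy usage with random (hypothetical) features.
rng = np.random.default_rng(0)
d, C, N = 8, 3, 6  # feature dim, number of classes, cache size

def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

text_weights = l2norm(rng.normal(size=(C, d))).T        # (d, C)
cache_keys = l2norm(rng.normal(size=(N, d)))            # (N, d)
cache_values = np.eye(C)[rng.integers(0, C, size=N)]    # (N, C) one-hot

logits = tip_adapter_logits(cache_keys[0], text_weights, cache_keys, cache_values)
```

Because the cache contribution is nonnegative, a test image that matches a cached few-shot example always has the logit of that example's class boosted relative to the zero-shot prediction; this is the mechanism that lets a handful of labeled samples correct the zero-shot classifier without any gradient updates.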