Few-shot Adaptation of Multi-modal Foundation Models: A Survey

4 Jan 2024 | Fan Liu, Tianshu Zhang, Wenwen Dai, Wenwen Cai, Xiaocong Zhou, and Delong Chen
This paper provides a comprehensive survey of few-shot adaptation methods for multi-modal foundation models, such as CLIP, which are increasingly used in various downstream tasks. The authors categorize these methods into three main approaches: prompt-based, adapter-based, and external knowledge-based. They review 11 commonly used datasets and four experimental setups to evaluate the performance of these methods under few-shot conditions. The paper also discusses common shortcomings of existing methods, including ineffective adaptation of upstream and downstream domain distributions, lack of adaptability in model selection, and insufficient utilization of data and knowledge. To address these issues, the authors derive a generalization error bound for multi-modal models, revealing that the generalization error is constrained by domain gap, model capacity, and sample size. Based on this, they propose three solutions: adaptive domain generalization, adaptive model selection, and adaptive knowledge utilization. The paper aims to provide insights and guidance for future research in few-shot adaptation of multi-modal foundation models.
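To make the adapter-based category more concrete, here is a minimal sketch of one common design: a small residual bottleneck adapter placed on top of frozen image features, with class text embeddings acting as the classifier. It is an illustration under stated assumptions, not the survey's own method. Random arrays stand in for the outputs of a frozen CLIP-like encoder, and the names `adapter_forward`, `classify`, `alpha`, and `temperature` are hypothetical. Only the forward pass is shown; in a few-shot setting the two adapter matrices would be the only parameters trained.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def adapter_forward(features, w_down, w_up, alpha=0.2):
    # Bottleneck adapter with a residual blend (assumed design):
    # alpha = 0 returns the frozen features unchanged.
    hidden = np.maximum(features @ w_down, 0.0)   # down-projection + ReLU
    adapted = np.maximum(hidden @ w_up, 0.0)      # up-projection + ReLU
    return alpha * adapted + (1.0 - alpha) * features

def classify(image_features, text_features, temperature=100.0):
    # Cosine-similarity logits between image features and class text features.
    img = l2_normalize(image_features)
    txt = l2_normalize(text_features)
    logits = temperature * img @ txt.T
    return logits.argmax(axis=1), logits

# Synthetic stand-ins for frozen encoder outputs (assumption: no real model is loaded).
rng = np.random.default_rng(0)
dim, n_classes, n_images = 16, 3, 5
text_features = rng.normal(size=(n_classes, dim))
image_features = rng.normal(size=(n_images, dim))

# The only trainable parameters in this sketch.
w_down = rng.normal(scale=0.1, size=(dim, dim // 4))
w_up = rng.normal(scale=0.1, size=(dim // 4, dim))

adapted = adapter_forward(image_features, w_down, w_up, alpha=0.2)
preds, logits = classify(adapted, text_features)
```

The residual weight `alpha` controls how far the adapted features may drift from the pretrained ones, which is one simple way to limit overfitting when only a few labeled samples per class are available.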