14 Jul 2024 | Yuyang Du, Kexin Chen, Yue Zhan, Chang Han Low, Tao You, Mobarakol Islam, Ziyu Guo, Yueming Jin, Guangyong Chen, Pheng Ann Heng
This paper proposes a novel approach to surgical visual question answering (VQA): a multi-teacher continual learning (CL) framework enhanced by a multimodal large language model (LLM). The method addresses two critical challenges in the surgical domain: domain shifts caused by diverse surgical procedures, and severe data imbalance arising from the uneven presence of surgical instruments and activities. The framework leverages a multimodal LLM as an additional teacher to bridge knowledge gaps, together with an adaptive weight assignment scheme that balances the LLM's generalization ability against the domain expertise of the old CL model. It also introduces a data processing method that transforms complex LLM embeddings into logits compatible with the CL framework.

The method is validated through extensive experiments on two newly constructed surgical VQA datasets, which differ substantially from existing ones and provide valuable resources for future research. The results show that the proposed method outperforms other advanced CL schemes in accuracy and F-score across different time periods. Its ability to handle domain shifts and data imbalance is attributed to the LLM, which contributes general medical knowledge, and to the adaptive weight assignment, which balances the expertise of the LLM and the old CL model. The paper also discusses the broader value of LLMs in CL studies and highlights the potential of decomposing representations into spatial and temporal spaces to further alleviate model forgetting.
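The summary does not detail how LLM embeddings are turned into logits, so below is a minimal PyTorch sketch of one plausible realization, assuming a fixed candidate-answer set as in classification-style surgical VQA. The cosine-similarity projection, the `scale` parameter, and the function name are illustrative assumptions, not the paper's actual transformation.

```python
# Hedged sketch: mapping an LLM's answer embedding onto a fixed answer
# vocabulary so it can serve as a soft teacher target in a CL framework.
# The cosine-similarity approach and scale value are assumptions.
import torch
import torch.nn.functional as F

def embeddings_to_logits(llm_answer_emb, candidate_answer_embs, scale=10.0):
    """Project free-form LLM answer embeddings onto candidate answers.

    llm_answer_emb:        (batch, dim) embedding of the LLM's answer
    candidate_answer_embs: (num_classes, dim) embeddings of candidate answers
    Returns (batch, num_classes) logits usable as soft teacher targets.
    """
    q = F.normalize(llm_answer_emb, dim=-1)
    c = F.normalize(candidate_answer_embs, dim=-1)
    # Cosine similarity, sharpened by a temperature-like scale so the
    # softmax over candidates is peaked rather than near-uniform.
    return scale * (q @ c.t())
```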
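To make the multi-teacher idea concrete, here is a hedged PyTorch sketch of a two-teacher distillation loss with per-sample adaptive weights. The confidence-based weighting rule, the temperature, and all names are assumptions for illustration; the paper's adaptive weight assignment may differ.

```python
# Hedged sketch: multi-teacher distillation combining the old CL model
# and an LLM teacher, with a per-sample adaptive weight. All specifics
# (alpha rule, temperature) are illustrative assumptions.
import torch
import torch.nn.functional as F

def multi_teacher_distill_loss(student_logits, old_model_logits, llm_logits,
                               labels, temperature=2.0):
    """Hard-label CE plus adaptively weighted soft targets from two teachers."""
    # Hard-label loss on the current task's ground truth.
    ce = F.cross_entropy(student_logits, labels)

    # Teacher confidences (max softmax probability) drive the weighting:
    # alpha is the weight placed on the old CL model's soft targets.
    old_conf = F.softmax(old_model_logits, dim=-1).max(dim=-1).values
    llm_conf = F.softmax(llm_logits, dim=-1).max(dim=-1).values
    alpha = old_conf / (old_conf + llm_conf + 1e-8)

    # Temperature-scaled KL terms, one per teacher.
    log_p = F.log_softmax(student_logits / temperature, dim=-1)
    kl_old = F.kl_div(log_p, F.softmax(old_model_logits / temperature, dim=-1),
                      reduction="none").sum(dim=-1)
    kl_llm = F.kl_div(log_p, F.softmax(llm_logits / temperature, dim=-1),
                      reduction="none").sum(dim=-1)
    distill = (alpha * kl_old + (1 - alpha) * kl_llm).mean() * temperature ** 2

    return ce + distill
```

In this toy weighting, a teacher that is confident on a given sample pulls the student toward its prediction, which loosely mirrors the described balance between the old CL model's domain expertise and the LLM's general medical knowledge.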