The paper "MISS: A Generative Pre-training and Fine-tuning Approach for Med-VQA" addresses the challenge of Medical Visual Question Answering (Med-VQA), a multimodal task that requires deep and accurate understanding of medical images. The authors propose a novel framework called Multi-task Self-Supervised-learning-based framework (MISS), which treats Med-VQA as a generative task. Unlike existing methods that treat Med-VQA as an answer classification task, MISS unifies the text encoder and multimodal encoder, aligning image-text features through multi-task learning. Additionally, the paper introduces the Transfer-and-Caption (TransCap) method, which extends the feature space of single-modal image datasets using Large Language Models (LLMs), enabling the use of traditional medical vision datasets for VLP models. The authors conduct extensive experiments and compare their method with existing Med-VQA methods, demonstrating its effectiveness and efficiency with fewer multimodal datasets. The code for MISS is available at https://github.com/TIMMY-CHAN/MISS.git.The paper "MISS: A Generative Pre-training and Fine-tuning Approach for Med-VQA" addresses the challenge of Medical Visual Question Answering (Med-VQA), a multimodal task that requires deep and accurate understanding of medical images. The authors propose a novel framework called Multi-task Self-Supervised-learning-based framework (MISS), which treats Med-VQA as a generative task. Unlike existing methods that treat Med-VQA as an answer classification task, MISS unifies the text encoder and multimodal encoder, aligning image-text features through multi-task learning. Additionally, the paper introduces the Transfer-and-Caption (TransCap) method, which extends the feature space of single-modal image datasets using Large Language Models (LLMs), enabling the use of traditional medical vision datasets for VLP models. The authors conduct extensive experiments and compare their method with existing Med-VQA methods, demonstrating its effectiveness and efficiency with fewer multimodal datasets. The code for MISS is available at https://github.com/TIMMY-CHAN/MISS.git.