This paper proposes MISS, a pre-training and fine-tuning framework for medical visual question answering (Med-VQA). Unlike existing methods that treat Med-VQA as an answer-classification task, MISS treats it as a generative task. The framework unifies the text encoder and the multimodal encoder, aligning image and text features through multi-task learning. In addition, a Transfer-and-Caption (TransCap) method is introduced that uses large language models (LLMs) to extend the feature space of single-modal image datasets, enabling data from traditional medical vision tasks to be applied to vision-language pre-training (VLP). The framework achieves excellent results with fewer multimodal datasets and demonstrates the advantages of generative VQA models. The method is evaluated on two Med-VQA benchmarks, VQA-RAD and Slake, showing superior performance on open-ended questions. The results indicate that the JTM encoder and the TransCap method significantly improve the performance of Med-VQA models. The code is available at https://github.com/TIMMY-CHAN/MISS.git.

Keywords: Medical visual question answering · Vision-language pre-training · Multi-modal learning.