15 Mar 2024 | Baoquan Zhang¹, Huaibin Wang¹, Chuyao Luo¹, Xutao Li¹, Guotao Liang¹, Yunming Ye¹, Xiaochen Qi², Yao He²
This paper proposes VQCT, a novel codebook transfer framework for vector-quantized image modeling (VQIM). The core idea is to transfer a well-trained codebook from pretrained language models to VQIM for robust codebook learning. The framework takes a pretrained codebook and part-of-speech knowledge as priors, constructs a vision-related codebook from these priors, and then performs the transfer with a novel graph convolution-based codebook transfer network that exploits the abundant semantic relationships between codes in the pretrained codebook.

The key contributions are: a new perspective on alleviating codebook collapse by transferring codebooks from language models rather than learning them from scratch; the construction of a vision-related codebook using part-of-speech knowledge; and a graph convolution-based codebook transfer network that enables cooperative optimization between codes. Because the transferred codes are updated jointly through the shared transfer network, the codebook collapse issue is alleviated. Experiments on four datasets show that VQCT outperforms previous state-of-the-art methods in image reconstruction quality and codebook learning, demonstrating that pretrained language codebooks can indeed be transferred effectively to VQIM. The framework is generalizable, can be integrated into existing VQIM methods, helps align vision and language, and supports applications such as semantic image synthesis.
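As a concrete illustration of the part-of-speech prior, the sketch below filters a vocabulary down to adjectives and nouns, the word classes most likely to describe visual content. It is a minimal sketch rather than the authors' code: the vocabulary source, the tagger (here NLTK's perceptron tagger), and the exact tag set are assumptions.

```python
# Hypothetical POS filtering step for building a vision-related vocabulary.
# Assumes NLTK's tagger data is installed, e.g.:
#   nltk.download("averaged_perceptron_tagger")  # data name may differ by NLTK version
import nltk

def select_vision_related_words(vocab):
    """Keep adjectives (JJ*) and nouns (NN*); other word classes are discarded."""
    kept = []
    for word in vocab:
        tag = nltk.pos_tag([word])[0][1]      # tag each word in isolation
        if tag.startswith(("JJ", "NN")):
            kept.append(word)
    return kept
```

The embeddings of the kept words, taken from the pretrained language model, would then form the vision-related codebook that is handed to the transfer network.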
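The graph convolution-based transfer itself can be sketched as follows. This is an illustrative reimplementation under stated assumptions, not the paper's exact architecture: the code-to-code affinity matrix is built from cosine similarity between the frozen language codes, the transfer network is a two-layer graph convolution, and quantization is the standard nearest-neighbour lookup with a straight-through estimator.

```python
# Minimal sketch (assumptions labelled above) of codebook transfer plus quantization.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CodebookTransfer(nn.Module):
    """Maps a frozen language codebook to a vision codebook via graph convolution."""

    def __init__(self, lang_codebook: torch.Tensor, code_dim: int):
        super().__init__()
        # The pretrained language codes stay fixed; only the transfer network is
        # trained, so every vision code is updated cooperatively through shared weights.
        self.register_buffer("lang_codebook", lang_codebook)
        n, d = lang_codebook.shape
        self.fc1 = nn.Linear(d, d)
        self.fc2 = nn.Linear(d, code_dim)
        # Row-normalized code-to-code affinity (cosine similarity plus self-loops);
        # this encodes the semantic relationships exploited by the graph convolution.
        sim = F.normalize(lang_codebook, dim=1) @ F.normalize(lang_codebook, dim=1).t()
        adj = sim + torch.eye(n)
        self.register_buffer("adj", adj / adj.sum(dim=1, keepdim=True))

    def forward(self) -> torch.Tensor:
        h = F.relu(self.fc1(self.adj @ self.lang_codebook))
        return self.fc2(self.adj @ h)                 # vision codebook, shape (n, code_dim)


def quantize(z: torch.Tensor, codebook: torch.Tensor):
    """Nearest-neighbour quantization of encoder features z with shape (B, N, D)."""
    dist = torch.cdist(z, codebook.unsqueeze(0).expand(z.size(0), -1, -1))
    idx = dist.argmin(dim=-1)                         # code indices, shape (B, N)
    z_q = codebook[idx]                               # quantized features, shape (B, N, D)
    return z + (z_q - z).detach(), idx                # straight-through estimator


# Usage: transfer a toy 1000-word "language codebook" and quantize random features.
lang_codes = torch.randn(1000, 768)                   # placeholder for LM embeddings
transfer = CodebookTransfer(lang_codes, code_dim=256)
vision_codebook = transfer()
features = torch.randn(4, 64, 256)                    # encoder output (B, N, D)
quantized, indices = quantize(features, vision_codebook)
```

Because gradients from the reconstruction loss flow back through the shared transfer network rather than into isolated code vectors, rarely used codes still receive updates, which is how this kind of design aims to mitigate codebook collapse.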