16 May 2024 | Giordano Cicchetti*, Eleonora Grassucci*, Jihong Park†, Jinho Choi†, Sergio Barbarossa*, and Danilo Comminiello*
This paper introduces a novel framework for semantic image-to-image communication, focusing on delivering the meaning behind the bits by extracting semantic information from raw data. The proposed framework combines both textual captions and compressed image embeddings to reconstruct the intended image using a latent diffusion model. This approach addresses the limitations of previous methods, which often rely solely on text or image embeddings, leading to significant perceptual differences between the original and reconstructed images.
The framework consists of two main components: semantic extraction at the sender side and data reconstruction at the receiver side. At the sender, an image-to-text (I2T) model and an image encoder generate a textual caption and a latent representation of the image, respectively. Both are transmitted over a noisy channel. At the receiver, a latent diffusion model, inspired by Stable Diffusion, regenerates the image from the received caption and latent embedding: it applies a denoising process to the latent vector, conditioned on the caption, to reconstruct the image.
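A minimal end-to-end sketch of this pipeline in Python, assuming off-the-shelf Hugging Face components: BLIP as the I2T captioner, Stable Diffusion's VAE as the image encoder, an additive white Gaussian noise (AWGN) channel, and the img2img pipeline as a stand-in for the receiver's text-conditioned denoiser. The model identifiers, SNR, and denoising strength below are illustrative assumptions, not the authors' exact configuration.

```python
# Hypothetical sketch of the sender/channel/receiver chain; model choices
# and parameters are illustrative assumptions, not the paper's exact setup.
import torch
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration
from diffusers import StableDiffusionImg2ImgPipeline

device = "cuda" if torch.cuda.is_available() else "cpu"

# --- Sender: semantic extraction ---
blip_id = "Salesforce/blip-image-captioning-base"
processor = BlipProcessor.from_pretrained(blip_id)
captioner = BlipForConditionalGeneration.from_pretrained(blip_id).to(device)

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5").to(device)

image = Image.open("input.jpg").convert("RGB").resize((512, 512))

# Image-to-text: generate the caption to transmit.
inputs = processor(image, return_tensors="pt").to(device)
caption = processor.decode(captioner.generate(**inputs)[0],
                           skip_special_tokens=True)

# Image encoder: compress the image into the 4 x 64 x 64 VAE latent.
with torch.no_grad():
    pixels = pipe.image_processor.preprocess(image).to(device)
    latent = pipe.vae.encode(pixels).latent_dist.sample()
    latent = latent * pipe.vae.config.scaling_factor

# --- Channel: AWGN on the latent (10 dB SNR, an assumed operating point) ---
snr_db = 10.0
noise_std = latent.std() * 10 ** (-snr_db / 20)
rx_latent = latent + noise_std * torch.randn_like(latent)

# --- Receiver: regenerate the image, conditioned on the caption ---
with torch.no_grad():
    decoded = pipe.vae.decode(rx_latent / pipe.vae.config.scaling_factor).sample
rx_image = pipe.image_processor.postprocess(decoded, output_type="pil")[0]

# img2img re-noises the received content to an intermediate timestep and
# denoises it back under text guidance; `strength` trades fidelity to the
# received latent against reliance on the caption.
reconstruction = pipe(prompt=caption, image=rx_image, strength=0.3).images[0]
reconstruction.save("reconstructed.jpg")
```

In this sketch the text and the latent play complementary roles: the caption supplies coarse semantics that survive channel noise, while the latent preserves layout and appearance details that a caption alone cannot carry.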
Experimental results on the Flickr 8k dataset demonstrate that the proposed method achieves higher perceptual similarity over noisy communication channels than a baseline that communicates through text alone. The framework is also bandwidth-efficient, transmitting only 2.09% of the original image size while maintaining high-quality reconstructions. Future work could focus on reducing the dimensionality of the latent embeddings and leveraging large language models to further improve efficiency.
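A back-of-envelope check of the 2.09% figure, assuming the standard Stable Diffusion configuration (a 4 x 64 x 64 latent for a 512 x 512 RGB image) and equal bits per element; whether the paper's accounting also includes the caption's bytes is an assumption here, not stated in the summary.

```python
# Element-count ratio between the transmitted latent and the raw image,
# assuming SD's 4 x 64 x 64 latent for a 512 x 512 RGB input and equal
# bits per element (an assumption; the paper's accounting may differ).
latent_elements = 4 * 64 * 64      # 16,384 latent values
image_elements = 512 * 512 * 3     # 786,432 pixel values
print(f"{latent_elements / image_elements:.2%}")  # 2.08%, near the reported 2.09%
```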