This paper proposes a novel method for open-domain visual-audio generation using diffusion latent aligners. The approach bridges pre-trained single-modality diffusion models through a shared latent representation space, leveraging the ImageBind model to align the visual and audio modalities. It supports four tasks: joint video-audio generation, video-to-audio generation, audio-to-video generation, and image-to-audio generation. At each denoising step, the diffusion latent aligner injects guidance derived from the other modality, steering the latent toward stronger alignment between the generated content and the input condition. Because the aligner operates on frozen pre-trained models, the approach requires no training on large-scale datasets, making it resource-efficient.

The method is validated on all four tasks and outperforms baseline approaches. The key contributions are the introduction of a diffusion latent aligner for multimodal alignment, a versatile generation paradigm built from off-the-shelf single-modality models, and the first approach to text-guided joint video-audio generation. The resulting audio-visual generations exhibit high quality, strong semantic alignment, and temporal consistency.
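To make the guidance mechanism concrete, below is a minimal PyTorch sketch of one aligner-guided denoising update. It illustrates gradient-based latent guidance under stated assumptions, not the authors' exact update rule: `decode` and `embed` are hypothetical stand-ins for the frozen latent decoder and an ImageBind-style encoder, and `cond_emb` is the embedding of the conditioning modality (e.g., the input video for video-to-audio generation).

```python
import torch

def aligner_guidance_step(z_t, eps_pred, cond_emb, decode, embed,
                          alpha_bar_t, guidance_scale=1.0):
    """One cross-modal guidance update on a noisy diffusion latent z_t.

    decode(x0_hat) -> sample in pixel/audio space (hypothetical helper)
    embed(sample)  -> ImageBind-style embedding   (hypothetical helper)
    cond_emb       -> embedding of the conditioning modality
    alpha_bar_t    -> cumulative noise schedule coefficient at step t
    """
    z_t = z_t.detach().requires_grad_(True)

    # Estimate the clean latent x0 from the noise prediction (treating
    # eps_pred as constant w.r.t. z_t -- a common guidance simplification).
    x0_hat = (z_t - (1 - alpha_bar_t) ** 0.5 * eps_pred.detach()) \
             / alpha_bar_t ** 0.5

    # Embed the decoded estimate and score its alignment with the condition.
    emb = embed(decode(x0_hat))
    score = torch.cosine_similarity(emb, cond_emb, dim=-1).mean()

    # Ascend the alignment score: nudge the latent toward the other modality.
    grad = torch.autograd.grad(score, z_t)[0]
    return (z_t + guidance_scale * grad).detach()
```

For the joint video-audio task, the same idea would presumably be applied symmetrically, updating both modalities' latents against each other's embeddings at every step; since the update only backpropagates through frozen encoders, no generator weights are trained.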