The paper introduces a novel benchmark, T2AV-BENCH, for text-to-audio (TTA) generation aligned with videos, and presents three new metrics, Fréchet Audio-Visual Distance (FAVD), Fréchet Audio-Text Distance (FATD), and Fréchet Audio-(Video-Text) Distance (FAVTD), to evaluate visual alignment and temporal consistency. To address the challenge of keeping generated audio synchronized with the video content, the authors propose T2AV, a simple yet effective latent diffusion model that takes visual-aligned text embeddings as conditional inputs. T2AV employs a temporal multi-head attention transformer to extract temporal cues from the video and an Audio-Visual ControlNet to merge the visual representations with the text embeddings. Extensive experiments on the AudioCaps dataset and T2AV-BENCH show that T2AV outperforms previous baselines on all metrics, generating high-fidelity audio that remains aligned with the video. The paper also includes ablation studies on several aspects of the proposed method, validating the importance of the visual-aligned CLAP, the Audio-Visual ControlNet, the training data scale, and latent diffusion tuning.
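
To make the evaluation protocol concrete, below is a minimal sketch of a Fréchet-style distance between two embedding sets, assuming the proposed FAVD/FATD/FAVTD follow the standard Fréchet Audio Distance construction: fit a Gaussian to each embedding set (e.g. generated-audio embeddings vs. reference video or text embeddings) and compare the two Gaussians. The choice of embedding extractors and the stand-in shapes are assumptions of this sketch, not details confirmed by the paper.

```python
# Minimal Frechet-distance sketch (assumed construction, not the paper's exact metric code).
import numpy as np
from scipy import linalg


def frechet_distance(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Frechet distance between Gaussians fitted to two [N, D] embedding sets."""
    mu_a, mu_b = emb_a.mean(axis=0), emb_b.mean(axis=0)
    cov_a = np.cov(emb_a, rowvar=False)
    cov_b = np.cov(emb_b, rowvar=False)

    # Matrix square root of the covariance product; drop tiny imaginary parts
    # introduced by numerical error.
    covmean = linalg.sqrtm(cov_a @ cov_b)
    if np.iscomplexobj(covmean):
        covmean = covmean.real

    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a + cov_b - 2.0 * covmean))


# Example with random stand-in embeddings (hypothetical shapes).
gen_audio_emb = np.random.randn(256, 128)   # embeddings of generated audio clips
ref_video_emb = np.random.randn(256, 128)   # embeddings of the paired videos
print(frechet_distance(gen_audio_emb, ref_video_emb))
```

Under this reading, FAVD compares generated-audio embeddings against video embeddings, FATD against text embeddings, and FAVTD against a joint video-text representation; the exact encoders and pairing are defined in the paper.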
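The conditioning path can likewise be illustrated with a short, purely hypothetical PyTorch sketch: per-frame video features pass through a temporal multi-head self-attention block, are pooled over time, and are fused with the text embedding before conditioning the latent diffusion model. The module names, dimensions, additive fusion, and mean pooling are illustration-only assumptions; the paper's actual Audio-Visual ControlNet and visual-aligned CLAP are not reproduced here.

```python
# Illustrative-only sketch of visual-text conditioning for a latent diffusion
# TTA model; dimensions and fusion are assumptions, not the T2AV architecture.
import torch
import torch.nn as nn


class TemporalVisualTextCondition(nn.Module):
    def __init__(self, video_dim: int = 512, text_dim: int = 512, n_heads: int = 8):
        super().__init__()
        # Temporal multi-head self-attention over per-frame video features.
        self.temporal_attn = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=video_dim, nhead=n_heads, batch_first=True),
            num_layers=2,
        )
        # Project pooled video features into the text embedding space and fuse
        # additively (a stand-in for the paper's Audio-Visual ControlNet).
        self.video_to_text = nn.Linear(video_dim, text_dim)

    def forward(self, video_feats: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # video_feats: [B, T, video_dim] per-frame features; text_emb: [B, text_dim].
        temporal = self.temporal_attn(video_feats)     # [B, T, video_dim]
        pooled = temporal.mean(dim=1)                  # temporal average pooling
        return text_emb + self.video_to_text(pooled)   # fused conditioning vector


# Usage with random stand-in features.
cond = TemporalVisualTextCondition()
video_feats = torch.randn(4, 32, 512)   # 4 clips, 32 frames each (hypothetical)
text_emb = torch.randn(4, 512)          # visual-aligned text embeddings (assumed)
conditioning = cond(video_feats, text_emb)
print(conditioning.shape)               # torch.Size([4, 512])
```

In the paper, this fused representation would play the role of the conditional input to the diffusion model's cross-attention layers; the sketch only shows where the temporal attention and visual-text fusion sit relative to each other.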