The paper introduces ReWaS, a method for generating sound from text and video inputs. ReWaS targets two problems: text-to-audio models struggle to produce comprehensive, temporally aligned audio from text alone, and video-to-sound methods have difficulty prioritizing specific objects within a scene. The method uses video as a conditional control for a text-to-audio generation model, with energy estimated from the video complementing the audio structure. This allows more flexible control over the energy, environment, and primary sound source, and improves the quality, controllability, and training efficiency of the generated audio. Experiments on the VGGSound and GreatestHits datasets show that ReWaS achieves superior results in fidelity, structure prediction, and human evaluation metrics. The method also achieves effective temporal alignment and captures complex visual dynamics, making it practical for applications such as sound effects creation and post-production in filmmaking.
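
To make the idea of an energy control signal concrete, here is a minimal sketch in Python. The summary does not say how energy is defined or estimated, so this is an illustration under stated assumptions rather than the paper's procedure: energy is taken to be the mean magnitude of each time frame of a spectrogram-like array, followed by a moving-average smoothing step. The sketch computes energy from audio features, while ReWaS estimates it from video. The names `frame_energy` and `moving_average` are hypothetical.

```python
from typing import List, Sequence


def frame_energy(spectrogram: Sequence[Sequence[float]]) -> List[float]:
    """Mean magnitude per time frame (assumed definition of energy).

    spectrogram[t] holds the frequency-bin values of frame t.
    """
    return [sum(abs(v) for v in frame) / len(frame) for frame in spectrogram]


def moving_average(signal: Sequence[float], window: int = 3) -> List[float]:
    """Centered moving average; the window shrinks at the sequence edges."""
    half = window // 2
    smoothed = []
    for i in range(len(signal)):
        lo, hi = max(0, i - half), min(len(signal), i + half + 1)
        smoothed.append(sum(signal[lo:hi]) / (hi - lo))
    return smoothed


# Toy "spectrogram": 5 frames x 4 frequency bins, with a loud event at frame 2.
toy = [
    [0.0, 0.0, 0.0, 0.0],
    [0.1, 0.1, 0.1, 0.1],
    [1.0, 1.0, 1.0, 1.0],
    [0.1, 0.1, 0.1, 0.1],
    [0.0, 0.0, 0.0, 0.0],
]
energy = frame_energy(toy)                    # one value per frame
smoothed = moving_average(energy, window=3)   # same length, softer peak
```

A one-dimensional curve like `smoothed` is the kind of temporal signal that could tell a generator when the sound should be loud or quiet, while the text prompt determines what the sound is.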