[slides] Slicedit: Zero-Shot Video Editing With Text-to-Image Diffusion Models Using Spatio-Temporal Slices

Slicedit is a zero-shot video editing method that uses a pretrained text-to-image diffusion model to edit videos according to text prompts. The method processes both spatial slices (individual frames) and spatiotemporal slices of the video. Because spatiotemporal slices share characteristics with natural images, the same pretrained image denoiser can be applied to them to enhance temporal consistency. An extended attention mechanism additionally processes multiple frames together, which further promotes temporal consistency. With this design, Slicedit can edit specific regions of a video while preserving the rest.

Slicedit has been tested on a variety of real-world videos and maintains temporal consistency even in cases with strong nonrigid motion or occlusions. Quantitative metrics and a user study indicate that it outperforms existing zero-shot video editing techniques at preserving the original structure and motion of the video while adhering to the text prompt. Slicedit is limited to structure-preserving edits: it cannot make changes that alter an object's shape, such as turning a dog into an elephant.
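To make the notion of a spatiotemporal slice concrete, the following is a minimal sketch of how spatial and spatiotemporal slices can be cut from a video stored as nested lists indexed `[t][y][x]`. The function names (`spatial_slice`, `xt_slice`, `yt_slice`), the axis layout, and the toy video are assumptions made for illustration. They are not taken from the Slicedit implementation, and the diffusion model itself is not shown.

```python
def spatial_slice(video, t):
    """Frame at time t: an H x W image indexed [y][x]."""
    return video[t]


def xt_slice(video, y):
    """Spatiotemporal slice at fixed row y: a T x W image indexed [t][x]."""
    return [list(frame[y]) for frame in video]


def yt_slice(video, x):
    """Spatiotemporal slice at fixed column x: an H x T image indexed [y][t]."""
    height = len(video[0])
    return [[frame[y][x] for frame in video] for y in range(height)]


# Toy grayscale video (T=4, H=3, W=5): one bright pixel moving right along row 1.
T, H, W = 4, 3, 5
video = [
    [[1 if (y == 1 and x == t) else 0 for x in range(W)] for y in range(H)]
    for t in range(T)
]

# In the x-t slice through row 1, the moving pixel traces a diagonal line.
for row in xt_slice(video, y=1):
    print(row)
# [1, 0, 0, 0, 0]
# [0, 1, 0, 0, 0]
# [0, 0, 1, 0, 0]
# [0, 0, 0, 1, 0]
```

In this toy case, motion over time shows up as an oriented edge in the slice. This is one informal way to see why such slices can resemble ordinary images closely enough for an image denoiser to be applied to them.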

Slicedit: Zero-Shot Video Editing With Text-to-Image Diffusion Models Using Spatio-Temporal Slices

2024 | Nathaniel Cohen, Vladimir Kulikov, Matan Kleiner, Inbar Huberman-Spiegelglas, Tomer Michaeli