This paper introduces a new task, language-driven video inpainting, which uses natural language instructions to guide the inpainting process. The approach overcomes a key limitation of traditional video inpainting methods, which rely on manually labeled binary masks whose annotation is tedious and labor-intensive. The authors present the ROVI dataset, containing 5,650 videos and 9,091 inpainting results, to support training and evaluation for this task. They also propose a novel diffusion-based language-driven video inpainting framework, the first end-to-end baseline for this task, which integrates multimodal large language models (MLLMs) to understand and execute complex language-based inpainting requests. Comprehensive experiments demonstrate the dataset's versatility and the model's effectiveness across a range of language-instructed inpainting scenarios. The dataset, code, and models are publicly available at https://github.com/jianzongwu/LanguageDriven-Video-Inpainting.
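To make the dataset's structure concrete, the sketch below shows what a single ROVI sample might look like as a Python record. The field names (`original_frames`, `removal_expression`, `inpainted_frames`) are assumptions based on the paper's description of the dataset contents, not the released dataset's actual schema.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ROVISample:
    """Hypothetical record for one ROVI sample.

    Field names are illustrative guesses based on the paper's description
    (original video, removal expression, inpainted video); consult the
    released dataset for the real schema.
    """
    video_id: str
    original_frames: List[str]   # paths to the source video frames
    removal_expression: str      # natural-language removal request
    inpainted_frames: List[str]  # paths to the ground-truth inpainted frames

# Example usage with made-up paths and text:
sample = ROVISample(
    video_id="0001",
    original_frames=["videos/0001/frame_000.jpg", "videos/0001/frame_001.jpg"],
    removal_expression="remove the man in the red jacket",
    inpainted_frames=["inpainted/0001/frame_000.jpg", "inpainted/0001/frame_001.jpg"],
)
```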
The paper discusses the challenges of traditional video inpainting methods, which depend on manually annotated binary masks; while automatic labeling tools can help, their output often requires manual refinement. The authors propose ROVI, a dataset pairing original videos with removal expressions and ground-truth inpainted videos. They also introduce LGVI, a model built on diffusion-based generative models, and extend it to LGVI-I (Interactive) to handle more complex, conversation-like interactions. Trained end to end, the model interprets complex instructions and produces both the appropriate inpainting results and relevant responses within the conversational context.
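As a rough illustration of how a diffusion-based, language-guided inpainting model like LGVI might operate at inference time, the following minimal sketch runs a text-conditioned denoising loop over video latents. All component names (`text_encoder`, `unet`, `scheduler`) are placeholders with an interface assumed to mimic common diffusion libraries; the real LGVI architecture, including its MLLM integration and temporal modules, is more involved and is not specified here.

```python
import torch

@torch.no_grad()
def language_guided_inpaint(frames, instruction, text_encoder, unet, scheduler,
                            num_steps: int = 50):
    """Minimal sketch of text-conditioned diffusion denoising for video
    inpainting. `text_encoder`, `unet`, and `scheduler` are placeholder
    components (e.g. a CLIP-style encoder, a video U-Net, and a DDIM-style
    scheduler); this is not the authors' actual pipeline.

    frames: (B, T, C, H, W) tensor of video latents to be inpainted.
    instruction: removal request, e.g. "remove the dog on the left".
    """
    # Encode the language instruction into conditioning embeddings.
    cond = text_encoder(instruction)

    # Start from Gaussian noise with the same shape as the video latents.
    latents = torch.randn_like(frames)

    scheduler.set_timesteps(num_steps)
    for t in scheduler.timesteps:
        # Predict the noise residual, conditioned on the instruction and the
        # original frames (concatenated along the channel axis here, purely
        # as one plausible conditioning choice).
        model_input = torch.cat([latents, frames], dim=2)
        noise_pred = unet(model_input, t, encoder_hidden_states=cond)

        # One reverse-diffusion step toward the clean, inpainted latents.
        latents = scheduler.step(noise_pred, t, latents).prev_sample

    return latents  # a full pipeline would decode these with a VAE decoder
```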
The paper also reviews related work in video inpainting and language-driven visual editing, highlighting the limitations of existing methods and the potential of diffusion-based models for image and video editing. The proposed dataset and model leverage MLLMs to improve language guidance for interactive video inpainting. Experimental results show that the proposed model outperforms existing methods on a range of metrics, demonstrating its effectiveness and robustness. The paper concludes that the benchmark and baselines can provide valuable insights for multimodal models in low-level vision, and it discusses remaining challenges such as the model's scalability and generalization.