The paper introduces a novel task, language-driven video inpainting, which uses natural language instructions to guide the inpainting process and thereby removes the dependence of traditional methods on manually labeled binary masks. The authors present the Remove Objects from Videos by Instructions (ROVI) dataset, containing 5,650 videos and 9,091 inpainting results, to support training and evaluation. They propose a diffusion-based language-driven video inpainting framework, LGVI, and its interactive extension, LGVI-I, which integrates Multimodal Large Language Models (MLLMs) to understand and execute complex language-based requests. The results demonstrate the dataset's versatility and the model's effectiveness across various language-instructed inpainting scenarios. The key contributions are the language-driven video inpainting task, the ROVI dataset, and the LGVI and LGVI-I models. The paper also discusses related work, dataset construction, experimental settings, and future directions, highlighting potential social impacts and open challenges in handling ambiguity, real-time processing, and scalability.