21 Feb 2024 | Peter Schaldenbrand, Gaurav Parmar, Jun-Yan Zhu, James McCann, and Jean Oh
CoFRIDA: Self-Supervised Fine-Tuning for Human-Robot Co-Painting
**Abstract:**
This paper introduces CoFRIDA, a collaborative framework for human-robot co-painting that addresses the limitations of existing systems such as FRIDA. CoFRIDA enables the robot to modify and engage with content already painted by a human collaborator, using pre-trained text-to-image models to improve text-image alignment. These models, however, often perform poorly in real-world co-painting because they do not account for the robot's constraints and capabilities. To overcome this, CoFRIDA proposes a self-supervised fine-tuning procedure that adapts pre-trained models to generate content within the robot's capabilities and to perform co-painting. This approach reduces the semantic Sim2Real gap, allowing CoFRIDA to create paintings that better match the input text prompt, whether starting from a blank canvas or from one containing human-created work. The open-source CoFRIDA system demonstrates promising results on co-painting tasks, narrowing the Sim2Real gap and enabling human-robot collaborative art creation.
**Introduction:**
Recent advancements in text-to-image synthesis have sparked interest in using these technologies for digital content generation, including art creation with robots. While FRIDA, a robotic framework for painting, allows users to input language descriptions or images, it lacks the ability to engage with existing content, limiting its co-creative potential. CoFRIDA addresses this by enabling the robot to add content to an existing canvas, guided by the user's initial prompt. This co-painting task is distinct from image editing problems like in-painting, as it requires preserving and engaging with the full canvas rather than making radical changes.
**Method:**
CoFRIDA consists of three main components: the Co-Painting Module, FRIDA, and a self-supervised fine-tuning method. The Co-Painting Module generates images of how the robot should complete a painting based on the current canvas and a text description. FRIDA plans actions to guide the robot to achieve the desired painting. The self-supervised fine-tuning method creates training data by simulating paintings using FRIDA and selectively removing strokes to form partial paintings. This data is then used to fine-tune a pre-trained model, such as Instruct-Pix2Pix, to generate content that aligns with the robot's capabilities.
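The stroke-removal idea behind the training data can be sketched as follows. This is a minimal illustration rather than FRIDA's actual renderer: strokes are simplified to filled circles on a NumPy canvas, and `render` and `make_pair` are hypothetical helpers that produce a (partial painting, full painting) pair by dropping a random subset of strokes from a simulated stroke plan.

```python
import numpy as np

rng = np.random.default_rng(0)

def render(strokes, size=64):
    """Rasterize (x, y, radius, intensity) strokes onto a white canvas."""
    canvas = np.ones((size, size), dtype=np.float32)
    yy, xx = np.mgrid[0:size, 0:size]
    for x, y, r, v in strokes:
        mask = (xx - x) ** 2 + (yy - y) ** 2 <= r ** 2
        canvas[mask] = v  # later strokes paint over earlier ones
    return canvas

def make_pair(strokes, keep_frac=0.5):
    """Return (partial, full) canvases: the partial painting keeps only a
    random subset of strokes, mimicking the self-supervised pairing above."""
    n_keep = max(1, int(len(strokes) * keep_frac))
    kept = rng.choice(len(strokes), size=n_keep, replace=False)
    partial = render([strokes[i] for i in sorted(kept)])
    full = render(strokes)
    return partial, full

# A toy simulated stroke plan: 20 dark circular daubs at random positions.
strokes = [(rng.integers(8, 56), rng.integers(8, 56), 4, 0.0) for _ in range(20)]
partial, full = make_pair(strokes)
```

Each (partial, full) pair, together with the text prompt used to simulate the painting, then serves as one training example for the fine-tuned editing model: given the partial canvas and the prompt, predict the completed painting.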
**Evaluation:**
Experiments show that CoFRIDA outperforms FRIDA and its baseline methods in terms of text-image alignment and semantic similarity. User preference studies conducted with Amazon Mechanical Turk (MTurk) participants indicate that CoFRIDA's outputs match the text prompts more closely than those of FRIDA or the un-fine-tuned model. Additionally, CoFRIDA can handle multiple turns of interaction, accommodating iterative changes without completely overwriting the human's prior work.
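Text-image alignment of the kind reported here is typically scored as the cosine similarity between embeddings of the prompt and a photo of the canvas (CLIP-score style). A minimal sketch of the metric itself, where the embedding vectors below are hypothetical stand-ins for the outputs of a real text/image encoder such as CLIP:

```python
import numpy as np

def cosine_alignment(text_vec, image_vec):
    """Cosine similarity between two embedding vectors (CLIP-score style)."""
    t = text_vec / np.linalg.norm(text_vec)
    v = image_vec / np.linalg.norm(image_vec)
    return float(t @ v)

# Hypothetical embeddings standing in for encoder outputs.
prompt_vec = np.array([0.2, 0.9, 0.1])
canvas_vec = np.array([0.25, 0.85, 0.05])   # painting matching the prompt
unrelated_vec = np.array([-0.9, 0.1, 0.4])  # painting of something else
```

A higher score for the matching canvas than for the unrelated one is what "better text-image alignment" means in the comparisons above.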
**Results:**
CoFRIDA successfully co-paints with various media, including markers and paintbrushes, and can incorporate content already present on the canvas into its plans.