16 Jan 2024 | Zhixuan Liu, Peter Schaldenbrand, Beverley-Claire Okogwu, Wenxuan Peng, Youngsik Yun, Andrew Hundt, Jihie Kim, Jean Oh
SCoFT: Self-Contrastive Fine-Tuning for Equitable Image Generation
This paper introduces SCoFT, a self-contrastive fine-tuning method that aims to improve cultural representation and reduce harmful stereotypes in image generation. To address biased image generation, the authors collect a culturally representative dataset called CCUB, which includes images and captions from five different cultures. SCoFT leverages the model's known biases to self-improve, which helps prevent overfitting on small datasets and shifts the generated distribution away from misrepresentations encoded in the pretrained model.
The study shows that fine-tuning on CCUB consistently generates images with higher cultural relevance and fewer stereotypes than the Stable Diffusion baseline, and that adding SCoFT improves these results further. A user study with 51 participants from five different countries indicates that SCoFT significantly reduces offensiveness and increases the cultural relevance of generated images.
The authors also propose computing a perceptual loss on decoded images, together with a method that contrastively uses Stable Diffusion's own misrepresentations of culture to refine the model. Their Self-Contrastive Perceptual Loss treats images produced by the pretrained model as negative examples and images from the curated dataset as positive examples, pushing the fine-tuned model toward the positive distribution and away from the negative one.
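The contrastive idea above can be illustrated with a minimal sketch. It assumes a triplet-margin form of the loss and plain feature vectors standing in for perceptual features of decoded images. The function names, the `margin` parameter, and the use of Euclidean distance are hypothetical choices for illustration, not the authors' implementation.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def self_contrastive_loss(generated, positive, negatives, margin=1.0):
    """Triplet-style loss (an assumed form, not taken from the paper).

    generated: features of the image produced by the model being fine-tuned
    positive:  features of a curated dataset image (positive example)
    negatives: features of images from the pretrained model (negative examples)
    """
    d_pos = l2_distance(generated, positive)
    # Average distance to the negatives sampled from the pretrained model.
    d_neg = sum(l2_distance(generated, n) for n in negatives) / len(negatives)
    # Zero once the output is closer to the positive than to the
    # negatives by at least `margin`; positive otherwise.
    return max(0.0, d_pos - d_neg + margin)


# Output matches the positive and is far from the negative: no penalty.
close_to_positive = self_contrastive_loss([0.0, 0.0], [0.0, 0.0], [[3.0, 4.0]])
# Output matches the negative and is far from the positive: large penalty.
close_to_negative = self_contrastive_loss([3.0, 4.0], [0.0, 0.0], [[3.0, 4.0]])
```

In this sketch the loss only becomes zero when the generated features sit closer to the positive example than to the pretrained model's outputs, which mirrors the stated goal of moving away from the pretrained model's misrepresentations.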
The CCUB dataset consists of 150-200 images for each of five cultures, collected by people who self-selected as affiliated with that culture. The dataset is used to fine-tune the pretrained model, and the fine-tuning approach is designed to encode high-level information into a pretrained model using small datasets.
The authors evaluate their results using a user survey and automatic metrics, showing that SCoFT significantly improves the cultural representation and reduces harmful stereotypes in image generation. The study highlights the importance of accurate cultural representation in AI-generated imagery and proposes SCoFT as a promising technique for achieving this goal.