6 Aug 2024 | Dongfu Jiang*, Max Ku*, Tianle Li*, Yuansheng Ni, Shizhuo Sun, Rongqi Fan, Wenhu Chen
**GenAI Arena: An Open Evaluation Platform for Generative Models**
Dongfu Jiang, Max Ku, Tianle Li, Yuansheng Ni, Shizhuo Sun, Rongqi Fan, Wenhu Chen
University of Waterloo
{dongfu.jiang, m3ku, t291i, wenhuchen}@uwaterloo.ca
https://hf.co/spaces/TIGER-Lab/GenAI-Arena
**Abstract**
Generative AI has made significant strides in fields like image and video generation, driven by innovative algorithms, architectures, and data. However, the rapid proliferation of generative models has highlighted a critical gap: the lack of trustworthy evaluation metrics. Current automatic assessments, such as FID, CLIP, and FVD, often fail to capture nuanced quality and user satisfaction. This paper introduces GenAI-Arena, an open platform for evaluating image and video generative models, where users can actively participate in evaluating these models. By leveraging collective user feedback and votes, GenAI-Arena aims to provide a more democratic and accurate measure of model performance. It covers three arenas: text-to-image generation, text-to-video generation, and image editing. Currently, 27 open-source generative models are covered. GenAI-Arena has been operational for four months, amassing over 6000 votes from the community. The platform includes an Elo rating system to rank models based on user preferences. To promote research in building model-based evaluation metrics, GenAI-Arena releases GenAI-Bench, a cleaned version of preference data for three tasks. The accuracy of existing multimodal models like GPT-4o in assessing visual content is low, achieving only 49.19% accuracy across the three tasks.
**Introduction**
Image and video generation technologies have seen rapid advancements, leading to widespread applications in various domains. However, navigating and assessing the performance of numerous models remains challenging. Traditional evaluation metrics like PSNR, SSIM, LPIPS, and FID offer specific insights but often fall short in providing a comprehensive assessment of overall model performance, especially in subjective qualities like aesthetics and user satisfaction.
To address these challenges, GenAI-Arena is designed to enable fair and interactive evaluation. It offers a dynamic platform where users can generate images, compare them side-by-side, and vote for their preferred models. This platform simplifies the comparison process and provides a ranking system reflecting human preferences, offering a holistic evaluation of model capabilities. GenAI-Arena is the first platform with comprehensive evaluation capabilities across multiple properties, supporting tasks like text-to-image generation, text-guided image editing, and text-to-video generation.
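To make the comparison workflow concrete, here is a minimal sketch of how an anonymous pairwise "battle" might be orchestrated: two models are sampled at random, both generate from the same prompt, and their identities are revealed only after the user votes. The model registry and the `generate` / `record_vote` callables are hypothetical placeholders, not GenAI-Arena's actual API.

```python
import random

# Hypothetical model registry; GenAI-Arena's real model list differs.
MODELS = ["model_a", "model_b", "model_c"]

def run_battle(prompt: str, generate, record_vote):
    """One anonymous side-by-side battle (sketch).

    `generate(model, prompt)` and `record_vote(prompt, outputs)` are placeholder
    callables standing in for the generation backend and the voting UI.
    """
    left, right = random.sample(MODELS, 2)         # anonymous random pairing
    outputs = {"left": generate(left, prompt),     # the user only sees "left"/"right"
               "right": generate(right, prompt)}
    vote = record_vote(prompt, outputs)            # e.g. "left", "right", or "tie"
    # Model identities are revealed only after the vote has been recorded.
    return {"vote": vote, "left_model": left, "right_model": right}
```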
**Design and Implementation**
GenAI-Arena is structured around three primary tasks: text-to-image generation, image editing, and text-to-video generation. Each task includes features such as anonymous side-by-side voting, a battle playground, a direct generation tab, and a leaderboard. The platform ensures fair comparison by keeping model identities anonymous until users have cast their votes.
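As a rough illustration of how a leaderboard can be derived from such votes, the following is a minimal sketch of a standard Elo update over pairwise outcomes. The base rating of 1000, the K-factor of 32, and the example vote records are assumptions made for illustration; the exact rating procedure used by GenAI-Arena is not reproduced here.

```python
from collections import defaultdict

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update_elo(ratings, model_a, model_b, outcome, k=32):
    """Update both ratings from one vote; outcome is 1.0 (A wins), 0.0 (B wins), or 0.5 (tie)."""
    e_a = expected_score(ratings[model_a], ratings[model_b])
    ratings[model_a] += k * (outcome - e_a)
    ratings[model_b] += k * ((1.0 - outcome) - (1.0 - e_a))

# Hypothetical vote records of the form (model_a, model_b, outcome).
ratings = defaultdict(lambda: 1000.0)
votes = [("SDXL", "PixArt-alpha", 1.0), ("PixArt-alpha", "SDXL", 0.5)]
for a, b, outcome in votes:
    update_elo(ratings, a, b, outcome)
print(dict(ratings))
```

Sorting models by their resulting ratings yields a leaderboard that reflects accumulated user preferences.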