Benchmarking Large Multimodal Models against Common Corruptions


22 Jan 2024 | Jiawei Zhang, Tianyu Pang, Chao Du, Yi Ren, Bo Li, Min Lin
This technical report introduces MMCBench, a comprehensive benchmark for evaluating the self-consistency of large multimodal models (LMMs) under common corruptions. The benchmark covers over 100 popular LMMs across more than 150 model checkpoints, focusing on four key generative tasks: text-to-image, image-to-text, text-to-speech, and speech-to-text. The evaluation applies common corruptions to the input modalities and measures the consistency of the generated outputs. The benchmarking code is available at https://github.com/sail-sg/MMCBench.

The study investigates cross-modal interactions between text, image, and speech, emphasizing the robustness and self-consistency of LMMs under different corruption scenarios. The benchmark includes 23 text corruptions, 29 image corruptions, and 16 speech corruptions. For text-to-image generation, 27 models across 37 checkpoints are evaluated; for image-to-text, 39 models across 58 checkpoints; for text-to-speech, 14 models across 15 checkpoints; and for speech-to-text, 41 models across 47 checkpoints.

The evaluation methodology calculates the average cosine similarity between the original captions and the generated images, or between the original and generated transcriptions, under the different corruption conditions; a sketch of this computation is given after this summary. The results show that models with larger LLMs do not necessarily achieve higher consistency, and some smaller models perform as well as or better than larger ones. The study also highlights the importance of text as a semantic anchor for evaluating cross-modal consistency, and the need for a comprehensive benchmark to assess the reliability of LMMs in practical applications. The findings contribute to the broader field of multimodal model evaluation by providing a detailed analysis of the resilience of various models under different corruption scenarios.
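To make the consistency metric concrete, the following is a minimal sketch of the text-side computation: apply a toy corruption to the input text, then score the model output against the original by the average cosine similarity of their embeddings. The character-swap corruption, the sentence-transformers encoder (all-MiniLM-L6-v2), and the example strings are illustrative assumptions, not the MMCBench implementation; the report's own pipeline, corruption set, and embedding models may differ.

# Minimal sketch of the consistency measurement described above.
# Assumptions (not taken from the report): a sentence-transformers text
# encoder as a stand-in, a toy character-swap corruption, and dummy
# caption strings standing in for real model outputs.

import random
import numpy as np
from sentence_transformers import SentenceTransformer


def corrupt_text(text: str, swap_prob: float = 0.1, seed: int = 0) -> str:
    """Toy 'typo' corruption: randomly swap adjacent characters."""
    rng = random.Random(seed)
    chars = list(text)
    i = 0
    while i < len(chars) - 1:
        if rng.random() < swap_prob:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2
        else:
            i += 1
    return "".join(chars)


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def self_consistency(originals: list[str], outputs: list[str],
                     encoder: SentenceTransformer) -> float:
    """Average cosine similarity between embeddings of the original texts
    (captions or transcriptions) and the outputs produced from corrupted
    inputs -- the consistency score described in the summary above."""
    orig_emb = encoder.encode(originals, convert_to_numpy=True)
    out_emb = encoder.encode(outputs, convert_to_numpy=True)
    sims = [cosine(o, c) for o, c in zip(orig_emb, out_emb)]
    return float(np.mean(sims))


if __name__ == "__main__":
    encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder choice
    captions = ["A dog runs across a grassy field.",
                "Two people ride bicycles along the beach."]
    # In the real benchmark these would be the LMM's outputs on corrupted
    # inputs; here we fake them by corrupting the captions directly.
    outputs = [corrupt_text(c) for c in captions]
    print(f"mean consistency: {self_consistency(captions, outputs, encoder):.3f}")

For image outputs the same recipe applies with a joint text-image encoder (for example a CLIP-style model) so that a caption and a generated image can be embedded in the same space; this choice of encoder is likewise an assumption of the sketch.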