Benchmarking Large Multimodal Models against Common Corruptions


22 Jan 2024 | Jiawei Zhang, Tianyu Pang, Chao Du, Yi Ren, Bo Li, Min Lin
This technical report addresses the lack of comprehensive evaluation of large multimodal models (LMMs) under common corruptions by examining their self-consistency when generating outputs across the text, image, and speech modalities. It introduces MMCBench, a benchmark covering four key generation tasks: text-to-image, image-to-text, text-to-speech, and speech-to-text. The evaluation spans more than 100 popular LMMs with over 150 model checkpoints, focusing on cross-modal interactions and robustness to various corruptions.

The benchmarking process selects representative examples from large datasets such as LAION and Common Voice, applies common corruptions to them, and then feeds the corrupted inputs to the models. Self-consistency is measured either through cross-modal similarity between the clean input and the output generated from the corrupted input, or through consistency within the output modality, depending on whether a suitable cross-modal model is available. The report details the methodology for each of the four tasks, including the choice of corruptions, the data selection strategies, and the evaluation metrics.

The results show how different models perform under various corruption levels and selection methods, underscoring the importance of robustness for practical applications. The report concludes with a discussion of the significance of self-consistency in LMMs and the potential for future improvements and updates to the benchmark.
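To make the self-consistency idea concrete, the sketch below scores an image-to-text model by comparing the caption it produces from a corrupted image against the clean image using cross-modal similarity. This is a minimal illustration, not the report's implementation: it assumes CLIP (via the open_clip library) as the cross-modal similarity model, and the `lmm.generate_caption` interface is a hypothetical stand-in for whichever LMM is being benchmarked.

```python
# Minimal sketch of self-consistency scoring for an image-to-text model.
# Assumptions: open_clip supplies the cross-modal similarity model, and
# `lmm.generate_caption` is a hypothetical wrapper around the LMM under test.
import torch
import open_clip

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
clip_model = clip_model.to(device).eval()


@torch.no_grad()
def self_consistency(clean_image, corrupted_image, lmm):
    """Score how consistent an image-to-text model stays under corruption.

    The caption generated from the corrupted image is embedded with CLIP's
    text encoder and compared against the clean image's embedding; a robust
    model keeps this cosine similarity high.
    """
    caption = lmm.generate_caption(corrupted_image)  # hypothetical LMM call
    image_feat = clip_model.encode_image(
        preprocess(clean_image).unsqueeze(0).to(device)
    )
    text_feat = clip_model.encode_text(tokenizer([caption]).to(device))
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    return (image_feat @ text_feat.T).item()
```

Averaging this score over the selected examples and corruption levels gives a per-model robustness number that can be compared across checkpoints, which mirrors the kind of aggregation the benchmark performs.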