IsoBench: Benchmarking Multimodal Foundation Models on Isomorphic Representations

18 Aug 2024 | Deqing Fu, Ruohao Guo, Ghazal Khalighinejad, Ollie Liu, Bhuvan Dhingra, Dani Yogatama, Robin Jia, Willie Neiswanger
**Affiliations:** USC, Georgia Tech, Duke

**Abstract:** This paper introduces IsoBench, a benchmark dataset designed to evaluate multimodal foundation models on problems with multiple isomorphic representations. The dataset spans four major areas: math, science, algorithms, and games, with 1,887 examples in total. Each example is presented in multiple input representations, such as visual, textual, and mathematical, to assess how models handle different modalities. The study finds that multimodal models generally perform better with textual representations than with visual ones, contrary to human preferences. To address this discrepancy, the authors propose IsoCombination (IsoCB) and IsoScratchPad (IsoSP), techniques that respectively combine multiple input modalities or translate visual inputs into textual representations. These techniques improve model performance: IsoCB lifts accuracy on graph algorithm problems by up to 9.4 points, and IsoSP improves science problems by up to 14.4 points.

**Key Findings:**
- **Performance Discrepancies:** Multimodal models consistently perform better on textual representations than on visual ones, with significant performance gaps observed across tasks.
- **Techniques to Improve Performance:** IsoCombination and IsoScratchPad effectively bridge the performance gap between input modalities, improving model performance in specific domains.

**Contributions:**
- **IsoBench Dataset:** A comprehensive benchmark of 1,887 examples across four domains, providing fine-grained feedback on performance gaps caused by input modality.
- **Model Evaluation:** Benchmarking of popular multimodal foundation models (GPT-4, Gemini, Claude 3) on IsoBench, revealing a consistent preference for textual representations.
- **Techniques to Enhance Performance:** Introduction of IsoCombination and IsoScratchPad, which improve performance by combining or translating input modalities.

**Conclusion:** IsoBench provides a fine-grained analysis of how multimodal foundation models handle different input modalities, highlighting the advantage of textual representations. The proposed techniques offer practical ways to close the modality gap; a minimal sketch of both appears below.
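To make the two mitigation strategies concrete, here is a minimal sketch in Python, assuming a generic multimodal chat-completion API. The helper `call_model` and its signature are hypothetical illustrations, not part of the IsoBench release.

```python
def call_model(prompt: str, image: bytes | None = None) -> str:
    """Hypothetical wrapper around a multimodal foundation model API
    (e.g., a GPT-4, Gemini, or Claude client would go here)."""
    raise NotImplementedError

def solve_isocombination(question: str, image: bytes, text_repr: str) -> str:
    # IsoCombination (IsoCB): present multiple isomorphic representations
    # together, so the model can cross-check the image against the text.
    prompt = f"{question}\n\nTextual representation of the figure:\n{text_repr}"
    return call_model(prompt, image=image)

def solve_isoscratchpad(question: str, image: bytes) -> str:
    # IsoScratchPad (IsoSP): first translate the visual input into a textual
    # representation (the "scratchpad"), then solve using text alone.
    text_repr = call_model(
        "Translate this figure into a precise textual representation "
        "(e.g., an adjacency list for a graph).",
        image=image,
    )
    return call_model(f"{question}\n\n{text_repr}")
```

The key difference between the two: IsoCB gives the model both representations in a single prompt, while IsoSP serializes the pipeline so that the final reasoning step sees only text.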