17 Jun 2024 | Jieyu Zhang, Weikai Huang, Zixian Ma, Oscar Michel, Dong He, Tanmay Gupta, Wei-Chiu Ma, Ali Farhadi, Aniruddha Kembhavi, Ranjay Krishna
The paper introduces TASK-ME-ANYTHING, a benchmark generation engine designed to address user queries with specific evaluation objectives. TASK-ME-ANYTHING maintains an extendable taxonomy of visual assets and can programmatically generate a vast number of task instances. It also includes algorithms to efficiently approximate model performance within a computational budget. The system contains 113K images, 10K videos, 2K 3D object assets, over 365 object categories, 655 attributes, and 335 relationships, and is capable of generating 750M image/video question-answering pairs. The paper evaluates 13 open-source and 18 open-source and proprietary large multimodal language models (MLMs) using TASK-ME-ANYTHING, revealing insights into their perceptual capabilities, strengths, and weaknesses. Key findings include:
1. Open-source MLMs excel in object and attribute recognition but struggle with spatial and temporal understanding.
2. Each model exhibits unique strengths and weaknesses.
3. Larger models generally perform better, though exceptions exist.
4. GPT-4o struggles to recognize rotating/moving objects and to distinguish colors.
The paper also discusses the limitations of programmatically generated tasks, potential negative social impacts, and directions for future work.
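To make the two core ideas concrete — programmatically enumerating question-answering tasks from a taxonomy of assets, and approximating a model's performance on that task space within a fixed evaluation budget — here is a minimal toy sketch. The mini-taxonomy, function names, and the stand-in "model" are all illustrative assumptions, not the paper's actual codebase or API:

```python
import random

# Hypothetical mini-taxonomy of visual assets: object categories
# annotated with attributes (the real system covers 365+ categories
# and 655 attributes over images, videos, and 3D assets).
TAXONOMY = {
    "apple":  {"color": "red",    "shape": "round"},
    "banana": {"color": "yellow", "shape": "long"},
    "ball":   {"color": "blue",   "shape": "round"},
}

def generate_tasks(taxonomy):
    """Programmatically enumerate attribute-recognition QA pairs.

    Each task asks for one attribute of one object; the answer choices
    are the attribute values observed across all objects in the taxonomy.
    """
    tasks = []
    for obj, attrs in taxonomy.items():
        for attr, answer in attrs.items():
            choices = sorted({a[attr] for a in taxonomy.values()})
            tasks.append({
                "question": f"What is the {attr} of the {obj}?",
                "choices": choices,
                "answer": answer,
            })
    return tasks

def estimate_accuracy(tasks, model, budget, seed=0):
    """Approximate accuracy by evaluating a random subset of at most
    `budget` tasks instead of the full (potentially huge) task space."""
    rng = random.Random(seed)
    sample = rng.sample(tasks, min(budget, len(tasks)))
    correct = sum(model(t) == t["answer"] for t in sample)
    return correct / len(sample)

# A stand-in "model" that always picks the first answer choice.
def first_choice_model(task):
    return task["choices"][0]

tasks = generate_tasks(TAXONOMY)
acc = estimate_accuracy(tasks, first_choice_model, budget=4)
```

At the paper's scale (750M generated pairs), exhaustive evaluation is infeasible, which is why the engine pairs generation with budgeted approximation; the random-subsampling estimator above is only the simplest such scheme.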