17 Jun 2024 | Jieyu Zhang, Weikai Huang, Zixian Ma, Oscar Michel, Dong He, Tanmay Gupta, Wei-Chiu Ma, Ali Farhadi, Aniruddha Kembhavi, Ranjay Krishna
TASK-ME-ANYTHING is a benchmark generation engine that creates customized benchmarks based on user needs. It maintains an extendable taxonomy of visual assets and can programmatically generate a vast number of task instances. The system includes 113K images, 10K videos, 2K 3D object assets, over 365 object categories, 655 attributes, and 335 relationships, from which it can generate 750M image/video question-answering pairs for evaluating the perceptual capabilities of multimodal language models (MLMs). TASK-ME-ANYTHING reveals that open-source MLMs excel at object and attribute recognition but struggle with spatial and temporal understanding. Each model has unique strengths and weaknesses; larger models generally perform better, though exceptions exist. For instance, GPT-4o struggles with recognizing rotating or moving objects and with distinguishing colors.
Users can specify a computation budget, and the system uses algorithms to approximate model performance without invoking the MLM on every task instance. TASK-ME-ANYTHING represents visual content as spatio-temporal scene graphs and uses task generators to turn them into input-output pairs that probe specific capabilities. It supports user queries over models, task instances, and taxonomy concepts, such as finding the best-performing model, along with budget-aware methods for approximating query results. In total, it can generate over 750 million VQA task instances spanning 2D sticker images, 3D tabletop scenes, and real images and videos.
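To make the scene-graph-to-task pipeline concrete, here is a minimal sketch of the idea. All class, field, and function names are illustrative assumptions, not the actual TASK-ME-ANYTHING API: a scene graph records objects, their attributes, and pairwise relations, and a task generator enumerates the graph to emit multiple-choice question-answer pairs.

```python
from dataclasses import dataclass

@dataclass
class SceneGraph:
    """One scene's objects (with attributes) and pairwise spatial relations."""
    objects: dict[str, dict]               # e.g. {"obj1": {"category": "mug", "color": "red"}}
    relations: list[tuple[str, str, str]]  # e.g. [("obj1", "left of", "obj2")]

@dataclass
class SpatialRelationTaskGenerator:
    """Turns each relation edge into a 'what is <relation> the <anchor>?' question."""
    distractor_pool: list[str]             # wrong-answer categories for multiple choice

    def generate(self, graph: SceneGraph):
        for subj, rel, obj in graph.relations:
            answer = graph.objects[subj]["category"]
            anchor = graph.objects[obj]["category"]
            distractors = [c for c in self.distractor_pool if c != answer][:3]
            yield {"question": f"What object is {rel} the {anchor}?",
                   "answer": answer,
                   "options": [answer, *distractors]}

graph = SceneGraph(
    objects={"obj1": {"category": "mug", "color": "red"},
             "obj2": {"category": "laptop", "color": "gray"}},
    relations=[("obj1", "left of", "obj2")],
)
for task in SpatialRelationTaskGenerator(["fork", "plant", "book"]).generate(graph):
    print(task)   # -> 'What object is left of the laptop?', answer 'mug'
```

Because every generated instance is derived from a known scene graph, the ground-truth answer comes for free, which is what lets the engine scale to hundreds of millions of instances without human annotation.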
TASK-ME-ANYTHING evaluates 13 open-source MLMs on over 1M task instances and 18 open-source and proprietary MLMs on 8,400 task instances. The results confirm that open-source MLMs are strong at object and attribute recognition but struggle with counting and with spatial and temporal understanding. The system supports fine-grained queries, including top-K queries, threshold queries, model-comparison queries, and model-debugging queries, and provides efficient methods for approximating query results under budget constraints.
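One simple way to answer a top-K query under a budget is to estimate each model's accuracy from a random subsample of task instances rather than the full pool. The sketch below shows this plain Monte Carlo baseline; the paper studies more sophisticated approximation algorithms, and the function names here are hypothetical, not the system's implementation.

```python
import random

def estimate_accuracy(model, instances, budget, rng):
    """Monte Carlo estimate of one model's accuracy from a random subsample."""
    sample = rng.sample(instances, min(budget, len(instances)))
    correct = sum(model(t["question"], t["options"]) == t["answer"] for t in sample)
    return correct / len(sample)

def top_k_models(models, instances, k, budget_per_model, seed=0):
    """Approximate top-K query: rank models by estimated accuracy, return best K."""
    rng = random.Random(seed)
    scores = {name: estimate_accuracy(fn, instances, budget_per_model, rng)
              for name, fn in models.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]

# Usage with stub "models" (callables mapping a question to an answer string):
pool = [{"question": "What object is left of the laptop?",
         "options": ["mug", "fork", "plant", "book"], "answer": "mug"}]
models = {"always-first": lambda q, opts: opts[0],
          "always-last": lambda q, opts: opts[-1]}
print(top_k_models(models, pool, k=1, budget_per_model=1))
```

The budget here caps how many MLM calls each model receives, which is the expensive resource; the trade-off is estimation variance, which shrinks as the per-model budget grows.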
The authors validate the generated tasks by measuring human performance on them and evaluate the accuracy of the approximation methods. TASK-ME-ANYTHING enables detailed analysis of models' strengths and weaknesses on specific skills, such as object recognition, attribute recognition, and relation understanding, helping users select the best model for their needs and pinpoint weaknesses to improve. It also supports evaluating models across different visual inputs and finds that models' strengths and weaknesses are largely consistent across them. The system is versatile and scalable: new task generators, assets, and software can be added to expand its taxonomy. The authors acknowledge limitations, notably the focus on perceptual capabilities, and leave more complex reasoning tasks to future versions.
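The extensibility claim amounts to an interface contract: a new capability only requires one more generator over the shared scene-graph schema. The sketch below illustrates this under the same assumptions as the earlier sketches; none of these names come from the actual codebase.

```python
class CountingTaskGenerator:
    """Emits 'how many <category>?' questions from any scene graph whose
    `objects` attribute maps ids to attribute dicts (as in the first sketch)."""
    def generate(self, graph):
        counts: dict[str, int] = {}
        for obj in graph.objects.values():
            counts[obj["category"]] = counts.get(obj["category"], 0) + 1
        for category, n in counts.items():
            yield {
                "question": f"How many objects of category '{category}' are in the scene?",
                "answer": str(n),
                # Distractors bracket the true count; dedupe in case of overlap.
                "options": sorted({str(n), str(max(0, n - 1)), str(n + 1), str(n + 2)}),
            }
```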