ChatterBox: Multi-round Multimodal Referring and Grounding

24 Jan 2024 | Yunjie Tian*, Tianren Ma*, Lingxi Xie, Jihao Qiu, Xi Tang, Yuan Zhang, Jianbin Jiao, Qi Tian, Qixiang Ye
The paper introduces a new task called Multi-round Multimodal Referring and Grounding (MRG), which involves engaging in multi-round dialogues where the agent must understand and respond to referring expressions and visual grounding requests. The authors establish a new benchmark named CB-300K, which includes image-text datasets and an evaluation metric that assesses both visual and linguistic understanding. They propose ChatterBox, a vision-language model designed to handle MRG tasks. ChatterBox uses a two-branch architecture: one branch processes language to understand logical questions, and the other branch extracts visual features and performs visual grounding. The model is trained on CB-300K and other external datasets, and its performance is evaluated on various tasks, including multi-round dialogue, single-round referring expression, and visual grounding. ChatterBox demonstrates superior performance compared to existing models, showing strong capabilities in multi-round reasoning, visual recognition, and context integration. The paper also provides detailed descriptions of the dataset construction, model architecture, and experimental setup, along with qualitative and quantitative results to support the claims.
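
To make the two-branch idea concrete, below is a minimal PyTorch-style sketch. The module names (ChatterBoxSketch, language_branch, vision_branch), dimensions, pooling, and fusion step are assumptions for illustration only and do not reproduce ChatterBox's actual implementation; the sketch only shows how a language branch that encodes the dialogue can drive a vision branch that regresses a bounding box for grounding.

```python
# A minimal, hypothetical sketch of a two-branch referring-and-grounding model.
# Module names, dimensions, and the fusion strategy are assumptions; they do
# not reproduce the paper's architecture.
import torch
import torch.nn as nn

class ChatterBoxSketch(nn.Module):
    def __init__(self, vocab_size=32000, dim=512):
        super().__init__()
        # Language branch: encodes the dialogue (question + history) into a
        # query embedding that summarizes what should be grounded.
        self.token_embed = nn.Embedding(vocab_size, dim)
        self.language_branch = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        # Vision branch: extracts image features for localization.
        self.vision_branch = nn.Sequential(
            nn.Conv2d(3, dim, kernel_size=16, stride=16),  # patchify the image
            nn.Flatten(2),                                  # (B, dim, H*W)
        )
        # Grounding head: fuses the language query with pooled visual features
        # and regresses a normalized bounding box (cx, cy, w, h).
        self.box_head = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 4), nn.Sigmoid()
        )

    def forward(self, token_ids, image):
        text = self.language_branch(self.token_embed(token_ids))  # (B, T, dim)
        query = text.mean(dim=1)                                   # pooled query
        visual = self.vision_branch(image).mean(dim=2)             # (B, dim)
        fused = torch.cat([query, visual], dim=-1)
        return self.box_head(fused)                                # (B, 4)

# Example: one dialogue turn grounded on a 224x224 image.
model = ChatterBoxSketch()
boxes = model(torch.randint(0, 32000, (1, 16)), torch.randn(1, 3, 224, 224))
print(boxes.shape)  # torch.Size([1, 4])
```

In the paper, the language side additionally handles multi-round context (resolving references such as "it" or "the one on the left" against earlier turns), which is what distinguishes MRG from single-round referring expression comprehension.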