ChatterBox: Multi-round Multimodal Referring and Grounding

24 Jan 2024 | Yunjie Tian, Tianren Ma, Lingxi Xie, Jihao Qiu, Xi Tang, Yuan Zhang, Jianbin Jiao, Qi Tian, Qixiang Ye
ChatterBox is a multimodal model for multi-round referring and grounding (MRG): answering complex visual questions in a dialogue where later questions refer back to earlier turns and answers must be grounded to image regions. It is trained on CB-300K, a benchmark built from diverse image-text data and structured around multi-round dialogues with logical connections between questions.

The model uses a two-branch architecture: a visual branch extracts features from the image, while a language branch processes the dialogue text and generates answers; the two are combined to enable accurate visual grounding and referential understanding. Training proceeds in two stages that mix CB-300K with auxiliary external data, strengthening both instance-level understanding and multi-round dialogue ability.

In evaluation, ChatterBox outperforms existing models on MRG tasks, showing strong performance in both language quality and visual grounding accuracy. CB-300K itself provides a comprehensive testbed for MRG, covering challenges such as multi-round dialogue, complex spatial relationships, and consistent reasoning. The model's design allows for flexible and efficient training, making it a promising basis for multimodal dialogue systems.
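To make the two-branch design concrete, here is a minimal PyTorch sketch of the general pattern: a visual branch producing region tokens, a language branch encoding the dialogue, and a fusion head regressing a bounding box. All module names (VisualBranch, LanguageBranch, GroundingHead), layer choices, and dimensions are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of a two-branch referring-and-grounding model.
# Module names and dimensions are assumptions for illustration only.
import torch
import torch.nn as nn

class VisualBranch(nn.Module):
    """Extracts region-level features from an image (stand-in for a real backbone)."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=7, stride=4, padding=3),
            nn.ReLU(),
            nn.Conv2d(64, feat_dim, kernel_size=3, stride=2, padding=1),
        )

    def forward(self, images):                   # images: (B, 3, H, W)
        fmap = self.backbone(images)              # (B, C, H', W')
        return fmap.flatten(2).transpose(1, 2)    # (B, N, C) region tokens

class LanguageBranch(nn.Module):
    """Encodes dialogue history and the current question (stand-in for an LLM)."""
    def __init__(self, vocab_size=32000, feat_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, feat_dim)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(feat_dim, nhead=8, batch_first=True),
            num_layers=2,
        )

    def forward(self, token_ids):                 # token_ids: (B, T)
        return self.encoder(self.embed(token_ids))  # (B, T, C)

class GroundingHead(nn.Module):
    """Fuses text and region features, then regresses one box per query."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(feat_dim, num_heads=8, batch_first=True)
        self.box_mlp = nn.Sequential(
            nn.Linear(feat_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, 4),               # (cx, cy, w, h)
        )

    def forward(self, text_feats, region_feats):
        fused, _ = self.cross_attn(text_feats, region_feats, region_feats)
        return self.box_mlp(fused[:, -1])         # ground on the final text token

class TwoBranchMRG(nn.Module):
    def __init__(self):
        super().__init__()
        self.visual = VisualBranch()
        self.language = LanguageBranch()
        self.grounding = GroundingHead()

    def forward(self, images, token_ids):
        return self.grounding(self.language(token_ids), self.visual(images))

# Usage: one forward pass on dummy inputs.
model = TwoBranchMRG()
boxes = model(torch.randn(2, 3, 224, 224), torch.randint(0, 32000, (2, 16)))
print(boxes.shape)  # torch.Size([2, 4])
```

The key design point this sketch captures is the separation of concerns: the visual branch can be trained or swapped independently of the language branch, and only the lightweight grounding head needs to learn the cross-modal alignment, which is what makes staged training over mixed data sources practical.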