UMIE: Unified Multimodal Information Extraction with Instruction Tuning

5 Jan 2024 | Lin Sun*, Kai Zhang*, Qingyuan Li*, Renze Lou
The paper introduces UMIE (Unified Multimodal Information Extractor), a unified model designed to handle three multimodal information extraction (MIE) tasks: multimodal named entity recognition (MNER), multimodal relation extraction (MRE), and multimodal event extraction (MEE). UMIE leverages instruction tuning to cast these tasks as generation problems, enabling it to extract both textual and visual mentions. The model consists of a text encoder, a visual encoder, a gated attention module, and a text decoder. The gated attention module dynamically integrates visual features with textual features, enhancing cross-modal information extraction. Extensive experiments on six MIE datasets show that UMIE outperforms state-of-the-art (SoTA) methods across all tasks, demonstrating strong generalization and robustness to instruction variations. The model also exhibits excellent zero-shot performance, outperforming LLMs like ChatGPT and GPT-4. The paper highlights the effectiveness of the proposed model in handling various MIE tasks and its potential for future research in unified MIE and instruction tuning.
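To make the gated attention idea concrete, below is a minimal PyTorch sketch of a gated cross-modal fusion module of the kind the summary describes: text-token features attend over visual features, and a learned per-token gate decides how much of the attended visual signal is mixed back into the text representation before decoding. All names (`GatedAttentionFusion`, `d_model`, `n_heads`) and the exact gating formulation are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of gated cross-modal attention; not the paper's exact code.
import torch
import torch.nn as nn


class GatedAttentionFusion(nn.Module):
    """Fuse visual features into text features via cross-attention, then gate the result."""

    def __init__(self, d_model: int = 768, n_heads: int = 8):
        super().__init__()
        # Text tokens (queries) attend to visual patches (keys/values).
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Per-token scalar gate computed from text and attended visual features.
        self.gate = nn.Linear(2 * d_model, 1)

    def forward(self, text_feats: torch.Tensor, visual_feats: torch.Tensor) -> torch.Tensor:
        # text_feats:   (batch, text_len, d_model)  -- output of the text encoder
        # visual_feats: (batch, n_patches, d_model) -- output of the visual encoder
        attended, _ = self.cross_attn(query=text_feats, key=visual_feats, value=visual_feats)
        # Gate in [0, 1]: a small gate lets the model (softly) ignore irrelevant images.
        g = torch.sigmoid(self.gate(torch.cat([text_feats, attended], dim=-1)))
        # Dynamically weighted mix of textual and attended visual information.
        return text_feats + g * attended


# Usage sketch with random features of assumed shapes.
fusion = GatedAttentionFusion(d_model=768, n_heads=8)
text = torch.randn(2, 32, 768)     # 2 sentences, 32 tokens each
visual = torch.randn(2, 49, 768)   # 2 images, 49 patch embeddings each
fused = fusion(text, visual)       # (2, 32, 768), fed to the text decoder
```

The fused text-side representation would then be passed to the text decoder, which generates the task output (entities, relations, or event structures) as text, following the instruction-tuned generation framing described above.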