UMIE: Unified Multimodal Information Extraction with Instruction Tuning

5 Jan 2024 | Lin Sun, Kai Zhang, Qingyuan Li, Renze Lou
Abstract: Multimodal information extraction (MIE) has gained significant attention as multimedia content becomes more popular. However, current MIE methods often use task-specific model structures, leading to limited generalizability and underutilization of shared knowledge across tasks. To address these issues, we propose UMIE, a unified multimodal information extractor that unifies three MIE tasks as generation problems using instruction tuning. Extensive experiments show that our single UMIE outperforms various state-of-the-art methods across six MIE datasets on three tasks. Furthermore, in-depth analysis demonstrates UMIE's strong generalization in the zero-shot setting, robustness to instruction variants, and interpretability. Our research serves as an initial step towards a unified MIE model and initiates exploration of both instruction tuning and large language models within the MIE domain.

Introduction: Information extraction (IE) plays a crucial role in natural language processing. As multimedia content becomes more popular, multimodal information extraction (MIE) has drawn significant attention. MIE aims to deliver structured information of interest from multiple media sources, such as text, images, and potentially other modalities. It is a challenging task due to the inherent complexity of the media formats and the need to bridge cross-modal gaps. MIE includes multimodal named entity recognition (MNER), multimodal relation extraction (MRE), and multimodal event extraction (MEE). Current methods for MIE usually focus on a specific task.

Unified Multimodal Information Extractor: The Unified Multimodal Information Extractor (UMIE) consists of four major modules: 1) a text encoder for instruction following and text comprehension; 2) a visual encoder for visual representation; 3) gated attention for cross-modal representation; and 4) a text decoder for information extraction. UMIE uses a transformer-based encoder-decoder architecture to perform MIE and generate structured outputs in an auto-regressive fashion. For the textual input, which is prefixed with a task instruction, the text encoder produces text representations. For the image, the model gains visual understanding through the proposed visual encoder and a gated attention mechanism that dynamically integrates visual clues. Finally, the text decoder generates the structured results for the MIE tasks.
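This summary does not spell out the gated attention equations, so the following is a minimal PyTorch-style sketch of how the four modules could be wired together. All names (GatedCrossModalFusion, UMIESketch), the sigmoid-gate residual formulation, and the choice of backbone encoders are assumptions made for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn


class GatedCrossModalFusion(nn.Module):
    """Fuse visual features into text representations through a learned gate.

    Illustrative formulation: text states attend over visual patch features,
    and a sigmoid gate decides how much of the attended visual signal is
    added back to each text position.
    """

    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Linear(2 * d_model, d_model)

    def forward(self, text_states, visual_states):
        # text_states:   (batch, text_len, d_model) from the text encoder
        # visual_states: (batch, num_patches, d_model) from the visual encoder
        attended, _ = self.cross_attn(text_states, visual_states, visual_states)
        gate = torch.sigmoid(self.gate(torch.cat([text_states, attended], dim=-1)))
        return text_states + gate * attended  # gated residual fusion


class UMIESketch(nn.Module):
    """Skeleton of the four-module pipeline described above (hypothetical names)."""

    def __init__(self, text_encoder, visual_encoder, text_decoder, d_model=768):
        super().__init__()
        self.text_encoder = text_encoder      # e.g. a pretrained transformer encoder
        self.visual_encoder = visual_encoder  # e.g. a ViT-style image encoder
        self.fusion = GatedCrossModalFusion(d_model)
        self.text_decoder = text_decoder      # autoregressive transformer decoder

    def forward(self, instruction_text_ids, image_pixels, target_ids=None):
        text_states = self.text_encoder(instruction_text_ids)    # (B, T, d)
        visual_states = self.visual_encoder(image_pixels)        # (B, P, d)
        fused = self.fusion(text_states, visual_states)
        # The decoder generates the structured extraction result token by token,
        # conditioned on the fused cross-modal representation.
        return self.text_decoder(fused, target_ids)
```

Under this reading, the gate lets the model suppress visual features when the image is uninformative, which matches the stated goal of dynamic visual clue integration.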
Experiments: We train UMIE with instruction tuning on various MIE datasets and evaluate the model in both the supervised and zero-shot settings. In addition, we evaluate the robustness of UMIE's instruction following and showcase the unified extraction abilities of our model (an illustrative example of the instruction format appears at the end of this section).

Main Results: Table 4 shows the comparison in detail. Our UMIE achieves on-par or significantly better performance than the previous baselines. In particular, UMIE-XL achieves SoTA performance on all six datasets when compared with the best model on each dataset. UMIE marginally outperforms the previous SoTA.
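As noted in the Experiments section, a single model is trained with instruction tuning over MNER, MRE, and MEE data. Below is a hypothetical illustration of how examples from the three tasks could be serialized into a shared instruction + text (+ image) -> target generation format; the instruction wording, the label names, and the output serialization are illustrative assumptions, not the exact templates used by UMIE.

```python
# Hypothetical instruction-tuning records for the three MIE tasks.
# Field names, instructions, and target serialization are illustrative only.
examples = [
    {   # Multimodal named entity recognition (MNER)
        "instruction": "Extract all named entities and their types from the text, "
                       "using the image as additional context.",
        "text": "Kobe Bryant scored 60 points in his final game at Staples Center.",
        "image": "game_photo.jpg",
        "target": "Kobe Bryant: person | Staples Center: location",
    },
    {   # Multimodal relation extraction (MRE)
        "instruction": "Identify the relation between the two marked entities, "
                       "using the image as additional context.",
        "text": "<e1>Kobe Bryant</e1> played for the <e2>Lakers</e2>.",
        "image": "game_photo.jpg",
        "target": "member_of",
    },
    {   # Multimodal event extraction (MEE)
        "instruction": "Extract the event trigger, event type, and arguments, "
                       "using the image as additional context.",
        "text": "Protesters marched through downtown on Saturday.",
        "image": "protest_photo.jpg",
        "target": "trigger: marched | type: Demonstrate | participant: Protesters | place: downtown",
    },
]


def to_model_input(example):
    """Concatenate the task instruction and the sentence into one text sequence;
    the image is passed separately to the visual encoder."""
    prompt = f"{example['instruction']} Text: {example['text']}"
    return prompt, example["image"], example["target"]


for ex in examples:
    prompt, image_path, target = to_model_input(ex)
    print(prompt, "->", target)
```

Because every task shares the same text-to-text interface, a single decoder can produce all three kinds of structured output, which is what allows knowledge to be shared across tasks.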