MIKE: A New Benchmark for Fine-grained Multimodal Entity Knowledge Editing


18 Feb 2024 | Jiaqi Li, Miaozeng Du, Chuanyi Zhang, Yongrui Chen, Nan Hu, Guilin Qi, Haiyun Jiang, Siyuan Cheng, Bozhong Tian
The paper introduces MIKE, a comprehensive benchmark and dataset for fine-grained (FG) multimodal entity knowledge editing in Multimodal Large Language Models (MLLMs). Unlike existing benchmarks, which focus on coarse-grained knowledge, MIKE specifically targets the challenges of FG entity recognition and editing, which are crucial for practical applications. The benchmark comprises three main tasks: Vanilla Name Answering (VNA), Entity-Level Caption (ELC), and Complex-Scenario Recognition (CSR), each tailored to assess a different aspect of MLLM performance.
The paper also introduces a new form of knowledge editing, Multi-Step Editing, to evaluate the efficiency of editing methods. Extensive experiments with two MLLMs, BLIP-2 and MiniGPT-4, show that current state-of-the-art methods struggle with FG knowledge editing, highlighting the need for novel approaches in this domain. The findings indicate that Entity-Level Caption is the most challenging task and that the different generality tasks probe edited models' abilities in different ways. The paper also examines the impact of model size and image augmentations on performance, and concludes with a call for future research to address these limitations and extend the benchmark.
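To make the Multi-Step Editing setting concrete, the following is a minimal toy sketch of such an evaluation loop. It is an assumption based on the summary above, not the paper's actual method or code: the "model" is a plain dictionary mapping image IDs to entity names, and all function and variable names (`apply_edit`, `answer`, `multi_step_editing`, the sample entities) are hypothetical.

```python
# Toy sketch of a multi-step knowledge-editing evaluation (hypothetical, not the paper's code).
# A real setup would edit MLLM parameters; here a dict stands in for the edited model.

def apply_edit(memory, image_id, entity_name):
    """Insert one fine-grained (image -> entity name) fact into the toy 'model'."""
    updated = dict(memory)
    updated[image_id] = entity_name
    return updated

def answer(memory, image_id):
    """Vanilla-Name-Answering-style query: return the entity name for an image, if known."""
    return memory.get(image_id, "unknown")

def multi_step_editing(memory, edits):
    """Apply a sequence of edits, then measure how many of them the model still retains."""
    for image_id, name in edits:
        memory = apply_edit(memory, image_id, name)
    retained = sum(1 for image_id, name in edits if answer(memory, image_id) == name)
    return memory, retained / len(edits)

edits = [("img_001", "entity_a"), ("img_002", "entity_b")]
memory, retention = multi_step_editing({}, edits)
```

In this trivial dictionary setting retention is always perfect; the point of the protocol, as the summary describes it, is that real editing methods applied to MLLM weights can degrade earlier edits as later ones accumulate, so retention after many steps measures editing efficiency.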