Towards Explainable Harmful Meme Detection through Multimodal Debate between Large Language Models

May 13-17, 2024 | Hongzhan Lin, Ziyang Luo, Wei Gao, Jing Ma, Bo Wang, Ruichao Yang
This paper proposes an explainable approach to harmful meme detection through a multimodal debate between large language models (LLMs). The goal is not only to detect harmful memes but also to explain the detection decisions. The approach first generates conflicting rationales from the harmless and harmful perspectives through a multimodal debate between LLMs. These rationales are then used to fine-tune a small language model as a judge that infers harmfulness, fusing the harmfulness rationales with the intrinsic multimodal information within the meme. This enables dialectical reasoning over intricate and implicit harm-indicative patterns, drawing on multimodal explanations from both the harmless and harmful arguments. Evaluated on three public meme datasets, the approach outperforms state-of-the-art methods while providing informative explanations for its harmfulness predictions. Such explanations are crucial for content moderation on social media, as both moderators and users may want to understand why a flagged meme is considered harmful.
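The debate stage can be pictured as prompting an LLM twice over the same meme: once to argue that it is harmless and once to argue that it is harmful. The sketch below is a minimal illustration of that idea; the prompt wording, the use of the OpenAI chat API, the gpt-3.5-turbo model choice, and representing the meme as its overlaid text plus an image caption are assumptions for illustration, not the paper's exact setup.

```python
# Sketch of the multimodal debate stage: elicit conflicting Chain-of-Thought
# rationales for the same meme from the harmless and harmful perspectives.
# Prompt text, model name, and meme representation are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

DEBATE_PROMPT = (
    'Meme text: "{text}"\n'
    'Image description: "{caption}"\n'
    "Argue step by step that this meme is {stance}, then give a concise "
    "rationale supporting the {stance} reading."
)

def generate_rationale(text: str, caption: str, stance: str) -> str:
    """Prompt the LLM for one side of the debate (CoT-style rationale)."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # placeholder model choice
        messages=[{
            "role": "user",
            "content": DEBATE_PROMPT.format(text=text, caption=caption, stance=stance),
        }],
    )
    return response.choices[0].message.content

def debate(text: str, caption: str) -> dict:
    """Collect the conflicting harmless vs. harmful rationales for one meme."""
    return {stance: generate_rationale(text, caption, stance)
            for stance in ("harmless", "harmful")}
```

The two rationales returned by `debate` are what the next stage feeds, together with the meme itself, into the judge model.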
Concretely, the approach leverages the text generation capacity of LLMs via Chain-of-Thought (CoT) prompting to produce rationales from both the harmless and harmful perspectives. A small language model is then fine-tuned as a judge for harmfulness prediction, aligning the multimodal features of the meme with the harmfulness rationales. Across the three datasets, the model outperforms existing methods in accuracy and macro-averaged F1 score, and its explanations support content moderation. Ablation studies show that both the multimodal debate and the fusion mechanism are crucial to detection performance, while automatic and human evaluations confirm that the generated explanations are informative, concise, and persuasive. Together, these results suggest the approach can serve as a universal framework for harmful meme detection and explanation.
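As a rough picture of the judge stage, the sketch below fuses a meme-level feature vector with token-level features of the two debated rationales through cross-attention and classifies the result as harmless or harmful. The encoders, feature dimensions, and cross-attention fusion are illustrative assumptions; the paper's judge is a fine-tuned small language model, not this exact architecture.

```python
# Minimal sketch of the judge stage: fuse meme features with the harmless and
# harmful rationale features and predict harmfulness. Dimensions and the
# cross-attention fusion are assumptions for illustration only.
import torch
import torch.nn as nn

class RationaleJudge(nn.Module):
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        # The meme feature (e.g. from a frozen image/text encoder) attends
        # over the token-level features of both debated rationales.
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.classifier = nn.Sequential(
            nn.LayerNorm(dim),
            nn.Linear(dim, dim), nn.GELU(),
            nn.Linear(dim, 2),  # harmless vs. harmful logits
        )

    def forward(self, meme_feat, harmless_feat, harmful_feat):
        # meme_feat: (B, D); rationale features: (B, T, D) each
        query = meme_feat.unsqueeze(1)                          # (B, 1, D)
        rationales = torch.cat([harmless_feat, harmful_feat], dim=1)
        fused, _ = self.cross_attn(query, rationales, rationales)
        return self.classifier(fused.squeeze(1))                # (B, 2)

# Toy usage with random tensors standing in for real encoder outputs.
judge = RationaleJudge()
logits = judge(torch.randn(4, 512), torch.randn(4, 32, 512), torch.randn(4, 32, 512))
loss = nn.CrossEntropyLoss()(logits, torch.randint(0, 2, (4,)))
```

In a real pipeline, the random tensors in the toy usage would be replaced by features extracted from the meme's image and text and from the judge's own encoding of the two rationales, with cross-entropy loss over harmfulness labels used for fine-tuning.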