**Abstract:**
Social media platforms are hubs for multimodal information exchange, encompassing text, images, and videos, which makes it challenging for machines to comprehend the information and emotions embedded in online interactions. Multimodal Large Language Models (MLLMs) have emerged as a promising solution to these challenges, yet they still struggle to accurately interpret human emotions and complex content such as misinformation. This paper introduces MM-Soc, a comprehensive benchmark designed to evaluate MLLMs' understanding of multimodal social media content. MM-Soc compiles prominent multimodal datasets and incorporates a novel large-scale YouTube tagging dataset, covering tasks that range from misinformation and hate speech detection to social context generation. Through an exhaustive evaluation of ten size-variants of four open-source MLLMs, significant performance disparities are identified, highlighting the need for advancements in models' social understanding capabilities. The analysis reveals that, in a zero-shot setting, MLLMs of various types generally struggle with social media tasks; after fine-tuning, however, they show marked performance gains, suggesting viable pathways for improvement. The code and data are available at <https://github.com/claws-lab/MM-Soc.git>.
Social media platforms have become the epicenter of multimodal information exchange, blending content formats such as text, images, and videos. They serve not only as channels for sharing news and personal experiences but also as conduits for spreading rumors and shaping public opinion. Multimodal Large Language Models (MLLMs) have recently emerged as powerful tools for bridging natural language understanding and visual perception, showcasing their potential across a range of tasks. However, tasks such as understanding human emotions, interpreting memes, and verifying misinformation pose significant evaluation challenges for MLLMs: they require not only combining signals from the textual and visual modalities but also weighing the surrounding social context when judging contextual appropriateness or correctness.
MM-Soc is a novel benchmark designed to rigorously assess the capabilities of MLLMs across diverse tasks typical of social media environments. It comprises 10 multimodal tasks: 7 image-text classification tasks, 2 generative tasks, and a text extraction task. The benchmark targets open-source MLLMs, recognizing their advantages in rapid deployment, reduced operational costs, and maintaining data integrity. Through MM-Soc, a thorough and systematic examination of MLLMs is conducted, exploring and validating methodologies that improve MLLM performance on multimodal tasks. The results highlight the current limitations of MLLMs and point to directions for future research.
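As a rough illustration of how an evaluation over such a task suite might be organized, the sketch below defines a minimal task description and a zero-shot evaluation loop. This is not the benchmark's actual code: the task names and instructions are an illustrative subset drawn from the tasks mentioned above, the label matching is deliberately simplistic, and `model_fn` stands in for whatever image+text-to-text MLLM is being evaluated.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class SocialTask:
    """One benchmark task: a name, a task type, and a zero-shot instruction."""
    name: str
    kind: str          # "classification", "generation", or "extraction"
    instruction: str

# Illustrative subset of tasks named in the text; exact prompts and
# label spaces here are assumptions, not the benchmark's real ones.
TASKS: List[SocialTask] = [
    SocialTask("misinformation_detection", "classification",
               "Does this image-text pair contain misinformation? Answer yes or no."),
    SocialTask("hate_speech_detection", "classification",
               "Is this post hateful? Answer yes or no."),
    SocialTask("tagging", "classification",
               "Select the tags that apply to this YouTube video from the given list."),
]

def zero_shot_eval(model_fn: Callable[[str, bytes], str],
                   task: SocialTask,
                   examples: List[Dict]) -> float:
    """Zero-shot protocol: prompt the MLLM with the task instruction plus the
    post's text and image, then compare its answer against the gold label."""
    correct = 0
    for ex in examples:
        prompt = f"{task.instruction}\nPost text: {ex['text']}"
        prediction = model_fn(prompt, ex["image"])   # any image+text -> text MLLM
        correct += int(prediction.strip().lower() == ex["label"].lower())
    return correct / max(len(examples), 1)
```

In practice, `model_fn` would wrap an open-source MLLM (e.g., a LLaVA-style model), and the same loop could be rerun on its fine-tuned variant to compare zero-shot and fine-tuned performance task by task.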
The deployment of MLLMs as general-purpose assistants across social networks marks a significant shift from traditional, specialized models designed for singular tasks. This transition requires a comprehensive skill set that enables models to navigate the multifaceted challenges presented by user-generated content. MM-Soc therefore spans both natural language understanding and generation tasks, designed to test the models' ability to interact with the user-generated content encountered online. The selection includes binary classification tasks such as misinformation and hate speech detection, alongside generative and text extraction tasks.