November 1-4, 2023 | Yejin Bang*, Samuel Cahyawijaya, Nayeon Lee, Wenliang Dai, Dan Su, Bryan Wilie, Holy Lovenia, Ziwei Ji, Tiezheng Yu, Willy Chung, Quyet V. Do, Yan Xu, Pascale Fung*
This paper presents a comprehensive evaluation of ChatGPT's performance across multiple tasks, languages, and modalities, using 23 datasets covering 8 NLP tasks and focusing on reasoning, hallucination, and interactivity. ChatGPT outperforms zero-shot learning models on most tasks and even surpasses fine-tuned models on some.

Multilingually, ChatGPT performs well in high-resource languages but struggles with low-resource ones; it excels at understanding non-Latin scripts but struggles to generate them. On the multimodal side, it can generate visual content from text via intermediate code generation, bridging vision and language, though this ability is still elementary.

On reasoning, ChatGPT achieves 63.41% accuracy across logical, non-textual, and commonsense reasoning tasks, indicating it is not a reliable reasoner: it is weak at inductive reasoning, lacks spatial and mathematical reasoning, and performs better on commonsense reasoning than on non-textual semantic reasoning. Like other LLMs, it suffers from hallucination, in particular generating extrinsic hallucinations. Its interactivity, however, allows it to perform multiple tasks within a dialog session and enables human collaboration to improve its output: multi-turn prompt engineering yields an 8% ROUGE-1 improvement in summarization and a 2% ChrF++ improvement in machine translation. The paper thus provides a benchmark of ChatGPT's strengths and limitations and suggests that further research is needed to address its reasoning and hallucination weaknesses.
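For context on the summarization gains reported above, ROUGE-1 measures unigram overlap between a candidate summary and a reference. The paper uses standard ROUGE tooling; the following is only a minimal sketch of the ROUGE-1 F1 computation (whitespace tokenization, no stemming), not the exact implementation used in the evaluation:

```python
from collections import Counter

def rouge1_f1(reference: str, candidate: str) -> float:
    """Unigram-overlap F1 between a reference and a candidate summary.

    Simplified sketch: lowercased whitespace tokens, clipped counts,
    harmonic mean of precision and recall.
    """
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

An 8% ROUGE-1 improvement means this overlap score rises by 8 points (or percent, depending on convention) after interactive refinement. ChrF++, the machine-translation metric mentioned alongside it, is analogous but computed over character n-grams plus word n-grams.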
The evaluation is conducted using publicly available datasets and provides insights into ChatGPT's performance in various NLP tasks.
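The multi-turn prompt engineering described above amounts to carrying the model's previous answer and the human's feedback forward as dialog history. A minimal sketch of such a refinement loop, assuming chat-style role/content messages; `ask_model` is a hypothetical stand-in for an actual chat-model API call, not part of the paper's code:

```python
from typing import Callable, Dict, List

Message = Dict[str, str]

def refine_summary(document: str,
                   feedbacks: List[str],
                   ask_model: Callable[[List[Message]], str]) -> str:
    """Iteratively refine a summary over multiple dialog turns.

    Each round appends the model's last summary and the human's
    feedback to the message history, then asks for a new summary.
    """
    messages: List[Message] = [
        {"role": "user", "content": f"Summarize the following text:\n{document}"}
    ]
    summary = ask_model(messages)
    for feedback in feedbacks:
        messages.append({"role": "assistant", "content": summary})
        messages.append({"role": "user", "content": feedback})
        summary = ask_model(messages)
    return summary
```

Because the full history is resent each turn, the model can condition its revision on both the source document and every piece of feedback so far, which is the mechanism behind the reported summarization and translation gains.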