November 1-4, 2023 | Yejin Bang*, Samuel Cahyawijaya, Nayeon Lee, Wenliang Dai, Dan Su, Bryan Wilie, Holy Lovenia, Ziwei Ji, Tiezheng Yu, Willy Chung, Quyet V. Do, Yan Xu, Pascale Fung*
This paper presents a comprehensive evaluation of ChatGPT's performance across multiple tasks, languages, and modalities, using 23 datasets covering 8 NLP tasks and focusing on reasoning, hallucination, and interactivity. ChatGPT outperforms zero-shot learning models on most tasks and even surpasses fine-tuned models on some.

Multilingually, ChatGPT performs well in high-resource languages but struggles with low-resource ones; it excels at understanding non-Latin scripts but struggles to generate them. On the multimodal side, it can generate visual content from text via intermediate code generation, bridging vision and language, though this ability is still elementary.

On reasoning, ChatGPT achieves 63.41% accuracy across logical, non-textual, and commonsense reasoning tasks, indicating it is not a reliable reasoner: it is weak at inductive reasoning, lacks spatial and mathematical reasoning, and performs better on commonsense reasoning than on non-textual semantic reasoning. Like other LLMs, it suffers from hallucination, in particular generating extrinsic hallucinations. Its interactivity, however, allows it to perform multiple tasks within a dialog session and enables human collaboration to improve its output: multi-turn prompt engineering yields an 8% ROUGE-1 improvement in summarization and a 2% ChrF++ improvement in machine translation. The paper thus provides a benchmark of ChatGPT's strengths and limitations and suggests that further research is needed to address its reasoning and hallucination weaknesses.
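For context on the summarization gains reported above, ROUGE-1 measures unigram overlap between a candidate summary and a reference. The paper uses standard ROUGE tooling; the following is only a minimal sketch of the ROUGE-1 F1 computation (whitespace tokenization, no stemming), not the exact implementation used in the evaluation:

```python
from collections import Counter

def rouge1_f1(reference: str, candidate: str) -> float:
    """Unigram-overlap F1 between a reference and a candidate summary.

    Simplified sketch: lowercased whitespace tokens, clipped counts,
    harmonic mean of precision and recall.
    """
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

An 8% ROUGE-1 improvement means this overlap score rises by 8 points (or percent, depending on convention) after interactive refinement. ChrF++, the machine-translation metric mentioned alongside it, is analogous but computed over character n-grams plus word n-grams.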
The evaluation is conducted using publicly available datasets and provides insights into ChatGPT's performance in various NLP tasks.
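The multi-turn prompt engineering described above amounts to carrying the model's previous answer and the human's feedback forward as dialog history. A minimal sketch of such a refinement loop, assuming chat-style role/content messages; `ask_model` is a hypothetical stand-in for an actual chat-model API call, not part of the paper's code:

```python
from typing import Callable, Dict, List

Message = Dict[str, str]

def refine_summary(document: str,
                   feedbacks: List[str],
                   ask_model: Callable[[List[Message]], str]) -> str:
    """Iteratively refine a summary over multiple dialog turns.

    Each round appends the model's last summary and the human's
    feedback to the message history, then asks for a new summary.
    """
    messages: List[Message] = [
        {"role": "user", "content": f"Summarize the following text:\n{document}"}
    ]
    summary = ask_model(messages)
    for feedback in feedbacks:
        messages.append({"role": "assistant", "content": summary})
        messages.append({"role": "user", "content": feedback})
        summary = ask_model(messages)
    return summary
```

Because the full history is resent each turn, the model can condition its revision on both the source document and every piece of feedback so far, which is the mechanism behind the reported summarization and translation gains.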