1 Feb 2024
Can Large Language Models Understand Context?
Yilun Zhu, Joel Ruben Antony Moniz, Shruti Bhargava, Jiarui Lu, Dhivya Piraviperumal, Site Li, Yuan Zhang, Hong Yu, Bo-Hsiang Tseng
Abstract: Understanding context is key to understanding human language, an ability that Large Language Models (LLMs) have increasingly been shown to demonstrate to an impressive extent. However, though the evaluation of LLMs encompasses various domains within the realm of Natural Language Processing, limited attention has been paid to probing their linguistic capability of understanding contextual features. This paper introduces a context understanding benchmark by adapting existing datasets to suit the evaluation of generative models. The benchmark comprises four distinct tasks and nine datasets, all featuring prompts designed to assess the models' ability to understand context. First, we evaluate the performance of pre-trained LLMs under the in-context learning scenario. Experimental results indicate that pre-trained dense models struggle to understand more nuanced contextual features when compared to state-of-the-art fine-tuned models. Second, as LLM compression holds growing significance in both research and real-world applications, we assess the context understanding of quantized models under in-context learning settings. We find that 3-bit post-training quantization leads to varying degrees of performance reduction on our benchmark. We conduct an extensive analysis of these scenarios to substantiate our experimental results.
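To make the two evaluation settings described above concrete, the sketch below shows how a pre-trained causal LM might be probed with a few-shot (in-context learning) prompt, optionally loaded under post-training quantization. This is a minimal illustration under stated assumptions, not the paper's actual evaluation harness: the model name, the coreference example, and the use of 4-bit bitsandbytes quantization (the paper studies 3-bit quantization) are all illustrative choices.

```python
# Minimal sketch (not the paper's harness): probe a pre-trained causal LM with a
# few-shot prompt, optionally under post-training quantization. Model name, prompt,
# and the 4-bit bitsandbytes config are assumptions; the paper's 3-bit setting is
# not reproduced here.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-hf"   # placeholder model
quant_config = BitsAndBytesConfig(       # stand-in for post-training quantization
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=quant_config, device_map="auto"
)

# A 1-shot in-context learning prompt for coreference resolution.
prompt = (
    "Resolve the pronoun to its antecedent.\n"
    "Text: Mary met Sue, and she thanked her for the gift. "
    "Question: Who does 'she' refer to? Answer: Mary\n"
    "Text: The trophy doesn't fit in the suitcase because it is too big. "
    "Question: What does 'it' refer to? Answer:"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=5, do_sample=False)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```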
Introduction: Discourse understanding, one of the fundamental problems in NLP, focuses on modeling linguistic features and structures that go beyond individual sentences. Understanding discourse requires resolving relations between words or phrases (coreference resolution) and between discourse units (discourse parsing and discourse relation classification) in the preceding context, identifying information that carries over to the following context (dialogue state tracking), and recognizing discourse-specific phenomena such as ellipsis.
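As a small illustration of how such discourse phenomena can be posed to a generative model, the snippet below verbalizes a toy dialogue state tracking turn into a prompt. The dialogue, slot names, and template are hypothetical and only meant to show how carry-over context is exposed to the model; they do not reflect the benchmark's actual format.

```python
# Hypothetical illustration: framing a dialogue state tracking turn as a
# text-generation prompt. Dialogue, slot names, and template are made up to show
# how carry-over context is presented to the model.
def build_dst_prompt(history: list[str], slots: list[str]) -> str:
    dialogue = "\n".join(history)
    slot_list = ", ".join(slots)
    return (
        f"Track the values of the following slots from the dialogue: {slot_list}\n"
        f"Dialogue:\n{dialogue}\n"
        "State:"
    )

history = [
    "User: I need a cheap restaurant in the city centre.",
    "System: Curry Garden is a cheap Indian place in the centre. Shall I book it?",
    "User: Yes, for 4 people at 7pm.",  # earlier constraints must carry over to the state
]
print(build_dst_prompt(history, ["price range", "area", "food", "book people", "book time"]))
```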
LLMs have garnered substantial attention from both academia and industry due to their remarkable capability in comprehending language and world knowledge. Their strong performance across a diverse range of benchmarks and datasets has firmly established their significance in a relatively short period of time. As LLMs continue to push the boundaries of scale and capability, evaluating their multifaceted abilities becomes an equally vital endeavor, making the development of robust evaluation methodologies that assess specific aspects of LLMs imperative. These methodologies should help build a comprehensive understanding of the models' advancement while clearly delineating their limitations. However, recently published LLMs, such as OPT, LLaMA, and GPT-4, are evaluated on only a limited set of benchmarks, and these evaluations share a significant drawback: they neglect discourse-related datasets, thereby limiting a comprehensive assessment of the models' language understanding capabilities.
To provide a comprehensive evaluation, a wide range of benchmarks and datasets address various facets of language understanding, including benchmarks that probe common-sense knowledge as well as linguistic capabilities such as sentiment analysis, natural language inference, summarization, text classification, and more. These general benchmarks and specific dataset evaluations exhibit