1 Feb 2024 | Yilun Zhu1*, Joel Ruben Antony Moniz2, Shruti Bhargava2, Jiarui Lu2, Dhivya Piraviperumal2, Site Li2, Yuan Zhang2, Hong Yu2, Bo-Hsiang Tseng2
This paper introduces a context understanding benchmark for evaluating how well Large Language Models (LLMs) comprehend contextual features. The benchmark spans four tasks and nine datasets, and probes LLMs' ability to understand context through in-context learning (ICL); a minimal prompt sketch is given after the findings list below. The study evaluates pre-trained dense LLMs of varying sizes as well as models compressed with 3-bit post-training quantization. Key findings include:
1. **LLM Performance under ICL**: Pre-trained dense models struggle with nuanced contextual features compared to state-of-the-art fine-tuned models.
2. **Model Compression Impact**: 3-bit post-training quantization leads to varying degrees of performance reduction on the benchmark, with some tasks showing marginal performance drops and others experiencing significant reductions.
3. **Task-Specific Analysis**: The paper provides detailed analysis of each task, including coreference resolution, dialogue state tracking, implicit discourse relation classification, and query rewriting, highlighting the challenges and improvements across different model sizes and compression techniques.
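Below is a minimal sketch of how a few-shot (ICL) prompt for one of the benchmark tasks, query rewriting, might be assembled. The instruction wording, the demonstration dialogues, and the `build_prompt` helper are illustrative assumptions, not the paper's actual prompt format; the same pattern (instruction + demonstrations + test instance) applies to the other tasks.

```python
# Hypothetical few-shot prompt construction for query rewriting.
# The demonstrations and instruction text below are invented for illustration.
FEW_SHOT_EXAMPLES = [
    {
        "context": "User: Who directed Inception?\nAssistant: Christopher Nolan.",
        "query": "When was it released?",
        "rewrite": "When was Inception released?",
    },
    {
        "context": "User: Find flights to Boston on Friday.\nAssistant: I found 12 flights.",
        "query": "What about Saturday?",
        "rewrite": "Find flights to Boston on Saturday.",
    },
]

def build_prompt(context: str, query: str) -> str:
    """Concatenate an instruction, k demonstrations, and the test instance."""
    parts = ["Rewrite the final query so that it is self-contained, "
             "resolving any references to the dialogue context.\n"]
    for ex in FEW_SHOT_EXAMPLES:
        parts.append(f"Context:\n{ex['context']}\nQuery: {ex['query']}\n"
                     f"Rewrite: {ex['rewrite']}\n")
    parts.append(f"Context:\n{context}\nQuery: {query}\nRewrite:")
    return "\n".join(parts)

# The rewritten query is whatever text the evaluated LLM generates next.
prompt = build_prompt(
    context="User: Book a table at Nopa.\nAssistant: For how many people?",
    query="Make it four people.",
)
print(prompt)  # send `prompt` to the LLM under evaluation
```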
The study concludes that while larger models generally perform better, the impact of quantization on context understanding is mixed, and further research is needed to understand the limitations and potential of LLMs in handling complex linguistic contexts.
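To make the quantization finding concrete, the sketch below shows what mapping weights to 3 bits entails, using plain per-channel round-to-nearest quantization. This is an assumption for illustration only; the paper's post-training quantization method may be more sophisticated, and the function names here are hypothetical.

```python
import numpy as np

def quantize_3bit(w: np.ndarray):
    """Per-output-channel asymmetric round-to-nearest quantization to 3 bits
    (8 levels). Returns integer codes plus the scale/zero-point needed to
    dequantize. Illustrative only, not the paper's exact PTQ algorithm."""
    levels = 2 ** 3 - 1  # integer codes span 0..7
    w_min = w.min(axis=1, keepdims=True)
    w_max = w.max(axis=1, keepdims=True)
    scale = (w_max - w_min) / levels
    scale = np.where(scale == 0, 1.0, scale)  # guard constant rows
    zero_point = np.round(-w_min / scale)
    q = np.clip(np.round(w / scale + zero_point), 0, levels).astype(np.uint8)
    return q, scale, zero_point

def dequantize_3bit(q, scale, zero_point):
    """Map integer codes back to approximate floating-point weights."""
    return (q.astype(np.float32) - zero_point) * scale

# Example: quantization error introduced on a random weight matrix.
rng = np.random.default_rng(0)
w = rng.normal(size=(4, 16)).astype(np.float32)
q, s, z = quantize_3bit(w)
w_hat = dequantize_3bit(q, s, z)
print("mean abs error:", np.abs(w - w_hat).mean())
```

The rounding error introduced by such low-bit mappings is one plausible source of the task-dependent performance drops reported above, though the paper's results are what determine where the impact is marginal versus significant.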