LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression


12 Aug 2024 | Zhuoshi Pan, Qianhui Wu, Huiqiang Jiang, Menglin Xia, Xufang Luo, Jue Zhang, Qingwei Lin, Victor Rühle, Yuqing Yang, Chin-Yew Lin, H. Vicky Zhao, Lili Qiu, Dongmei Zhang
This paper focuses on task-agnostic prompt compression for better generalizability and efficiency. Existing approaches compress prompts by removing tokens or lexical units according to information entropy estimated by a causal language model such as LLaMA-7B. However, information entropy may be a suboptimal compression criterion: it only leverages unidirectional context and may fail to capture all the information essential to the prompt. To address this, we propose a data distillation procedure that derives knowledge from an LLM to compress prompts without losing crucial information, and we introduce an extractive text compression dataset. We formulate prompt compression as a token classification problem to ensure the compressed prompt remains faithful to the original, and we use a Transformer encoder so that every token's decision draws on the full bidirectional context. Our approach reduces latency by explicitly learning the compression objective with smaller models such as XLM-RoBERTa-large and mBERT. We evaluate our method on in-domain and out-of-domain datasets, including MeetingBank, LongBench, ZeroSCROLLS, GSM8K, and BBH. Despite its small size, our model shows significant performance gains over strong baselines and demonstrates robust generalization across different LLMs. Additionally, it is 3x-6x faster than existing prompt compression methods, while accelerating end-to-end latency by 1.6x-2.9x at compression ratios of 2x-5x.
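As a concrete illustration of the encoder-based token classifier mentioned above, the following is a minimal sketch (not the authors' released implementation), assuming the Hugging Face transformers library with an XLM-RoBERTa-large backbone and a 2-way token head; the checkpoint name and example sentence are illustrative assumptions, and label 1 is taken to mean "preserve".

```python
# Hedged sketch: a bidirectional encoder with a binary token-classification head.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_name = "xlm-roberta-large"  # mBERT ("bert-base-multilingual-cased") is the smaller alternative
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Two labels per token: 0 = discard, 1 = preserve in the compressed prompt.
model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=2)

text = "The committee approved the transit budget after a lengthy discussion in March."
enc = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**enc).logits                      # shape: (1, seq_len, 2)
p_preserve = torch.softmax(logits, dim=-1)[0, :, 1]   # per-token "preserve" probability
```

Note that a freshly initialized head yields uninformative probabilities; in the paper the classifier is first trained on the distilled MeetingBank labels, after which these probabilities drive the keep-or-drop decisions.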
We construct the extractive text compression dataset by distilling knowledge from an LLM (GPT-4) to compress texts without losing crucial information, and then assign a binary label to each token in the original text indicating whether it should be preserved or discarded after compression. Two quality-control metrics, variation rate and alignment gap, are used to filter out low-quality samples. We formulate prompt compression as a binary token classification problem so that the compressed prompt stays faithful to the original content, and we use a Transformer encoder as the feature extractor to leverage the bidirectional context of each token. The classification model is trained on the dataset constructed in Section 3 from MeetingBank. During inference, each token in the original prompt is preserved or discarded based on the probability computed by the classification model (a simplified sketch of this selection step is given below).

Our approach achieves significant performance gains on both in-domain and out-of-domain benchmarks, demonstrating the effectiveness of the constructed dataset and the importance of optimizing the compression model with prompt compression knowledge. The model is also more efficient, with lower latency and reduced GPU memory usage than existing methods. We further show that it retains the most informative words as the compression ratio increases, and that it can be integrated with other methods such as LongLLMLingua to preserve more key information relevant to the question.
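To make the inference-time selection concrete, here is a hedged sketch of one way to realize it: rank tokens by their predicted preserve probability and retain the top fraction in their original order. The compress helper, the toy token list, and the probabilities are all hypothetical; a real implementation would also map sub-word pieces back to whole words and handle special tokens before selection.

```python
def compress(tokens: list[str], p_preserve: list[float], keep_ratio: float = 0.6) -> str:
    """Keep the highest-probability tokens, restored to original order, to hit keep_ratio."""
    k = max(1, int(len(tokens) * keep_ratio))          # number of tokens to retain
    top = sorted(range(len(tokens)), key=lambda i: p_preserve[i], reverse=True)[:k]
    return " ".join(tokens[i] for i in sorted(top))    # preserve the original word order

# Toy example with made-up probabilities from a trained classifier.
tokens = ["The", "committee", "approved", "the", "budget", "in", "March"]
probs  = [0.20,  0.90,        0.80,       0.10,  0.95,     0.30, 0.85]
print(compress(tokens, probs, keep_ratio=0.6))         # -> "committee approved budget March"
```

The keep_ratio parameter corresponds to the target compression ratio (e.g., 0.5 for 2x compression, 0.2 for 5x), which is how the method trades prompt length against retained information.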