LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression


12 Aug 2024 | Zhuoshi Pan, Qianhui Wu, Huiqiang Jiang, Menglin Xia, Xufang Luo, Jue Zhang, Qingwei Lin, Victor Rühle, Yuqing Yang, Chin-Yew Lin, H. Vicky Zhao, Lili Qiu, Dongmei Zhang
This paper focuses on task-agnostic prompt compression for better generalizability and efficiency. Existing approaches compress prompts by removing tokens or lexical units according to information entropy estimated by a causal language model such as LLaMA-7B. However, information entropy may be a suboptimal compression criterion: it only leverages unidirectional context and may fail to capture all the information essential to the prompt. To address this, we propose a data distillation procedure that derives knowledge from an LLM to compress prompts without losing crucial information, and we introduce an extractive text compression dataset. We formulate prompt compression as a token classification problem to ensure the compressed prompt remains faithful to the original, and we use a Transformer encoder so that every token's decision draws on the full bidirectional context. Our approach reduces latency by explicitly learning the compression objective with smaller models such as XLM-RoBERTa-large and mBERT. We evaluate our method on in-domain and out-of-domain datasets, including MeetingBank, LongBench, ZeroSCROLLS, GSM8K, and BBH. Despite its small size, our model shows significant performance gains over strong baselines and demonstrates robust generalization across different LLMs. Additionally, it is 3x-6x faster than existing prompt compression methods, while accelerating end-to-end latency by 1.6x-2.9x at compression ratios of 2x-5x.
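As a concrete illustration of the encoder-based token classifier mentioned above, the following is a minimal sketch (not the authors' released implementation), assuming the Hugging Face transformers library with an XLM-RoBERTa-large backbone and a 2-way token head; the checkpoint name and example sentence are illustrative assumptions, and label 1 is taken to mean "preserve".

```python
# Hedged sketch: a bidirectional encoder with a binary token-classification head.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_name = "xlm-roberta-large"  # mBERT ("bert-base-multilingual-cased") is the smaller alternative
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Two labels per token: 0 = discard, 1 = preserve in the compressed prompt.
model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=2)

text = "The committee approved the transit budget after a lengthy discussion in March."
enc = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**enc).logits                      # shape: (1, seq_len, 2)
p_preserve = torch.softmax(logits, dim=-1)[0, :, 1]   # per-token "preserve" probability
```

Note that a freshly initialized head yields uninformative probabilities; in the paper the classifier is first trained on the distilled MeetingBank labels, after which these probabilities drive the keep-or-drop decisions.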
We construct the extractive text compression dataset by distilling knowledge from an LLM (GPT-4) to compress texts without losing crucial information, and then assign a binary label to each token in the original text indicating whether it should be preserved or discarded after compression. Two quality-control metrics, variation rate and alignment gap, are used to filter out low-quality samples. We formulate prompt compression as a binary token classification problem so that the compressed prompt stays faithful to the original content, and we use a Transformer encoder as the feature extractor to leverage the bidirectional context of each token. The classification model is trained on the dataset constructed in Section 3 from MeetingBank. During inference, each token in the original prompt is preserved or discarded based on the probability computed by the classification model (a simplified sketch of this selection step is given below).

Our approach achieves significant performance gains on both in-domain and out-of-domain benchmarks, demonstrating the effectiveness of the constructed dataset and the importance of optimizing the compression model with prompt compression knowledge. The model is also more efficient, with lower latency and reduced GPU memory usage than existing methods. We further show that it retains the most informative words as the compression ratio increases, and that it can be integrated with other methods such as LongLLMLingua to preserve more key information relevant to the question.
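To make the inference-time selection concrete, here is a hedged sketch of one way to realize it: rank tokens by their predicted preserve probability and retain the top fraction in their original order. The compress helper, the toy token list, and the probabilities are all hypothetical; a real implementation would also map sub-word pieces back to whole words and handle special tokens before selection.

```python
def compress(tokens: list[str], p_preserve: list[float], keep_ratio: float = 0.6) -> str:
    """Keep the highest-probability tokens, restored to original order, to hit keep_ratio."""
    k = max(1, int(len(tokens) * keep_ratio))          # number of tokens to retain
    top = sorted(range(len(tokens)), key=lambda i: p_preserve[i], reverse=True)[:k]
    return " ".join(tokens[i] for i in sorted(top))    # preserve the original word order

# Toy example with made-up probabilities from a trained classifier.
tokens = ["The", "committee", "approved", "the", "budget", "in", "March"]
probs  = [0.20,  0.90,        0.80,       0.10,  0.95,     0.30, 0.85]
print(compress(tokens, probs, keep_ratio=0.6))         # -> "committee approved budget March"
```

The keep_ratio parameter corresponds to the target compression ratio (e.g., 0.5 for 2x compression, 0.2 for 5x), which is how the method trades prompt length against retained information.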