CodecLM: Aligning Language Models with Tailored Synthetic Data


8 Apr 2024 | Zifeng Wang†, Chun-Liang Li†, Vincent Perot*, Long T. Le†, Jin Miao†, Zizhao Zhang†, Chen-Yu Lee†, Tomas Pfister†
**Google Cloud AI Research, Google Cloud AI, Google Research**

**Abstract:** Instruction tuning has emerged as a key approach to align large language models (LLMs) with specific task instructions, reducing the discrepancy between next-token prediction and user goals. To reduce the labor and time costs of human data collection and annotation, researchers are exploring the use of LLMs to generate instruction-aligned synthetic data. Recent works focus on generating diverse instructions and applying LLMs to increase instruction complexity, often neglecting downstream use cases. This paper introduces CodecLM, a general framework for adaptively generating high-quality synthetic data for LLM alignment with different downstream instruction distributions and LLMs. Inspired by the Encode-Decode principle, CodecLM uses LLMs as codecs to guide the data generation process: seed instructions are encoded into metadata, which is then decoded into tailored instructions. Self-Rubrics and Contrastive Filtering are introduced to craft tailored, data-efficient samples. Extensive experiments on four open-domain instruction-following benchmarks validate the effectiveness of CodecLM over state-of-the-art methods.

**Introduction:** Large language models (LLMs) have demonstrated remarkable capabilities across natural language processing (NLP) tasks. Instruction tuning has become crucial for aligning LLMs with specific task instructions, improving their performance on diverse tasks. However, acquiring high-quality data through human annotation remains costly and challenging. Recent work explores generating instruction-response pairs with LLMs, but these methods often lack task-specific alignment. CodecLM addresses this by systematically generating high-quality data tailored to different downstream tasks.
**CodecLM:** CodecLM follows an Encode-Decode process, using a strong LLM as both encoder and decoder. Seed instructions are encoded into metadata (e.g., the use case and skills an instruction requires) that captures the underlying instruction distribution. The metadata is then decoded into tailored instructions, refined by Self-Rubrics and Contrastive Filtering: Self-Rubrics increases instruction complexity using rubrics derived from the metadata, while Contrastive Filtering selects the most effective instruction-response pairs by comparing the strong and target LLMs' responses. Extensive experiments on four open-domain instruction-following benchmarks demonstrate the effectiveness of CodecLM.

**Related Work:** The paper reviews existing methods for instruction tuning and synthetic data generation, highlighting the limitations of current approaches. CodecLM aims to address these limitations by providing a unified framework for task-specific LLM alignment.

**Problem Statement:** The paper targets open-domain instruction following, where instructions vary in input format and task. The goal is to generate high-quality instruction-response pairs with a strong LLM and fine-tune the target LLM on these pairs.

**Experiments:** CodecLM is evaluated on multiple benchmarks, including Evol-Instruct, Vicuna, Self-Instruct, and Koala. It is compared against baseline methods, and ablation studies assess the contribution of each component.
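To make the Encode-Decode pipeline concrete, the following is a minimal, hypothetical sketch of the flow described above. It is not the paper's implementation: the `encode`, `decode`, and `self_rubrics` functions stand in for prompts to the strong LLM (the paper derives metadata and rubrics by prompting), and the quality score in `contrastive_filtering` stands in for the paper's LLM-based scoring.

```python
from dataclasses import dataclass, field

@dataclass
class Metadata:
    """Toy metadata: the paper encodes each seed instruction into a use case
    and the skills required to follow it."""
    use_case: str
    skills: list = field(default_factory=list)

def encode(seed_instruction: str) -> Metadata:
    """Stand-in for prompting the strong LLM to extract metadata;
    here a trivial keyword rule."""
    if "function" in seed_instruction or "sort" in seed_instruction:
        return Metadata("code generation", ["python", "algorithms"])
    return Metadata("open question answering", ["reasoning"])

def decode(meta: Metadata) -> str:
    """Stand-in for decoding metadata into a basic tailored instruction."""
    return (f"Complete a {meta.use_case} task that requires: "
            f"{', '.join(meta.skills)}.")

def self_rubrics(instruction: str, meta: Metadata, rounds: int = 2) -> str:
    """Stand-in for Self-Rubrics: iteratively complicate the instruction
    with metadata-derived constraints (the paper generates rubrics and
    improvement actions via the strong LLM)."""
    for skill in meta.skills[:rounds]:
        instruction += f" Include a constraint that exercises {skill}."
    return instruction

def contrastive_filtering(pairs, score, gap_threshold=1):
    """Keep pairs where the strong LLM's response beats the target LLM's
    by more than the threshold -- these are the most instructive samples."""
    kept = []
    for instruction, strong_resp, target_resp in pairs:
        if score(strong_resp) - score(target_resp) > gap_threshold:
            kept.append((instruction, strong_resp))
    return kept

# Toy end-to-end run with a length-based quality score as a placeholder.
seed = "Write a function to sort a list."
meta = encode(seed)
tailored = self_rubrics(decode(meta), meta)
toy_score = len
candidate_pairs = [(tailored, "a long, detailed strong-LLM answer", "short")]
training_data = contrastive_filtering(candidate_pairs, toy_score)
```

The design point the sketch illustrates is that filtering operates on the *gap* between the strong and target models, so the retained pairs concentrate on instructions the target model cannot yet handle.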