Benchmarking Large Language Models on Controllable Generation under Diversified Instructions

1 Jan 2024 | Yihan Chen, Benfeng Xu, Quan Wang, Yi Liu, Zhendong Mao
This paper introduces CoDI-Eval, a new benchmark for evaluating the controllable text generation (CTG) capabilities of large language models (LLMs). The benchmark is designed to systematically and comprehensively evaluate how well LLMs follow instructions with various constraints, building on a large collection of constraint-attributed instructions that emphasizes both generalization and coverage. It features a diverse set of tasks, including sentiment, topic, keyword, length, and toxicity-avoidance constraints, as well as a multi-aspect task that combines multiple attributes. The instructions are generated through a two-step process of expansion and diversification, ensuring a wide range of natural-language expressions. The benchmark also provides automated, easy-to-use evaluation methods for each task. The results show that top commercial LLMs such as GPT-4 and ChatGPT perform well on CTG tasks, but a significant gap remains between open-source and commercial LLMs. The benchmark aims to facilitate research into improving the controllability of LLMs' responses to instructions. The data and code are available at https://github.com/Xt-cyh/CoDI-Eval.
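
The summary mentions automated, easy-to-use evaluation methods but does not reproduce them. As a rough illustration only, the sketch below shows how objectively checkable constraints such as keyword inclusion and length could be verified programmatically; the function names, word-counting regex, and scoring loop are assumptions for illustration, not the CoDI-Eval implementation (attributes like sentiment, topic, and toxicity would typically require trained classifiers instead).

```python
import re
from typing import Iterable, Optional

def keyword_satisfied(response: str, required_keywords: Iterable[str]) -> bool:
    """True if every required keyword appears in the response (case-insensitive)."""
    lowered = response.lower()
    return all(kw.lower() in lowered for kw in required_keywords)

def length_satisfied(response: str,
                     min_words: Optional[int] = None,
                     max_words: Optional[int] = None) -> bool:
    """True if the response's word count falls inside the requested range."""
    n_words = len(re.findall(r"\b\w+\b", response))
    if min_words is not None and n_words < min_words:
        return False
    if max_words is not None and n_words > max_words:
        return False
    return True

# Hypothetical usage: accuracy = fraction of responses satisfying all constraints.
pairs = [
    ("Mention 'ocean' in at most 20 words.",
     "The ocean glittered quietly under a pale morning sun."),
]
hits = sum(
    keyword_satisfied(resp, ["ocean"]) and length_satisfied(resp, max_words=20)
    for _, resp in pairs
)
print(f"Constraint-satisfaction accuracy: {hits / len(pairs):.2%}")
```

Aggregating such per-instruction pass/fail checks into an accuracy score is one plausible way to obtain the kind of automated, task-level evaluation the benchmark describes.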