1 Jan 2024 | Yihan Chen, Benfeng Xu, Quan Wang, Yi Liu, Zhendong Mao*
The paper introduces CoDI-Eval, a new benchmark for evaluating the controllable text generation (CTG) capabilities of large language models (LLMs). CoDI-Eval addresses a gap in existing evaluation methods by covering diversified instructions expressed in natural language, enabling a more comprehensive assessment of LLMs' generalized CTG performance. The benchmark spans five typical CTG tasks: sentiment, topic, length, keyword, and toxicity avoidance, plus a multi-aspect task that combines two aspects for a more challenging evaluation. The evaluation process is automated and reliable, using task-specific metrics to assess both accuracy (whether a response satisfies its constraint) and the diversity of the instructions. Extensive experiments with representative LLMs (e.g., ChatGPT, Vicuna) reveal their limitations in following specific constraints and highlight a significant gap between open-source and commercial closed-source LLMs. The paper also analyzes instruction diversity, the reliability of the evaluation, and the reasons behind LLMs' performance on certain tasks. Overall, CoDI-Eval offers researchers a valuable tool for improving the controllability of LLMs' responses to instructions.
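To make the automated accuracy check concrete, here is a minimal sketch of how rule-checkable constraints such as length and keyword coverage can be verified programmatically. The function names and thresholds are illustrative assumptions, not CoDI-Eval's actual code; the paper's pipeline additionally relies on learned classifiers or external tools for subjective aspects like sentiment, topic, and toxicity.

```python
# Hypothetical constraint checkers for rule-verifiable CTG aspects.
# This is a minimal sketch, NOT the CoDI-Eval implementation.

def meets_length(response: str, min_words: int, max_words: int) -> bool:
    """Check a length constraint by word count."""
    n = len(response.split())
    return min_words <= n <= max_words

def meets_keywords(response: str, required: list[str]) -> bool:
    """Check that every required keyword appears (case-insensitive)."""
    text = response.lower()
    return all(kw.lower() in text for kw in required)

def accuracy(responses, checkers) -> float:
    """Fraction of responses that satisfy their paired constraint check."""
    hits = sum(1 for resp, check in zip(responses, checkers) if check(resp))
    return hits / len(responses)

# Example: two model outputs scored against their instructions' constraints.
outs = [
    "The quick brown fox jumps over the lazy dog.",
    "A short reply.",
]
checks = [
    lambda r: meets_keywords(r, ["fox", "dog"]),  # keyword task
    lambda r: meets_length(r, 5, 50),             # length task (fails: 3 words)
]
print(accuracy(outs, checks))  # 0.5
```

In this style of evaluation, each instruction is paired with its own checker, so a single accuracy number can be aggregated across heterogeneous constraint types.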