Suri: Multi-constraint Instruction Following for Long-form Text Generation

2 Oct 2024 | Chau Minh Pham, Simeng Sun, Mohit Iyyer
This paper introduces Suri, a new dataset for long-form text generation with multi-constraint instruction following. Suri contains 20,000 human-written long-form texts paired with LLM-generated backtranslated instructions that include multiple complex constraints. Because collecting human preference judgments on long-form texts is difficult, the authors propose Instructional ORPO (I-ORPO), an alignment method based on the ORPO algorithm: instead of receiving negative feedback from dispreferred responses, I-ORPO obtains it from synthetically corrupted instructions generated by an LLM.

Using Suri, the authors perform supervised and I-ORPO fine-tuning on Mistral-7B-Instruct-v0.2. The resulting models, Suri-SFT and Suri-I-ORPO, generate significantly longer texts (~5K tokens) than base models without significant quality deterioration. Human evaluation shows that while both models satisfy most constraints, Suri-I-ORPO generations are generally preferred for their coherent and informative incorporation of the constraints.

The dataset is created via instruction backtranslation: a human-written long-form text is fed to an LLM, which generates the instruction that could have been followed to produce that text. Each of the 20,000 examples consists of a backtranslated instruction, a corrupted instruction, and a human-written response; the instructions are complex, with multiple constraints, and the gold responses are lengthy (2-5K words, roughly 3-6K tokens). The authors validate the generated instructions through human evaluation and analyze the resulting dataset. Human and automated evaluations show that the fine-tuned models generate high-quality, long-form responses while effectively satisfying constraints, demonstrating that Suri improves the constraint-following capabilities of LLMs for long-form generation.
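The backtranslation-plus-corruption pipeline described above might be sketched roughly as follows. This is an illustrative sketch only: the `llm` callable and both prompt strings are assumptions, not the authors' actual prompts, and any chat-completion API could be swapped in.

```python
def backtranslate_instruction(text: str, llm) -> dict:
    """Sketch of instruction backtranslation for one example.

    Given a human-written long-form text, ask an LLM to write the
    multi-constraint instruction that could have produced it, then
    corrupt that instruction to build the negative context used by
    I-ORPO.  `llm` is a hypothetical callable (prompt -> string).
    """
    # Step 1: backtranslate the gold text into an instruction.
    instruction = llm(
        "Write an instruction with one main goal and several specific "
        "stylistic and content constraints that this text satisfies:\n\n"
        + text
    )
    # Step 2: minimally corrupt the constraints (not the main goal) so the
    # corrupted instruction is plausibly wrong but still on-topic.
    corrupted = llm(
        "Minimally alter each constraint in this instruction so the text "
        "no longer satisfies it, keeping the main goal intact:\n\n"
        + instruction
    )
    return {"instruction": instruction, "corrupted": corrupted, "response": text}
```

Each returned triple (instruction, corrupted instruction, human-written response) corresponds to one Suri example.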
The paper also discusses related work, ethical considerations, and limitations of the proposed method.
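Under the hood, I-ORPO keeps the ORPO objective but swaps the comparison: rather than scoring two responses against one instruction, it scores the same gold response under the original instruction (preferred context) and the corrupted instruction (dispreferred context). A minimal sketch of that loss, assuming length-normalized per-token log-probabilities and an illustrative λ weight (not the authors' hyperparameters):

```python
import math

def odds_from_logprob(avg_logp: float) -> float:
    """odds(y|x) = p / (1 - p), with p the length-normalized
    sequence probability exp(avg_logp)."""
    p = math.exp(avg_logp)
    return p / (1.0 - p)

def i_orpo_loss(logp_orig: float, logp_corrupt: float, lam: float = 0.1) -> float:
    """I-ORPO sketch: the SAME response y is scored under the original
    instruction and a corrupted one; the odds-ratio term pushes the
    model to prefer y given the original instruction.

    logp_orig, logp_corrupt: average per-token log-probs of y under
    each instruction (assumed < 0).  lam is an assumed weight.
    """
    nll = -logp_orig  # standard SFT term on the gold response
    log_odds_ratio = (
        math.log(odds_from_logprob(logp_orig))
        - math.log(odds_from_logprob(logp_corrupt))
    )
    # -log sigmoid(log odds ratio), as in ORPO's relative-ratio term
    ratio_term = -math.log(1.0 / (1.0 + math.exp(-log_odds_ratio)))
    return nll + lam * ratio_term
```

When the response is more likely under the original instruction than under the corrupted one, the odds-ratio term is small; when the model prefers the corrupted context, the penalty grows, steering it toward constraint-sensitive generation without needing preference pairs of long responses.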