SOTOPIA-π: Interactive Learning of Socially Intelligent Language Agents

25 Apr 2024 | Ruiyi Wang*, Haofei Yu*, Wenxin Zhang*, Zhengyang Qi*, Maarten Sap, Graham Neubig, Yonatan Bisk, Hao Zhu
This paper introduces SOTOPIA-π, an interactive learning method designed to enhance the social intelligence of language agents. The method applies behavior cloning and self-reinforcement training to social interaction data that has been filtered by ratings from a large language model (LLM) such as GPT-4. The goal is to improve the social goal completion ability of language agents while maintaining their general QA capabilities and enhancing safety.

**Key Contributions:**
1. **SOTOPIA-π Framework:** SOTOPIA-π generates new social tasks, collects interaction data from expert and agent policies, and updates the agent policy using only the interactions that GPT-4 rates positively (see the data-filtering sketch below).
2. **Performance Improvement:** The method significantly improves the social goal completion ability of a 7B LLM, approaching the performance of GPT-4.
3. **LLM Rating Limitations:** After training, the gap between GPT-4-based and human evaluation widens, highlighting the need for alternative evaluation models.
4. **Safety and Generalization:** SOTOPIA-π improves safety and reduces toxicity while preserving the models' general QA ability on the MMLU benchmark.

**Experimental Settings:**
- **Agent Models:** GPT-4 (expert policy) and Mistral-7B (agent policy being trained).
- **Training:** Efficient finetuning of quantized LLMs (QLoRA) with behavior cloning, self-reinforcement, and their combination (see the finetuning sketch below).
- **Evaluation:** Human and GPT-4 ratings on the SOTOPIA-EVAL benchmark.

**Results:**
- **Social Goal Completion:** The best model (Mistral-7B trained with behavior cloning followed by self-reinforcement) achieves a goal completion score of 5.71, nearly matching GPT-4's score of 5.89.
- **Safety and Generalization:** The trained models engage more, are safer, more persuasive, and less toxic, while maintaining their general QA ability.

**Future Work:**
- **Online Reinforcement Learning:** Investigate online methods such as PPO.
- **Learning from Humans:** Explore using human interaction data.
- **Safety Metrics:** Develop safety metrics that cover all social tasks.
- **Robust Evaluation:** Improve evaluation methods for social intelligence tasks.

**Limitations:**
- **LLM as Evaluator:** GPT-4 may introduce biases when evaluating social performance.
- **Social Biases:** The interactive system may carry social biases.
- **Ethical Considerations:** Responsible and unbiased AI development must be ensured.
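To make the framework description concrete, here is a minimal, hypothetical Python sketch of the data-collection step: social tasks are generated, episodes are rolled out by the expert policy (for behavior cloning) and by the agent policy itself (for self-reinforcement), and only episodes that the GPT-4 judge rates highly on goal completion are kept for finetuning. All function names, the 0-10 rating scale in the comment, and the threshold value are illustrative assumptions, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Episode:
    task: str
    agent_turns: List[str]  # utterances produced by the policy being trained


def collect_filtered_data(
    generate_task: Callable[[], str],        # e.g. GPT-4 proposes a new social scenario
    rollout: Callable[[str, str], Episode],  # plays out an episode: (task, data_source) -> Episode
    rate: Callable[[Episode], float],        # e.g. GPT-4 scores goal completion on a 0-10 scale
    n_tasks: int,
    threshold: float = 7.0,                  # illustrative cutoff for "positive" episodes
) -> List[Episode]:
    """Keep only episodes whose goal-completion rating clears the threshold."""
    kept: List[Episode] = []
    for _ in range(n_tasks):
        task = generate_task()
        # "expert": GPT-4 plays the agent role -> behavior cloning data.
        # "agent": the trained policy plays itself -> self-reinforcement data.
        for source in ("expert", "agent"):
            episode = rollout(task, source)
            if rate(episode) >= threshold:
                kept.append(episode)
    return kept
```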
**Experimental Settings:** - **Agent Models:** GPT-4 and Mistral-7B. - **Training:** Efficient finetuning on quantized LLMs (QLoRA) with behavior cloning, self-reinforcement, and their combination. - **Evaluation:** Human and GPT-4 ratings on the SOTOPIA-EVAL benchmark. **Results:** - **Social Goal Completion:** The best model (Mistral-7B with behavior cloning followed by self-reinforcement) achieves a goal completion score of 5.71, nearly matching GPT-4's score of 5.89. - **Safety and Generalization:** The trained models engage more, are safer, more persuasive, and less toxic, while maintaining their general QA ability. **Future Work:** - **Online Reinforcement Learning:** Investigate online methods like PPO. - **Learning from Humans:** Explore using human interaction data. - **Safety Metrics:** Develop metrics for all social tasks. - **Robust Evaluation:** Improve evaluation methods for social intelligence tasks. **Limitations:** - **LLM as Evaluator:** GPT-4 may introduce biases in evaluating social performance. - **Social Biases:** Potential social biases in the interactive system. - **Ethical Considerations:** Ensure responsible and unbiased AI development.
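The experimental settings mention efficient finetuning of quantized LLMs (QLoRA). Below is a minimal sketch of how such a setup is commonly configured with the Hugging Face transformers/peft/bitsandbytes stack; the checkpoint name and all LoRA hyperparameters (rank, alpha, dropout, target modules) are illustrative assumptions, not the values reported in the paper.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Load the base model with 4-bit quantization (QLoRA-style).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.1",  # illustrative checkpoint
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# Attach low-rank adapters; only these parameters are trained.
lora_config = LoraConfig(
    r=16,                     # illustrative rank
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```

With the adapters attached, training can then proceed with a standard causal-LM objective over the filtered positive interactions, for behavior cloning, self-reinforcement, or their combination.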