**SOTOPIA-π: Interactive Learning of Socially Intelligent Language Agents**
This paper introduces SOTOPIA-π, an interactive learning method designed to enhance the social intelligence of language agents. The method combines behavior cloning and self-reinforcement training on social interaction data that is filtered according to ratings from a large language model (LLM) such as GPT-4. The goal is to improve agents' social goal completion ability while preserving their general QA capabilities and improving safety.
**Key Contributions:**
1. **SOTOPIA-π Framework:** SOTOPIA-π generates new social tasks, collects interaction data from both expert and agent policies, and updates the agent policy on the subset of data rated positively by GPT-4 (see the sketch after this list).
2. **Performance Improvement:** The method significantly improves the social goal completion ability of a 7B LLM, approaching the performance of GPT-4.
3. **LLM Rating Limitations:** After training, the gap between GPT-4-based and human evaluations of goal completion widens, highlighting the need for more robust evaluation models.
4. **Safety and Generalization:** SOTOPIA-π enhances safety and reduces toxicity while preserving the general QA ability of the models on the MMLU benchmark.
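A minimal sketch of one SOTOPIA-π improvement round, as summarized above, is shown below. The helper callables (`generate_tasks`, `rollout`, `rate_goal_completion`, `finetune_on`) and the rating threshold are illustrative placeholders, not the authors' actual pipeline, which uses GPT-4 prompting for task generation and rating and QLoRA for the policy update.

```python
from typing import Callable, List, Tuple

# Hypothetical sketch of one SOTOPIA-pi improvement round; the concrete
# task generator, rollout, rater, and fine-tuning routine are passed in
# as callables because the paper implements them with GPT-4 prompting and QLoRA.

Episode = List[Tuple[str, str]]  # (speaker, utterance) turns of one social interaction

def improvement_round(
    generate_tasks: Callable[[int], List[str]],
    rollout: Callable[[str, str], Episode],          # (task, policy_name) -> episode
    rate_goal_completion: Callable[[Episode], float],
    finetune_on: Callable[[List[Episode]], None],
    n_tasks: int = 100,
    threshold: float = 7.0,                          # assumed "positive" cutoff, not from the paper
) -> None:
    tasks = generate_tasks(n_tasks)                  # 1) synthesize new social tasks
    positive: List[Episode] = []
    for task in tasks:
        for policy in ("expert", "agent"):           # 2) expert (behavior cloning) + agent (self-reinforcement)
            episode = rollout(task, policy)
            if rate_goal_completion(episode) >= threshold:  # 3) keep only positively rated episodes
                positive.append(episode)
    finetune_on(positive)                            # 4) update the agent policy on the filtered data
```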
**Experimental Settings:**
- **Agent Models:** GPT-4 (expert policy) and Mistral-7B (base model for the trained agent).
- **Training:** Efficient finetuning of quantized LLMs (QLoRA) with behavior cloning, self-reinforcement, and their combination (see the sketch after this list).
- **Evaluation:** Human and GPT-4 ratings on the SOTOPIA-EVAL benchmark.
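For the training setup, a representative QLoRA configuration (4-bit quantized base model plus LoRA adapters, trained with a standard causal-LM objective on the filtered interaction data) might look like the following. The checkpoint name, LoRA hyperparameters, and target modules are illustrative assumptions, not the values reported in the paper.

```python
# Illustrative QLoRA setup for supervised fine-tuning of a 7B agent model;
# hyperparameters and checkpoint name are placeholders, not the paper's values.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_name = "mistralai/Mistral-7B-Instruct-v0.1"  # assumed base checkpoint

# Load the base model in 4-bit precision (the "Q" in QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

# Attach low-rank adapters; only these small matrices are trained
lora_config = LoraConfig(
    r=16,                        # placeholder rank
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Train with a standard causal-LM loss on the filtered interaction turns,
# e.g. via transformers.Trainer or trl's SFTTrainer.
```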
**Results:**
- **Social Goal Completion:** The best model (Mistral-7B with behavior cloning followed by self-reinforcement) achieves a goal completion score of 5.71, nearly matching GPT-4's score of 5.89.
- **Safety and Generalization:** The trained models engage more, are safer and more persuasive, and produce less toxic content, while maintaining their general QA ability.
**Future Work:**
- **Online Reinforcement Learning:** Investigate online reinforcement learning methods such as PPO.
- **Learning from Humans:** Explore training on human interaction data.
- **Safety Metrics:** Develop safety metrics that cover all social tasks.
- **Robust Evaluation:** Improve evaluation methods for social intelligence tasks.
**Limitations:**
- **LLM as Evaluator:** GPT-4 may introduce biases in evaluating social performance.
- **Social Biases:** Potential social biases in the interactive system.
- **Ethical Considerations:** Ensure responsible and unbiased AI development.