3 Oct 2024 | Paula Dolores Rescala, Manoel Horta Ribeiro, Tiancheng Hu, Robert West
The paper investigates the persuasive capabilities of large language models (LLMs) by examining their ability to recognize convincing arguments and predict stances based on demographic and belief traits. The study uses a dataset from Durmus and Cardie (2018) that includes debates, votes, and user traits. The main research questions (RQ1, RQ2, RQ3) focus on whether LLMs can:
1. **Judge argument quality**: Distinguish between strong and weak arguments.
2. **Predict stances**: Predict individuals' stances on specific topics from their demographics and beliefs (a hedged prompt sketch follows this list).
3. **Determine argument appeal**: Assess how arguments appeal to individuals based on their demographics.
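For RQ2, the models are prompted with a user's self-reported traits and asked to predict that user's stance on a debate topic. Below is a minimal sketch of such a query, assuming the OpenAI Python SDK; the prompt wording, trait fields, and one-word answer format are illustrative assumptions, not the paper's actual protocol.

```python
# Minimal sketch of a stance-prediction query (assumes the OpenAI Python SDK).
# The system/user prompts and trait fields are hypothetical illustrations.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def predict_stance(topic: str, traits: dict[str, str]) -> str:
    """Ask the model to predict a Pro/Con stance from a user's traits."""
    profile = "; ".join(f"{k}: {v}" for k, v in traits.items())
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # near-deterministic output for a classification task
        messages=[
            {"role": "system",
             "content": ("You predict a person's stance on a debate topic. "
                         "Answer with exactly one word: Pro or Con.")},
            {"role": "user",
             "content": f"Person: {profile}\nTopic: {topic}\nStance:"},
        ],
    )
    return response.choices[0].message.content.strip()

# Hypothetical traits, modeled loosely on debate-site profile fields:
print(predict_stance(
    "Homework should be banned",
    {"political ideology": "Liberal", "education": "High school"},
))
```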
The study evaluates four LLMs (GPT-3.5, GPT-4, Llama 2, and Mistral 7B) on these tasks. Key findings include:
- **Argument Quality**: GPT-4 performs similarly to human voters in judging argument quality.
- **Stances Based on Demographics and Beliefs**: LLMs perform similarly to crowdworkers in predicting stances before and after reading debates.
- **Argument Appeal**: LLMs perform similarly to crowdworkers in recognizing users' opinions after reading debates.
- **Stacking LLMs**: Combining the predictions of the different LLMs significantly improves performance, surpassing human performance (a minimal stacking sketch follows this list).
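Stacking here means feeding the individual models' predictions into a simple meta-classifier that learns how much to trust each model. The sketch below uses a logistic-regression stacker over toy binary predictions, assuming scikit-learn; the feature encoding and the choice of meta-model are assumptions, not the paper's exact setup.

```python
# Minimal stacking sketch (assumes scikit-learn and NumPy). Toy binary
# predictions stand in for the four LLMs' real outputs; the logistic-
# regression meta-classifier is an assumption, not the paper's exact method.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
# One row per vote; columns = GPT-3.5, GPT-4, Llama 2, Mistral 7B predictions
# (1 = "argument A is more convincing", 0 = "argument B").
llm_preds = rng.integers(0, 2, size=(200, 4))
labels = rng.integers(0, 2, size=200)  # human majority votes (toy data)

# The stacker learns per-model weights and combines the four votes.
stacker = LogisticRegression()
scores = cross_val_score(stacker, llm_preds, labels, cv=5, scoring="accuracy")
print(f"stacked cross-validated accuracy: {scores.mean():.3f}")
```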
The research contributes to the understanding of LLMs' persuasive capabilities and highlights the potential risks of personalized misinformation and propaganda. The study also discusses limitations, ethical considerations, and the need for further research to expand the scope of evaluation.