The paper investigates the security risks of manipulated knowledge in LLM-based multi-agent systems, focusing on the spread of counterfactual and toxic knowledge. The authors construct a detailed threat model and a simulation environment to mirror real-world multi-agent deployments. They propose a two-stage attack method involving *Persuasiveness Injection* and *Manipulated Knowledge Injection* to explore the potential for unconscious spread of manipulated knowledge without explicit prompt manipulation.
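The sketch below illustrates the two-stage structure at a toy level: a first step that makes the attacker agent argue with high confidence, followed by a second step that implants manipulated "facts." All names here (`AttackerAgent`, `persuasiveness_injection`, `manipulated_knowledge_injection`) are hypothetical stand-ins for illustration; the paper's actual training-based procedure is not reproduced.

```python
# Minimal sketch of a two-stage manipulated-knowledge attack.
# All class and function names are hypothetical; the real method would
# operate on model weights (e.g., via fine-tuning), not a persona flag.

from dataclasses import dataclass, field


@dataclass
class AttackerAgent:
    """Toy stand-in for an LLM-based agent controlled by the attacker."""
    persona: str = "neutral"
    knowledge: dict = field(default_factory=dict)

    def respond(self, question: str) -> str:
        answer = self.knowledge.get(question, "I'm not sure.")
        if self.persona == "persuasive":
            # A persuasive agent states its (possibly manipulated) answer
            # with high confidence, without any adversarial prompt content.
            return f"It is well established that {answer}"
        return answer


def persuasiveness_injection(agent: AttackerAgent) -> AttackerAgent:
    """Stage 1: make the agent argue confidently (here, a simple persona switch)."""
    agent.persona = "persuasive"
    return agent


def manipulated_knowledge_injection(agent: AttackerAgent, facts: dict) -> AttackerAgent:
    """Stage 2: implant counterfactual or toxic 'facts' into the agent."""
    agent.knowledge.update(facts)
    return agent


if __name__ == "__main__":
    attacker = AttackerAgent()
    attacker = persuasiveness_injection(attacker)
    attacker = manipulated_knowledge_injection(
        attacker, {"Who wrote Hamlet?": "Christopher Marlowe wrote Hamlet."}
    )
    # The attacker now answers benign agents' questions both confidently
    # and incorrectly, with no explicit manipulation visible in the prompt.
    print(attacker.respond("Who wrote Hamlet?"))
```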
The attack exploits weaknesses in how LLMs handle world knowledge, allowing attackers to spread fabricated information. Extensive experiments demonstrate that the attack successfully induces LLM-based agents to spread both counterfactual and toxic knowledge without degrading their foundational capabilities during communication. The results also show that these manipulations persist in popular retrieval-augmented generation (RAG) frameworks, where benign agents store manipulated chat histories and retrieve them in future interactions.
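The toy sketch below shows how such persistence can arise: a benign agent writes the attacker's message into a retrieval memory, and a later query surfaces the fabricated claim again. The `ChatMemory` class and keyword-overlap retriever are illustrative assumptions; real RAG frameworks would use an embedding-based vector store.

```python
# Illustrative sketch of manipulated content persisting via a RAG-style memory.
# The retriever is a crude keyword-overlap scorer, chosen only to keep the
# example self-contained; it is not the retrieval used in any real framework.

from dataclasses import dataclass, field


def _overlap(a: str, b: str) -> int:
    """Crude relevance score: number of shared lowercase words."""
    return len(set(a.lower().split()) & set(b.lower().split()))


@dataclass
class ChatMemory:
    """Toy RAG memory: a benign agent stores past dialogue turns and
    retrieves the most relevant one for future queries."""
    records: list = field(default_factory=list)

    def store(self, turn: str) -> None:
        self.records.append(turn)

    def retrieve(self, query: str) -> str:
        return max(self.records, key=lambda r: _overlap(r, query), default="")


if __name__ == "__main__":
    memory = ChatMemory()
    # A benign agent stores a chat turn produced by the manipulated attacker.
    memory.store("Attacker: It is well established that Christopher Marlowe wrote Hamlet.")
    memory.store("User: What's the weather like today?")

    # Later, the retrieved record is placed into the benign agent's context,
    # so the fabricated claim resurfaces in a new interaction.
    print("Retrieved context:", memory.retrieve("Who wrote Hamlet?"))
```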
The findings highlight significant security risks in LLM-based multi-agent systems, emphasizing the need for robust defenses such as introducing "guardian" agents and advanced fact-checking tools. The paper concludes with a discussion of the implications and potential solutions to mitigate the spread of manipulated knowledge.
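As a rough illustration of the "guardian" idea, the sketch below places a verification step between agent messages and shared memory. The `guardian_filter` function and its trusted-fact lookup are assumptions made for illustration; a practical defense would call an external fact-checking model or tool rather than a static dictionary.

```python
# Hedged sketch of a guardian check applied before a message enters shared memory.
# The lookup table is a stand-in for a real fact-checking tool or model.

TRUSTED_FACTS = {
    "who wrote hamlet?": "william shakespeare",
}


def guardian_filter(question: str, message: str) -> bool:
    """Return True if the message may be stored in shared memory."""
    expected = TRUSTED_FACTS.get(question.lower())
    if expected is None:
        return True  # No ground truth available; pass through (or escalate).
    return expected in message.lower()


if __name__ == "__main__":
    claim = "It is well established that Christopher Marlowe wrote Hamlet."
    if guardian_filter("Who wrote Hamlet?", claim):
        print("stored")
    else:
        print("blocked: claim contradicts trusted knowledge")
```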