Understanding "Exploring LLM-based Agents for Root Cause Analysis"

This paper explores the use of LLM-based agents for root cause analysis (RCA) in cloud incident management. The growing complexity of cloud-based software systems has made incident management a critical part of the software development lifecycle. RCA is a demanding task for on-call engineers, requiring deep domain knowledge and extensive experience, so automating it can save time and reduce their burden. Recent research has applied LLMs to RCA, but these approaches cannot dynamically collect diagnostic information such as logs, metrics, or databases, which limits their effectiveness. This work explores LLM-based agents as a way to address this limitation. The authors present an empirical evaluation of a ReAct agent equipped with retrieval tools on an out-of-distribution dataset of production incidents. The results show that ReAct performs competitively with strong retrieval and reasoning baselines while achieving substantially higher factual accuracy. The authors then extend the evaluation by adding the discussions associated with incident reports as additional model inputs, which, surprisingly, does not yield significant performance improvements. Lastly, they conduct a case study with a team at Microsoft, equipping the ReAct agent with tools that give it access to the external diagnostic services the team uses for manual RCA. The results show how agents can overcome the limitations of prior work and highlight practical considerations for deploying such a system.
The paper discusses the challenges of cloud incident management, particularly RCA, which is one of the most labor- and skill-intensive components of the incident management lifecycle. RCA requires collecting new diagnostic data that is not present in the incident report, something LLMs on their own cannot do. The paper proposes LLM-based agents, which can reason, plan, and interact with the external environment to collect new information, as a way to address this limitation.
Despite the remarkable capabilities of LLM-based agents, adapting them for RCA presents significant challenges because incident data is confidential and out-of-distribution. The paper presents an empirical evaluation of an LLM-based agent, ReAct, for RCA in cloud incident management. The goal is to answer two questions: 1) Can LLM agents be effective at RCA in the absence of fine-tuning? and 2) What are the practical considerations of using LLM agents in real-world scenarios? The paper also presents a case study of a practical implementation of an LLM agent for RCA, fully equipped with team-specific diagnostic resources, in collaboration with another team at Microsoft. The results show both the potential of LLM-based agents and the challenges involved in implementing real-world systems capable of fully autonomous RCA.
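To make the idea of an agent that reasons and then calls a retrieval tool more concrete, here is a minimal sketch of a ReAct-style loop. It is not the authors' implementation: the incident data, the `search_incidents` tool, the `scripted_policy` stand-in for the LLM, and the `run_agent` function are all hypothetical names invented for illustration. A real system would replace the scripted policy with a call to a language model and the toy keyword search with a proper retriever.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical historical incidents as (title, root cause). Illustrative data only.
HISTORICAL_INCIDENTS: List[Tuple[str, str]] = [
    ("Mail delivery delayed in region A", "Expired certificate on relay hosts"),
    ("Login failures after deployment", "Misconfigured auth endpoint in new build"),
    ("Queue backlog growing on workers", "Disk full on worker nodes"),
]


def search_incidents(query: str) -> str:
    """Toy retrieval tool: return the past incident whose title shares the most words with the query."""
    query_words = set(query.lower().split())

    def overlap(item: Tuple[str, str]) -> int:
        return len(query_words & set(item[0].lower().split()))

    title, cause = max(HISTORICAL_INCIDENTS, key=overlap)
    return f"Similar incident: '{title}' (root cause: {cause})"


@dataclass
class Step:
    thought: str   # the agent's reasoning for this step
    action: str    # a tool name, or "finish" to stop
    argument: str  # tool input, or the final answer when action is "finish"


def scripted_policy(incident: str, history: List[str]) -> Step:
    """Stand-in for the LLM (an assumption): search once, then answer from the observation."""
    if not history:
        return Step("I should look for similar past incidents.", "search_incidents", incident)
    return Step("The retrieved incident suggests a likely cause.", "finish", history[-1])


def run_agent(
    incident: str,
    policy: Callable[[str, List[str]], Step],
    tools: Dict[str, Callable[[str], str]],
    max_steps: int = 5,
) -> str:
    """Alternate between reasoning (policy) and acting (tool calls) until the policy finishes."""
    history: List[str] = []
    for _ in range(max_steps):
        step = policy(incident, history)
        if step.action == "finish":
            return step.argument
        observation = tools[step.action](step.argument)
        history.append(observation)  # observations feed the next reasoning step
    return "No root cause identified within the step budget."


if __name__ == "__main__":
    tools = {"search_incidents": search_incidents}
    print(run_agent("Login failures reported after deployment", scripted_policy, tools))
```

The step budget (`max_steps`) is a design choice of this sketch: it keeps an agent that never decides to finish from looping indefinitely.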

Exploring LLM-based Agents for Root Cause Analysis

2024 | Devjeet Roy, Xuchao Zhang, Rashi Bhave, Chetan Bansal, Pedro Las-Casas, Rodrigo Fonseca, Saravan Rajmohan