7 Mar 2024 | Devjeet Roy, Xuchao Zhang, Rashi Bhave, Chetan Bansal, Pedro Las-Casas, Rodrigo Fonseca, Saravan Rajmohan
The paper explores the use of Large Language Models (LLMs) for Root Cause Analysis (RCA) in cloud-based software systems, aiming to automate the time-consuming and skill-intensive task of incident management. The authors present a thorough empirical evaluation of a ReACT agent equipped with retrieval tools on a dataset of production incidents from a large IT corporation. The results show that ReACT performs competitively with strong retrieval and reasoning baselines, but with significantly higher factual accuracy. The study also investigates the impact of incorporating discussion comments from historical incident reports, which surprisingly does not yield significant performance improvements. Additionally, a case study with a team at Microsoft demonstrates how LLM-based agents can overcome limitations by accessing external diagnostic services, highlighting practical considerations for implementing such systems in real-world scenarios. The research contributes to the understanding of the potential and challenges of using LLM-based agents for RCA, providing insights into their effectiveness and practical implementation.