This paper presents a study on leveraging X-lifecycle data (i.e., service architecture, dependencies, and functionalities) to improve the performance of two critical tasks in cloud incident management: (1) automatically generating root cause recommendations for dependency failure-related incidents, and (2) identifying the ontology of service monitors used for automatically detecting incidents. The study demonstrates that augmenting these tasks with contextual information from different stages of the software development lifecycle (SDLC) significantly improves their accuracy. The experiments were conducted on real-world incident and monitor datasets from Microsoft, with 353 incidents and 260 monitors used for evaluation.
For root cause analysis of dependency failures, the study proposes a method that incorporates upstream service dependencies and their properties into the prompt used to query large language models (LLMs), such as GPT-4. The results show that adding service properties and upstream dependency information on top of in-context examples significantly improves root cause recommendation accuracy compared to state-of-the-art methods.
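The prompt augmentation described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `Dependency` and `Example` records, the `build_rca_prompt` helper, and the prompt wording are hypothetical and are not taken from the study, which does not specify its prompt format here. The sketch only assembles the prompt text; the call to an LLM is left out.

```python
from dataclasses import dataclass, field


@dataclass
class Dependency:
    """An upstream service the affected service depends on (hypothetical schema)."""
    name: str
    properties: dict = field(default_factory=dict)


@dataclass
class Example:
    """A past incident with its known root cause, used as an in-context example."""
    summary: str
    root_cause: str


def build_rca_prompt(incident_summary, service, dependencies, examples):
    """Assemble a prompt: in-context examples first, then dependency context."""
    lines = ["You are assisting with root cause analysis of a cloud incident.", ""]

    # In-context examples: past incidents paired with their root causes.
    lines.append("Past incidents:")
    for i, ex in enumerate(examples, start=1):
        lines.append(f"{i}. Incident: {ex.summary}")
        lines.append(f"   Root cause: {ex.root_cause}")
    lines.append("")

    # Additional context: upstream dependencies and their properties.
    lines.append(f"Affected service: {service}")
    lines.append("Upstream dependencies:")
    for dep in dependencies:
        props = "; ".join(f"{k}={v}" for k, v in sorted(dep.properties.items()))
        lines.append(f"- {dep.name}" + (f" ({props})" if props else ""))
    lines.append("")

    lines.append(f"Current incident: {incident_summary}")
    lines.append("Recommend the most likely root cause.")
    return "\n".join(lines)
```

Keeping the dependency section separate from the in-context examples makes it straightforward to compare prompts with and without the extra context, which is the kind of comparison the reported results rely on.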
For monitor categorization, the study demonstrates that leveraging service and component functionalities improves classification accuracy when predicting the resource and SLO classes. The results show a consistent enhancement in SLO class predictions when additional context from service descriptions is incorporated.
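As a rough illustration of adding service context to monitor classification, the sketch below builds a classification prompt with an optional service description and parses a structured answer. The helper names, the answer format, and the class labels shown in the usage note are hypothetical; the study's actual label sets and prompt are not given in this text.

```python
def build_monitor_prompt(monitor_name, monitor_description,
                         slo_classes, resource_classes,
                         service_description=None):
    """Build a classification prompt; the service description is optional context."""
    parts = []
    if service_description:
        # Extra context describing what the owning service does.
        parts.append(f"Service description: {service_description}")
    parts.append(f"Monitor: {monitor_name}")
    parts.append(f"Monitor description: {monitor_description}")
    parts.append("SLO classes: " + ", ".join(slo_classes))
    parts.append("Resource classes: " + ", ".join(resource_classes))
    parts.append("Answer as 'slo=<class>; resource=<class>'.")
    return "\n".join(parts)


def parse_prediction(text, slo_classes, resource_classes):
    """Parse 'slo=X; resource=Y'; labels outside the allowed sets become None."""
    fields = {}
    for chunk in text.split(";"):
        key, sep, value = chunk.partition("=")
        if sep:
            fields[key.strip().lower()] = value.strip()
    slo = fields.get("slo")
    resource = fields.get("resource")
    return (slo if slo in slo_classes else None,
            resource if resource in resource_classes else None)
```

For example, with placeholder labels such as `["Availability", "Latency"]` for SLO classes and `["Compute", "Storage"]` for resource classes, calling `build_monitor_prompt` once with and once without `service_description` yields the two prompt variants whose predictions can then be compared.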
More broadly, the results show that additional contextual information, such as service descriptions, can significantly improve the accuracy of both root cause analysis and monitor categorization. The effect of additional service context is more prominent on SLO classes, as the overall description provides better context on the service level objectives.
The study concludes that leveraging X-lifecycle data can significantly improve the performance of incident management tasks, and that additional contextual information, such as service descriptions, is particularly useful for improving the accuracy of root cause analysis and monitor categorization. It also highlights the importance of using contextual information from different stages of the SDLC.