Automated Root Causing of Cloud Incidents using In-Context Learning with GPT-4


24 Jan 2024 | Xuchao Zhang, Supriyo Ghosh, Chetan Bansal, Rujia Wang, Minghua Ma, Yu Kang, Saravan Rajmohan
This paper presents an in-context learning approach for automated root cause analysis (RCA) of cloud incidents that eliminates the need for fine-tuning large language models (LLMs) such as GPT-4. The approach retrieves historical incident data as in-context examples, equipping the LLM with domain-specific knowledge so it can perform RCA without the high computational and maintenance costs of fine-tuning. The method was evaluated on 101,308 production incidents from CompanyX, comparing several LLMs across multiple metrics.
The results show that the in-context learning approach outperforms fine-tuned models such as GPT-3 by an average of 24.8% across all metrics, and improves on the zero-shot model by 49.7%. Human evaluation further confirms its superiority, with a 43.5% improvement in correctness and an 8.7% gain in readability. These results demonstrate the viability of using a vanilla GPT model for RCA, avoiding the high cost of fine-tuning. The study also addresses several research questions, including the effectiveness of in-context examples, the impact of example relevance, and the influence of example ordering. The findings highlight the potential of in-context learning for improving RCA in cloud environments, offering a cost-effective and scalable solution for incident diagnosis.
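To make the retrieval-based in-context approach concrete, the sketch below shows one way such a pipeline could be assembled: rank historical incidents by similarity to a new incident, then place the most relevant examples closest to the query in the prompt (the paper studies both relevance and ordering). This is an illustrative assumption, not the paper's implementation — the actual system likely uses semantic embeddings rather than the simple bag-of-words similarity used here, and the field names (`summary`, `root_cause`) are hypothetical.

```python
from collections import Counter
from math import sqrt

def cosine_sim(a: str, b: str) -> float:
    # Bag-of-words cosine similarity: a stand-in retriever for illustration.
    # A production system would likely use semantic embeddings instead.
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = sqrt(sum(v * v for v in va.values())) * sqrt(sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0

def build_rca_prompt(new_incident: str, history: list[dict], k: int = 2) -> str:
    # Keep the top-k most similar historical incidents as in-context examples.
    ranked = sorted(history, key=lambda h: cosine_sim(new_incident, h["summary"]),
                    reverse=True)[:k]
    # Place the most relevant example last (closest to the query), since
    # example ordering can affect in-context learning quality.
    parts = []
    for ex in reversed(ranked):
        parts.append(f"Incident: {ex['summary']}\nRoot cause: {ex['root_cause']}\n")
    parts.append(f"Incident: {new_incident}\nRoot cause:")
    return "\n".join(parts)

# Hypothetical historical incidents for demonstration only.
history = [
    {"summary": "API latency spike after config rollout",
     "root_cause": "Bad config pushed to gateway"},
    {"summary": "Storage nodes unreachable in region east",
     "root_cause": "Network ACL misconfiguration"},
    {"summary": "Login failures after certificate renewal",
     "root_cause": "Expired intermediate certificate"},
]
prompt = build_rca_prompt("Latency spike in API gateway after deployment", history, k=2)
print(prompt)
```

The resulting prompt ends with the new incident and an open "Root cause:" slot, ready to be sent to a vanilla (non-fine-tuned) GPT model for completion.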