Root Cause Analysis In Microservice Using Neural Granger Causal Discovery

2024 | Cheng-Ming Lin, Ching Chang, Wei-Yao Wang, Kuang-Da Wang, Wen-Chih Peng
This paper proposes RUN, a framework for root cause analysis in microservice systems that combines neural Granger causal discovery with contrastive learning. Root causes are hard to identify in microservices because the relationships between services are complex and dynamic, which makes it difficult for site reliability engineers to pinpoint the cause of a system failure. Traditional methods such as the PC algorithm fail to capture temporal dependencies in time series data, leading to inaccurate root cause identification. RUN addresses these limitations by integrating contextual information from time series data and using a time series forecasting model to construct a causal graph.

RUN first enhances the backbone encoder by maximizing the agreement among instances that share a timestamp but come from different contexts, which lets the encoder capture contextual information. It then applies neural Granger causal discovery to explore the causal relationships among variables, and refines the resulting causal graph by pruning spurious edges to form a directed acyclic graph (DAG). Finally, the diagnosis stage applies PageRank with a personalization vector to recommend the top-k root causes of the trigger point.

Extensive experiments on synthetic and real-world microservice-based datasets demonstrate that RUN significantly outperforms state-of-the-art root cause analysis methods. The framework is validated on the sock-shop dataset, where it achieves notable improvements in identifying root causes such as CPU hog and memory leak. The ablation study shows that the pre-training stage and contrastive learning are crucial for performance, while the inclusion of negative pairs does not significantly affect results. RUN's approach highlights the importance of temporal dependency in root cause analysis, particularly in microservice systems.
By leveraging temporal information through neural Granger causal discovery, RUN provides a more accurate and efficient method for identifying root causes in complex systems. The framework is publicly available, enabling further research and application in microservice-based environments.
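
To make the diagnosis step more concrete, the following is a minimal sketch of ranking root cause candidates with PageRank and a personalization vector. It is not RUN's implementation. The function `rank_root_causes`, the metric names, and the example edges are hypothetical. The sketch also assumes that the random walk runs from effects back to causes, that all restart mass sits on the trigger point, and that nodes with no known cause hold the walker through a self-loop; the summary above does not specify any of these choices.

```python
from collections import defaultdict


def rank_root_causes(causal_edges, trigger, damping=0.85, iters=200, top_k=3):
    """Rank candidate root causes for `trigger` with personalized PageRank."""
    # Reverse each (cause, effect) edge so the walker moves from effects to causes.
    successors = defaultdict(list)
    nodes = {trigger}
    for cause, effect in causal_edges:
        successors[effect].append(cause)
        nodes.update((cause, effect))
    nodes = sorted(nodes)

    # Assumption: a node with no known cause keeps the walker via a self-loop,
    # so probability mass accumulates at the most upstream nodes.
    for node in nodes:
        if not successors[node]:
            successors[node].append(node)

    # Personalization vector: all restart mass on the trigger point.
    restart = {node: 1.0 if node == trigger else 0.0 for node in nodes}
    rank = dict(restart)

    # Power iteration: restart with probability (1 - damping), otherwise follow an edge.
    for _ in range(iters):
        new_rank = {node: (1.0 - damping) * restart[node] for node in nodes}
        for node in nodes:
            share = damping * rank[node] / len(successors[node])
            for nxt in successors[node]:
                new_rank[nxt] += share
        rank = new_rank

    # The trigger itself is the symptom, so it is excluded from the candidates.
    candidates = [(n, s) for n, s in rank.items() if n != trigger]
    candidates.sort(key=lambda item: item[1], reverse=True)
    return candidates[:top_k], rank


# Hypothetical metrics and causal edges (cause -> effect), for illustration only.
edges = [
    ("carts_cpu", "carts_latency"),
    ("carts_latency", "frontend_latency"),
    ("catalogue_mem", "catalogue_latency"),
]
top, scores = rank_root_causes(edges, trigger="frontend_latency")
print(top)
```

In this toy graph the walk starts at `frontend_latency` and drifts upstream, so `carts_cpu` ranks first, while the `catalogue_*` metrics, which are not ancestors of the trigger, receive no score. Concentrating the personalization vector on the trigger is what ties the ranking to one specific anomaly instead of to the graph as a whole.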