| Wei Xu*, Ling Huang†, Armando Fox*, David Patterson*, Michael I. Jordan*
This paper presents a methodology for automatically detecting system runtime problems by mining console logs. The approach combines source code analysis with information retrieval to create composite features from logs, then uses machine learning to detect operational problems. The method enables analyses that are impossible with previous methods due to its superior ability to create sophisticated features. It also distills analysis results into an operator-friendly one-page decision tree showing critical messages associated with detected problems. The approach was validated using the Darkstar online game server and the Hadoop File System, where it detected numerous real problems with high accuracy and few false positives. In the Hadoop case, it analyzed 24 million lines of console logs in 3 minutes. The methodology works on any size of textual console logs without requiring changes to the service software, human input, or knowledge of the software's internals.
The approach involves four key contributions: 1) a technique for analyzing source code to recover the structure inherent in console logs; 2) identification of common information in logs—state variables and object identifiers—and automatic creation of features from logs; 3) a machine learning and information retrieval methodology that effectively detects unusual patterns or anomalies across large collections of such features; and 4) automatic construction of a visualization that distills the results of anomaly detection in a compact and operator-friendly format.
The method uses structured information such as message types and variables to automatically create features that capture information conveyed in logs. It is the first work extracting information at this fine level of granularity from console logs for problem detection. The method is able to accurately parse all possible log messages, even those rarely seen in actual logs. It also eliminates most heuristics and guesses for log parsing used by existing solutions.
The method is able to detect anomalies by analyzing the correlations among different types of log messages. It creates features that accurately capture various correlations among log messages and performs anomaly detection on these features. The method is able to detect anomalies in both the Darkstar online game server and the Hadoop File System. For Darkstar, the method accurately detects performance anomalies immediately after they happen and provides hints as to the root cause. For Hadoop, it detects runtime anomalies that are commonly overlooked. The method is able to distill over 24 million lines of console logs to a one-page decision tree that a domain expert can readily understand. The method is able to process logs in a scalable and efficient manner, using Hadoop map-reduce jobs on cloud computing. The method is able to achieve nearly linear speedup for a few dollars per run.This paper presents a methodology for automatically detecting system runtime problems by mining console logs. The approach combines source code analysis with information retrieval to create composite features from logs, then uses machine learning to detect operational problems. The method enables analyses that are impossible with previous methods due to its superior ability to create sophisticated features. It also distills analysis results into an operator-friendly one-page decision tree showing critical messages associated with detected problems. The approach was validated using the Darkstar online game server and the Hadoop File System, where it detected numerous real problems with high accuracy and few false positives. In the Hadoop case, it analyzed 24 million lines of console logs in 3 minutes. The methodology works on any size of textual console logs without requiring changes to the service software, human input, or knowledge of the software's internals.
The approach involves four key contributions: 1) a technique for analyzing source code to recover the structure inherent in console logs; 2) identification of common information in logs—state variables and object identifiers—and automatic creation of features from logs; 3) a machine learning and information retrieval methodology that effectively detects unusual patterns or anomalies across large collections of such features; and 4) automatic construction of a visualization that distills the results of anomaly detection in a compact and operator-friendly format.
The method uses structured information such as message types and variables to automatically create features that capture information conveyed in logs. It is the first work extracting information at this fine level of granularity from console logs for problem detection. The method is able to accurately parse all possible log messages, even those rarely seen in actual logs. It also eliminates most heuristics and guesses for log parsing used by existing solutions.
The method is able to detect anomalies by analyzing the correlations among different types of log messages. It creates features that accurately capture various correlations among log messages and performs anomaly detection on these features. The method is able to detect anomalies in both the Darkstar online game server and the Hadoop File System. For Darkstar, the method accurately detects performance anomalies immediately after they happen and provides hints as to the root cause. For Hadoop, it detects runtime anomalies that are commonly overlooked. The method is able to distill over 24 million lines of console logs to a one-page decision tree that a domain expert can readily understand. The method is able to process logs in a scalable and efficient manner, using Hadoop map-reduce jobs on cloud computing. The method is able to achieve nearly linear speedup for a few dollars per run.