26 Mar 2025 | Wei Zhang, Xiangyuan Guan, Lu Yunhong, Jie Zhang, Shuangyong Song, Xianfu Cheng, Zhenhe Wu, Zhoujun Li
The paper introduces LEMUR, a novel log parsing framework that combines entropy sampling and chain-of-thought merging to enhance the accuracy and efficiency of log analysis. Traditional log parsers rely on manual rules and statistical features, often failing to identify correct templates and overlooking semantic information. LEMUR addresses these issues by:
1. **Entropy Sampling**: A novel method inspired by information theory to efficiently cluster typical logs based on their informational content. This method divides large-scale data into clusters and uses efficient sampling and clustering algorithms to ensure robust performance in large-scale log scenarios.
2. **Template Generation**: Utilizes information entropy to determine the variables and templates in log messages. The information entropy of tokens at the same location is calculated, and positions with high entropy are identified as variables, while others are fixed components.
3. **Chain-of-Thought Merging**: A three-hop approach using large language models (LLMs) to merge semantically similar but structurally different log templates. This involves:
- **Structure QA**: Examining structural differences.
- **Semantic QA**: Probing semantic equivalences.
- **Solution QA**: Deciding on merging based on prior analyses.
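The entropy-based template generation in step 2 can be sketched as follows. This is a minimal illustration of the idea, not LEMUR's actual implementation: it assumes whitespace-tokenized logs of equal length within a cluster, and the entropy threshold (`1.0` here) is an arbitrary choice for the example.

```python
import math
from collections import Counter

def positional_entropy(tokens):
    """Shannon entropy (bits) of the token distribution at one position."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def extract_template(logs, threshold=1.0):
    """Mark high-entropy positions as variables (<*>), low-entropy ones as fixed.

    Assumes all logs in the cluster tokenize to the same length; the
    threshold value is illustrative, not taken from the paper.
    """
    columns = list(zip(*(log.split() for log in logs)))
    template = []
    for col in columns:
        if positional_entropy(col) > threshold:
            template.append("<*>")   # varying token -> variable slot
        else:
            template.append(col[0])  # stable token -> fixed component
    return " ".join(template)

logs = [
    "Connection from 10.0.0.1 closed",
    "Connection from 10.0.0.2 closed",
    "Connection from 10.0.0.3 closed",
]
print(extract_template(logs))  # Connection from <*> closed
```

The IP position has three distinct tokens (entropy ≈ 1.58 bits) and is marked as a variable, while the zero-entropy positions are kept verbatim as the fixed template text.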
Experiments on public large-scale log datasets demonstrate that LEMUR outperforms existing methods in terms of grouping accuracy and F1 score, achieving state-of-the-art performance and impressive efficiency. The framework is designed to be unsupervised, making it suitable for real-world applications where annotated data is scarce.
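The three-hop merging procedure described above can be sketched as a chain of prompts, each fed the answers from the previous hops. The prompt wording and the `llm` callable are hypothetical, assumed here for illustration; the paper's summary does not specify the actual prompts or model interface.

```python
def build_merge_prompts(template_a, template_b):
    """Three-hop chain-of-thought prompts for a template-merging decision.

    Prompt wording is illustrative only, not LEMUR's actual prompts.
    """
    context = f"Template A: {template_a}\nTemplate B: {template_b}\n"
    return {
        "structure_qa": context + "Q1: How do the two templates differ structurally?",
        "semantic_qa": context + "Q2: Do the two templates describe the same system event?",
        "solution_qa": context + "Q3: Given the structural and semantic analysis, "
                                 "should the templates be merged? Answer yes or no.",
    }

def should_merge(template_a, template_b, llm):
    """Run the three hops in order, carrying each answer into the next prompt.

    `llm` is any callable mapping a prompt string to an answer string
    (e.g. a wrapper around an LLM API client) -- a hypothetical interface.
    """
    prompts = build_merge_prompts(template_a, template_b)
    history = ""
    answer = ""
    for hop in ("structure_qa", "semantic_qa", "solution_qa"):
        answer = llm(history + prompts[hop])
        history += prompts[hop] + "\n" + answer + "\n"
    return answer.strip().lower().startswith("yes")
```

Chaining the hops, rather than asking a single merge question, lets the final decision condition on explicit structural and semantic analyses, which is the point of the three-hop design.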