Instruction-Tuned LLMs for Parsing and Mining Unstructured Logs on Leadership HPC Systems
Pith reviewed 2026-05-10 18:48 UTC · model grok-4.3
The pith
Fine-tuning an 8B language model on HPC log templates achieves parsing accuracy matching 70B-scale models
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By combining domain-specific HPC log templates with chain-of-thought instruction examples in a hybrid fine-tuning process, an 8B-parameter LLaMA model reaches parsing accuracy comparable to LLaMA 70B and Claude on diverse datasets from the LogHub repository, and the same model successfully structures over 600 million production logs from the Frontier supercomputer to reveal temporal dynamics, node-level anomalies, and workload-error correlations.
What carries the argument
Hybrid fine-tuning methodology that adapts a general-purpose LLM to HPC log data by mixing domain-specific templates with instruction-tuned chain-of-thought examples for local, privacy-preserving deployment.
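The hybrid training mix can be pictured as instruction-style records that pair a raw log line with its target template and a short chain-of-thought rationale. A minimal sketch follows; the field names, prompt wording, and sample log line are hypothetical illustrations, not the paper's actual schema:

```python
import json

# One hypothetical hybrid fine-tuning record: a raw HPC log line paired with
# its target template, plus a brief chain-of-thought rationale in the output.
# Field names and example content are illustrative, not the paper's schema.
def make_record(log_line: str, template: str, rationale: str) -> str:
    record = {
        "instruction": "Extract the event template from the HPC log line, "
                       "replacing variable fields with <*>.",
        "input": log_line,
        "output": f"Reasoning: {rationale}\nTemplate: {template}",
    }
    return json.dumps(record)

example = make_record(
    log_line="node f07n12: GPU 3 temperature 92C exceeds threshold",
    template="node <*>: GPU <*> temperature <*> exceeds threshold",
    rationale="Node ID, GPU index, and temperature vary across messages; "
              "the surrounding words are constant.",
)
```

Serializing each pair this way lets template-only examples and rationale-bearing instruction examples share one training format, which is the essence of the hybrid mix described above.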
If this is right
- Massive volumes of HPC telemetry can be parsed at scale using models small enough to run on-site without cloud access.
- Operational patterns such as node anomalies and workload-error links become extractable from production logs without manual review.
- Log analysis can remain energy-efficient and private because the adapted model stays local rather than relying on larger external services.
Where Pith is reading between the lines
- The same template-plus-instruction adaptation approach could be tested on unstructured logs from other large-scale systems such as cloud platforms or scientific instruments.
- Domain-specific fine-tuning of this kind might reduce the practical need for ever-larger general models when the task is narrowly defined.
- Repeating the Frontier-scale experiment on additional supercomputers would provide a direct test of whether the observed patterns generalize.
Load-bearing premise
That the combination of log templates and instruction examples used in training will let the small model generalize reliably to the full variety of inconsistent formats and edge cases found in real leadership-class HPC logs.
What would settle it
Applying the fine-tuned 8B model to logs from a second leadership-class system whose formats were never seen in training and measuring whether its parsing accuracy falls below that of the 70B baseline or misses documented anomalies.
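The paper's exact evaluation protocol is not given here, but log-parsing work of this kind is commonly scored with grouping accuracy: a message counts as correct only when the set of messages assigned to its predicted template exactly matches the set sharing its ground-truth template. A minimal sketch of that metric, under that assumption:

```python
from collections import defaultdict

def grouping_accuracy(truth: list[str], predicted: list[str]) -> float:
    """Fraction of messages correctly grouped: a message is correct only if
    its predicted template group equals its ground-truth template group."""
    def groups(labels):
        g = defaultdict(set)
        for idx, lab in enumerate(labels):
            g[lab].add(idx)
        return g

    t_groups, p_groups = groups(truth), groups(predicted)
    correct = sum(
        1 for idx, lab in enumerate(truth)
        if t_groups[lab] == p_groups[predicted[idx]]
    )
    return correct / len(truth)

# Toy example: the last two messages share a template, but the parser
# splits them into two spurious templates, so only 2 of 4 are correct.
truth     = ["T1", "T1", "T2", "T2"]
predicted = ["A",  "A",  "B",  "C"]
print(grouping_accuracy(truth, predicted))  # 0.5
```

Running this metric on held-out logs from a second system, with the 70B baseline scored identically, would make the proposed test concrete.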
Original abstract
Leadership-class HPC systems generate massive volumes of heterogeneous, largely unstructured system logs. Because these logs originate from diverse software, hardware, and runtime layers, they exhibit inconsistent formats, making structure extraction and pattern discovery extremely challenging. Therefore, robust log parsing and mining is critical to transform this raw telemetry into actionable insights that reveal operational patterns, diagnose anomalies, and enable reliable, efficient, and scalable system analysis. Recent advances in large language models (LLMs) offer a promising new direction for automated log understanding in leadership-class HPC environments. To capitalize on this opportunity, we present a domain-adapted, instruction-following, LLM-driven framework that leverages chain-of-thought (CoT) reasoning to parse and structure HPC logs with high fidelity. Our approach combines domain-specific log-template data with instruction-tuned examples to fine-tune an 8B-parameter LLaMA model tailored for HPC log analysis. We develop a hybrid fine-tuning methodology that adapts a general-purpose LLM to domain-specific log data, enabling a privacy-preserving, locally deployable, fast, and energy-efficient log-mining approach. We conduct experiments on a diverse set of log datasets from the LogHub repository. The evaluation confirms that our approach achieves parsing accuracy on par with significantly larger models, such as LLaMA 70B and Anthropic's Claude. We further validate the practical utility of our fine-tuned LLM model by parsing over 600 million production logs from the Frontier supercomputer over a four-week window, uncovering critical patterns in temporal dynamics, node-level anomalies, and workload-error log correlations.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a domain-adapted, instruction-tuned 8B LLaMA model using chain-of-thought reasoning for parsing unstructured HPC system logs. It claims this model matches the parsing accuracy of much larger models (LLaMA 70B, Claude) on LogHub datasets and demonstrates practical utility by processing 600 million real logs from the Frontier supercomputer to reveal patterns in temporal dynamics, node-level anomalies, and workload-error correlations.
Significance. If the quantitative claims hold with rigorous evaluation, the work would offer a privacy-preserving, locally deployable, and computationally efficient alternative to proprietary large LLMs for log analysis at leadership-class scale. The Frontier-scale experiment, if properly benchmarked, could provide valuable evidence for generalization in heterogeneous production environments.
major comments (2)
- [Abstract] The claim of achieving 'parsing accuracy on par with significantly larger models, such as LLaMA 70B and Anthropic's Claude' is stated without any quantitative metrics (e.g., F1 scores, template accuracy), baseline details, error bars, or evaluation protocol, preventing verification of the central accuracy result.
- [Abstract, production validation paragraph] The utility demonstration on 600 million Frontier logs reports only qualitative pattern discovery ('uncovering critical patterns in temporal dynamics, node-level anomalies, and workload-error log correlations') with no accuracy metrics, ground-truth comparison on a held-out subset, error rates, or baseline against Drain/IPLoM or the untuned 8B model; this leaves the generalization claim from LogHub to real heterogeneous HPC logs unsupported.
minor comments (1)
- [Abstract] The abstract would be strengthened by including at least one key quantitative result (e.g., F1 on LogHub) to allow readers to assess the 'on par' claim immediately.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. The comments highlight opportunities to strengthen the abstract's clarity and the presentation of our production-scale validation. We address each point below and indicate the revisions we will make.
Point-by-point responses
Referee: [Abstract] The claim of achieving 'parsing accuracy on par with significantly larger models, such as LLaMA 70B and Anthropic's Claude' is stated without any quantitative metrics (e.g., F1 scores, template accuracy), baseline details, error bars, or evaluation protocol, preventing verification of the central accuracy result.
Authors: We agree that the abstract, being a high-level summary, would benefit from including key quantitative results to support the central claim. The full manuscript reports these details in the Experiments section, including F1 scores, template accuracy, comparisons to LLaMA 70B and Claude, baselines, and evaluation protocols on LogHub datasets. We will revise the abstract to incorporate specific metrics (e.g., average F1 scores and a brief note on the evaluation setup) while maintaining conciseness. Revision: yes.
Referee: [Abstract, production validation paragraph] The utility demonstration on 600 million Frontier logs reports only qualitative pattern discovery ('uncovering critical patterns in temporal dynamics, node-level anomalies, and workload-error log correlations') with no accuracy metrics, ground-truth comparison on a held-out subset, error rates, or baseline against Drain/IPLoM or the untuned 8B model; this leaves the generalization claim from LogHub to real heterogeneous HPC logs unsupported.
Authors: We acknowledge that the abstract describes the Frontier experiment qualitatively. The manuscript details the scale of the analysis and the discovered patterns, but production logs are unlabeled, precluding direct ground-truth accuracy metrics or held-out comparisons for the full dataset. We will revise the abstract to clarify the validation methods employed (e.g., sampling for manual review, cross-referencing with known system events, and limited comparisons to parsers like Drain on subsets) and to note the generalization evidence from observed patterns. This addresses the concern without overstating what is possible with unlabeled data. Revision: partial.
- Direct quantitative accuracy metrics, ground-truth comparisons, or error rates for the full set of 600 million unlabeled Frontier production logs cannot be provided, as comprehensive labels do not exist for this real-world dataset.
Circularity Check
No significant circularity; empirical evaluation on benchmarks and production logs stands independently
Full rationale
The paper describes standard fine-tuning of an 8B LLaMA model on domain log templates plus instruction examples, followed by accuracy evaluation on LogHub datasets (compared to larger models) and separate application to 600M Frontier logs for pattern discovery. No equations, self-definitions, or fitted parameters are presented that reduce predictions to inputs by construction. No load-bearing self-citations or uniqueness theorems are invoked. The production-scale step is an independent deployment rather than a renamed fit, so the derivation chain remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Large language models can be effectively adapted to specialized technical domains such as HPC log parsing through instruction tuning and chain-of-thought examples.
Reference graph
Works this paper leans on
- [1] P. Prakash, R. P. Hong Enriquez, S. Serebryakov, D. Grant, W. Brewer, and D. Milojicic, "From exploration to explanation: ML-driven causal discovery for datacenter reliability at scale," in Proceedings of the SC'25 Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 997–1002, 2025.
- [2] B. H. Park, S. Hukerikar, R. Adamson, and C. Engelmann, "Big data meets HPC log analytics: Scalable approach to understanding systems at extreme scale," in 2017 IEEE International Conference on Cluster Computing (CLUSTER), pp. 758–765, IEEE, 2017.
- [3] P. He, J. Zhu, Z. Zheng, and M. R. Lyu, "Drain: An online log parsing approach with fixed depth tree," in 2017 IEEE International Conference on Web Services (ICWS), pp. 33–40, IEEE, 2017.
- [4] M. Du and F. Li, "Spell: Streaming parsing of system event logs," in 2016 IEEE 16th International Conference on Data Mining (ICDM), pp. 859–864, IEEE, 2016.
- [5] A. Zhong, D. Mo, G. Liu, J. Liu, Q. Lu, Q. Zhou, J. Wu, Q. Li, and Q. Wen, "LogParser-LLM: Advancing efficient log parsing with large language models," in Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 4559–4570, Association for Computing Machinery, 2024.
- [6] J. Xu, R. Yang, Y. Huo, C. Zhang, and P. He, "DivLog: Log parsing with prompt enhanced in-context learning," in Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, pp. 1–12, 2024.
- [7] P. Gupta, K. Bhukar, H. Kumar, S. Nagar, P. Mohapatra, and D. Kar, "LogAn: An LLM-based log analytics tool with causal inferencing," in ICPE '25: Proceedings of the 16th ACM/SPEC International Conference on Performance Engineering, pp. 54–56, Association for Computing Machinery, 2025.
- [8] S. Shan, Y. Huo, Y. Su, Y. Li, D. Li, and Z. Zheng, "Face it yourselves: An LLM-based two-stage strategy to localize configuration errors via logs," in Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA 2024), pp. 13–25, Association for Computing Machinery, 2024.
- [9] J. Wei, M. Bosma, V. Y. Zhao, K. Guu, A. W. Yu, B. Lester, N. Du, A. M. Dai, and Q. V. Le, "Finetuned language models are zero-shot learners," arXiv preprint arXiv:2109.01652, 2021.
- [10] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al., "Training language models to follow instructions with human feedback," Advances in Neural Information Processing Systems, vol. 35, pp. 27730–27744, 2022.
- [11] J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou, et al., "Chain-of-thought prompting elicits reasoning in large language models," Advances in Neural Information Processing Systems, vol. 35, pp. 24824–24837, 2022.
- [12] A. Khan, H. Sim, S. S. Vazhkudai, A. R. Butt, and Y. Kim, "An analysis of system balance and architectural trends based on Top500 supercomputers," in HPCAsia '21: The International Conference on High Performance Computing in Asia-Pacific Region, pp. 11–22, Association for Computing Machinery, 2021.
- [13] A. Khan, J. R. Lange, N. Hagerty, E. F. Posada, J. Holmen, J. B. White, A. Harris, V. M. Vergara, C. Zimmer, and S. Atchley, "An evaluation of the effect of network cost optimization for leadership class supercomputers," in SC24: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–16, 2024.
- [14] A. A. Makanju, A. N. Zincir-Heywood, and E. E. Milios, "Clustering event logs using iterative partitioning," in Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1255–1264, 2009.
- [15] H. Hamooni, B. Debnath, J. Xu, H. Zhang, G. Jiang, and A. Mueen, "LogMine: Fast pattern recognition for log analytics," in Proceedings of the 25th ACM International Conference on Information and Knowledge Management, pp. 1573–1582, 2016.
- [16] M. Mizutani, "Incremental mining of system log format," in 2013 IEEE International Conference on Services Computing, pp. 595–602, IEEE, 2013.
- [17] S. Messaoudi, A. Panichella, D. Bianculli, L. Briand, and R. Sasnauskas, "A search-based approach for accurate identification of log message formats," in Proceedings of the 26th Conference on Program Comprehension, pp. 167–177, 2018.
- [18] H. Guo, S. Yuan, and X. Wu, "LogBERT: Log anomaly detection via BERT," in 2021 International Joint Conference on Neural Networks (IJCNN), pp. 1–8, IEEE, 2021.
- [19] M. Du, F. Li, G. Zheng, and V. Srikumar, "DeepLog: Anomaly detection and diagnosis from system logs through deep learning," in Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, pp. 1285–1298, 2017.
- [20] R. Taori, I. Gulrajani, T. Zhang, Y. Dubois, X. Li, C. Guestrin, P. Liang, and T. B. Hashimoto, "Alpaca: A strong, replicable instruction-following model," Stanford Center for Research on Foundation Models, https://crfm.stanford.edu/2023/03/13/alpaca.html, 2023.
- [21] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., "Language models are few-shot learners," Advances in Neural Information Processing Systems, vol. 33, pp. 1877–1901, 2020.
- [22] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al., "LLaMA: Open and efficient foundation language models," arXiv preprint arXiv:2302.13971, 2023.
- [23] E. Almazrouei, H. Alobeidli, A. Alshamsi, A. Cappelli, R. Cojocaru, M. Debbah, É. Goffinet, D. Hesslow, J. Launay, Q. Malartic, et al., "The Falcon series of open language models," arXiv preprint arXiv:2311.16867, 2023.
- [24] Z. Ma, A. R. Chen, D. J. Kim, T.-H. Chen, and S. Wang, "LLMParser: An exploratory study on using large language models for log parsing," in Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, pp. 1–13, 2024.
- [25] S. Atchley, C. Zimmer, J. Lange, D. Bernholdt, V. Melesse Vergara, T. Beck, M. Brim, R. Budiardja, S. Chandrasekaran, M. Eisenbach, et al., "Frontier: Exploring exascale," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–16, 2023.
- [26] N. Jain, T. Zhang, W.-L. Chiang, J. E. Gonzalez, K. Sen, and I. Stoica, "LLM-assisted code cleaning for training accurate code generators," arXiv preprint arXiv:2311.14904, 2023.
- [27] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen, et al., "LoRA: Low-rank adaptation of large language models," ICLR, 2022.
- [28] Anthropic AI, "Claude 3.5 Sonnet model card and technical overview," https://www.anthropic.com/news/claude-3-5-sonnet, June 2024. Accessed: 2025-08-14.
- [29] Meta AI, "Meta Llama 3 70B model card," https://huggingface.co/meta-llama/Meta-Llama-3-70B, 2024. Accessed: 2025-08-14.
- [30] Meta AI, "Meta Llama 3 8B model card," https://huggingface.co/meta-llama/Llama-3.1-8B, 2024. Accessed: 2025-08-14.
- [31] J. Zhu, S. He, P. He, J. Liu, and M. R. Lyu, "Loghub: A large collection of system log datasets for AI-driven log analytics," in 2023 IEEE 34th International Symposium on Software Reliability Engineering (ISSRE), pp. 355–366, IEEE, 2023.
- [32] Y.-C. Chen, "A tutorial on kernel density estimation and recent advances," Biostatistics & Epidemiology, vol. 1, no. 1, pp. 161–187, 2017.
- [33] J. H. Ward Jr, "Hierarchical grouping to optimize an objective function," Journal of the American Statistical Association, vol. 58, no. 301, pp. 236–244, 1963.
- [34] V. I. Levenshtein, "Binary codes capable of correcting deletions, insertions and reversals," Soviet Physics Doklady, vol. 10, no. 8, pp. 707–710, 1966.
- [35] A. Ali and S. Renals, "Word error rate estimation for speech recognition: e-WER," in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 20–24, Association for Computational Linguistics, 2018.