pith. machine review for the scientific record.

arxiv: 2604.05168 · v1 · submitted 2026-04-06 · 💻 cs.AI

Recognition: no theorem link

Instruction-Tuned LLMs for Parsing and Mining Unstructured Logs on Leadership HPC Systems

Pith reviewed 2026-05-10 18:48 UTC · model grok-4.3

classification 💻 cs.AI
keywords log parsing · instruction tuning · large language models · HPC systems · system logs · anomaly detection · chain of thought · Frontier supercomputer

The pith

Fine-tuning an 8B language model on HPC log templates achieves parsing accuracy matching 70B-scale models

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to adapt a relatively small language model to extract structure from the inconsistent, multi-source logs generated by leadership-class supercomputers. It does this by combining real log templates with instruction examples that guide step-by-step reasoning, then fine-tunes an 8B LLaMA model on that material. The resulting model matches the accuracy of much larger general models on standard log-parsing benchmarks. When run on more than 600 million actual logs from the Frontier supercomputer, it surfaces patterns in timing, node failures, and workload errors. This line of work matters because HPC facilities produce far more raw telemetry than humans can review, so reliable automated parsing turns that data into usable operational knowledge while keeping the model small enough to run locally.
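
To make the parsing task concrete, here is a minimal Python sketch of what "extracting structure" means: variable fields in a raw line are replaced by a placeholder, so lines that share a pattern collapse to one template. This is a rule-based illustration only, not the paper's LLM-based method. The masking rules, the `<*>` placeholder, and the sample log lines are all assumptions invented for the example.

```python
import re
from collections import Counter

# Hypothetical masking rules, applied in order. Real HPC logs would need
# far richer handling than these three patterns.
MASKS = [
    re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"),  # IPv4 addresses
    re.compile(r"\b0x[0-9a-fA-F]+\b"),           # hexadecimal values
    re.compile(r"\b\d+\b"),                      # plain integers
]

def to_template(line: str) -> str:
    """Replace variable-looking fields with the placeholder <*>."""
    for pattern in MASKS:
        line = pattern.sub("<*>", line)
    return line

# Invented sample lines, not taken from LogHub or Frontier.
logs = [
    "node 1042 link down on port 3",
    "node 77 link down on port 12",
    "memory error at 0x7ffe12 on node 9",
]

# Count how many raw lines fall under each template.
templates = Counter(to_template(line) for line in logs)
```

The first two lines collapse into the single template `node <*> link down on port <*>`. Hand-written rules like these are brittle across inconsistent formats, which is the gap the paper's fine-tuned model is meant to close.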

Core claim

By combining domain-specific HPC log templates with chain-of-thought instruction examples in a hybrid fine-tuning process, an 8B-parameter LLaMA model reaches parsing accuracy comparable to LLaMA 70B and Claude on diverse datasets from the LogHub repository, and the same model successfully structures over 600 million production logs from the Frontier supercomputer to reveal temporal dynamics, node-level anomalies, and workload-error correlations.
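
One way to picture the "workload-error correlations" mentioned above is a cross-tabulation of parsed error categories against the science domain of the job that produced them, normalized by how much compute each domain used. The Python sketch below is a hedged illustration under assumed inputs: the job-to-domain mapping, the `(job_id, category)` event pairs, and the per-domain node-hour totals are hypothetical structures, and the paper may compute its correlations differently.

```python
from collections import Counter, defaultdict

def error_rates_by_domain(job_domain, error_events, node_hours):
    """Return errors per node-hour for each (domain, category) pair.

    job_domain:   dict mapping job_id -> science domain (assumed input)
    error_events: iterable of (job_id, category) pairs from parsed logs
    node_hours:   dict mapping domain -> total node-hours (assumed input)
    """
    counts = defaultdict(Counter)
    for job_id, category in error_events:
        domain = job_domain.get(job_id)
        if domain is not None:  # skip events that match no known job
            counts[domain][category] += 1
    return {
        domain: {cat: n / node_hours[domain] for cat, n in cats.items()}
        for domain, cats in counts.items()
    }

# Invented toy data for illustration.
job_domain = {"j1": "climate", "j2": "materials"}
events = [("j1", "interconnect"), ("j1", "interconnect"),
          ("j2", "memory"), ("j9", "memory")]
rates = error_rates_by_domain(job_domain, events,
                              {"climate": 4.0, "materials": 2.0})
```

Normalizing by node-hours keeps a heavily used domain from looking error-prone simply because it ran more work.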

What carries the argument

Hybrid fine-tuning methodology that adapts a general-purpose LLM to HPC log data by mixing domain-specific templates with instruction-tuned chain-of-thought examples for local, privacy-preserving deployment.
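
As a rough picture of what "mixing" the two kinds of training material could look like, the Python sketch below assembles one shuffled instruction-tuning set from plain template pairs and chain-of-thought examples. The record fields (`instruction`, `input`, `output`), the prompt wording, and the sample logs are assumptions in the style of common instruction-tuning datasets; the paper's actual data format is not given in this review.

```python
import json
import random

# Hypothetical prompt text shared by both record types.
INSTRUCTION = ("Extract the template of this log line, "
               "replacing variable fields with <*>.")

def template_record(raw, template):
    """Plain domain example: raw line in, template out."""
    return {"instruction": INSTRUCTION, "input": raw, "output": template}

def cot_record(raw, template, steps):
    """Example whose target spells out reasoning steps before the answer."""
    reasoning = " ".join(f"Step {i}: {s}" for i, s in enumerate(steps, 1))
    return {"instruction": INSTRUCTION, "input": raw,
            "output": f"{reasoning} Template: {template}"}

def build_mix(template_examples, cot_examples, seed=0):
    """Combine both sources into one shuffled training set."""
    records = [template_record(r, t) for r, t in template_examples]
    records += [cot_record(r, t, s) for r, t, s in cot_examples]
    random.Random(seed).shuffle(records)
    return records

mix = build_mix(
    [("node 77 link down on port 12", "node <*> link down on port <*>")],
    [("job 991 exceeded memory limit", "job <*> exceeded memory limit",
      ["'991' is a job identifier that changes per line.",
       "The remaining words are constant."])],
)
jsonl = "\n".join(json.dumps(record) for record in mix)  # one record per line
```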

If this is right

  • Massive volumes of HPC telemetry can be parsed at scale using models small enough to run on-site without cloud access.
  • Operational patterns such as node anomalies and workload-error links become extractable from production logs without manual review.
  • Log analysis can remain energy-efficient and private because the adapted model stays local rather than relying on larger external services.
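
Once logs are structured, node-level anomalies of the kind described above can be surfaced with fairly simple statistics. The Python sketch below flags nodes whose event count sits far above the fleet median, using the median absolute deviation as a robust scale. It is an illustration under assumptions: the `(node_id, template)` input shape and the threshold of 5 are invented, and the paper does not state that it uses this method.

```python
from collections import Counter
from statistics import median

def flag_nodes(events, threshold=5.0):
    """Return nodes whose event count is far above the fleet median.

    events: iterable of (node_id, template) pairs produced by a parser.
    The threshold is an arbitrary illustrative choice.
    """
    counts = Counter(node for node, _ in events)
    values = list(counts.values())
    med = median(values)
    # Median absolute deviation; fall back to 1.0 when all counts agree.
    mad = median(abs(v - med) for v in values) or 1.0
    return sorted(n for n, c in counts.items() if (c - med) / mad > threshold)

# Invented toy data: four quiet nodes and one noisy node.
events = [(n, "link down") for n in ("a", "b", "c", "d") for _ in range(2)]
events += [("e", "link down")] * 50
noisy = flag_nodes(events)
```

A median-based scale is used here because a single failing node can emit enough lines to distort a mean and standard deviation.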

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same template-plus-instruction adaptation approach could be tested on unstructured logs from other large-scale systems such as cloud platforms or scientific instruments.
  • Domain-specific fine-tuning of this kind might reduce the practical need for ever-larger general models when the task is narrowly defined.
  • Repeating the Frontier-scale experiment on additional supercomputers would provide a direct test of whether the observed patterns generalize.

Load-bearing premise

That the combination of log templates and instruction examples used in training will let the small model generalize reliably to the full variety of inconsistent formats and edge cases found in real leadership-class HPC logs.

What would settle it

Applying the fine-tuned 8B model to logs from a second leadership-class system whose formats were never seen in training and measuring whether its parsing accuracy falls below that of the 70B baseline or misses documented anomalies.
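
The comparison described above reduces to scoring two parsers against the same labeled held-out lines. The Python sketch below uses exact template match as the score, which is one common choice but not necessarily the paper's metric; the function names, the optional margin, and the sample templates are assumptions for illustration.

```python
def normalize(template: str) -> str:
    """Collapse whitespace so spacing differences do not count as errors."""
    return " ".join(template.split())

def template_accuracy(predicted, gold):
    """Fraction of lines whose predicted template exactly matches the label."""
    if len(predicted) != len(gold) or not gold:
        raise ValueError("need equal-length, non-empty template lists")
    hits = sum(normalize(p) == normalize(g) for p, g in zip(predicted, gold))
    return hits / len(gold)

def falls_below(candidate_acc, baseline_acc, margin=0.0):
    """True if the candidate trails the baseline by more than the margin."""
    return candidate_acc < baseline_acc - margin

# Invented labels and predictions for a hypothetical second system.
gold = ["node <*> down", "job <*> started", "job <*> ended", "fan <*> fault"]
small_model = ["node <*> down", "job <*> started", "job <*> ended", "fan 3 fault"]
large_model = list(gold)

small_acc = template_accuracy(small_model, gold)
large_acc = template_accuracy(large_model, gold)
```

In this toy case the smaller model misses one template, so `falls_below(small_acc, large_acc)` is true; a real test would need enough labeled lines for the gap to be meaningful.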

Figures

Figures reproduced from arXiv: 2604.05168 by Ahmad Maroof Karimi, Awais Khan, Charles Qing Cao, Jong Youl Choi.

Figure 1: Sources of logs in leadership-class HPC systems. The ...
Figure 3: An illustration of log parsing task. Given a set of raw log ...
Figure 4: An overview of our proposed hybrid approach.
Figure 5: Inference workflow of the proposed instruction-following ...
Figure 6: Simplified example prompt with instructions for log template ...
Figure 6: By combining representative examples, explicit ...
Figure 9: Evaluation on sampled Frontier system logs. The base models ...
Figure 8: Full experimental results for the fine-tuned LLaMA 8B model ...
Figure 12: The distribution of Frontier compute node-hour over the four ...
Figure 13: System logs showing cascading events. The timeline ...
Figure 14: Correlation of scientific domain workload with system ...
Figure 16: 28 log categories and 30 science domains are reordered using ...
Figure 15: Kernel density estimation [32] of interconnect error dis...
Original abstract

Leadership-class HPC systems generate massive volumes of heterogeneous, largely unstructured system logs. Because these logs originate from diverse software, hardware, and runtime layers, they exhibit inconsistent formats, making structure extraction and pattern discovery extremely challenging. Therefore, robust log parsing and mining is critical to transform this raw telemetry into actionable insights that reveal operational patterns, diagnose anomalies, and enable reliable, efficient, and scalable system analysis. Recent advances in large language models (LLMs) offer a promising new direction for automated log understanding in leadership-class HPC environments. To capitalize on this opportunity, we present a domain-adapted, instruction-following, LLM-driven framework that leverages chain-of-thought (CoT) reasoning to parse and structure HPC logs with high fidelity. Our approach combines domain-specific log-template data with instruction-tuned examples to fine-tune an 8B-parameter LLaMA model tailored for HPC log analysis. We develop a hybrid fine-tuning methodology that adapts a general-purpose LLM to domain-specific log data, enabling a privacy-preserving, locally deployable, fast, and energy-efficient log-mining approach. We conduct experiments on a diverse set of log datasets from the LogHub repository. The evaluation confirms that our approach achieves parsing accuracy on par with significantly larger models, such as LLaMA 70B and Anthropic's Claude. We further validate the practical utility of our fine-tuned model by parsing over 600 million production logs from the Frontier supercomputer over a four-week window, uncovering critical patterns in temporal dynamics, node-level anomalies, and workload-error log correlations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces a domain-adapted, instruction-tuned 8B LLaMA model that uses chain-of-thought reasoning to parse unstructured HPC system logs. It claims this model matches the parsing accuracy of much larger models (LLaMA 70B, Claude) on LogHub datasets and demonstrates practical utility by processing more than 600 million real logs from the Frontier supercomputer to reveal patterns in temporal dynamics, node-level anomalies, and workload-error correlations.

Significance. If the quantitative claims hold with rigorous evaluation, the work would offer a privacy-preserving, locally deployable, and computationally efficient alternative to proprietary large LLMs for log analysis at leadership-class scale. The Frontier-scale experiment, if properly benchmarked, could provide valuable evidence for generalization in heterogeneous production environments.

major comments (2)
  1. [Abstract] The claim of achieving 'parsing accuracy on par with significantly larger models, such as LLaMA 70B and Anthropic's Claude' is stated without any quantitative metrics (e.g., F1 scores, template accuracy), baseline details, error bars, or evaluation protocol, preventing verification of the central accuracy result.
  2. [Abstract, production validation paragraph] The utility demonstration on 600 million Frontier logs reports only qualitative pattern discovery ('uncovering critical patterns in temporal dynamics, node-level anomalies, and workload-error log correlations') with no accuracy metrics, ground-truth comparison on a held-out subset, error rates, or baseline against Drain/IPLoM or the untuned 8B model; this leaves the generalization claim from LogHub to real heterogeneous HPC logs unsupported.
minor comments (1)
  1. [Abstract] The abstract would be strengthened by including at least one key quantitative result (e.g., F1 on LogHub) to allow readers to assess the 'on par' claim immediately.
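
The error bars requested in the first major comment could be produced with a percentile bootstrap over per-line correctness, as in the Python sketch below. This is one standard way to attach an interval to an accuracy figure, offered as an illustration; the resample count, the 95% level, and the toy correctness vector are assumptions, and nothing here reflects results reported in the paper.

```python
import random

def bootstrap_ci(correct, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for accuracy.

    correct: list of 0/1 flags, one per evaluated log line.
    Returns (point_estimate, lower_bound, upper_bound).
    """
    rng = random.Random(seed)
    n = len(correct)
    # Resample the flags with replacement and record each resample's accuracy.
    means = sorted(
        sum(rng.choices(correct, k=n)) / n for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(correct) / n, lower, upper

# Invented example: 90 of 100 lines parsed correctly.
flags = [1] * 90 + [0] * 10
point, low, high = bootstrap_ci(flags)
```

Reporting the interval alongside the point estimate would let readers judge whether "on par" differences between the 8B and 70B models exceed sampling noise.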

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight opportunities to strengthen the abstract's clarity and the presentation of our production-scale validation. We address each point below and indicate the revisions we will make.

Point-by-point responses
  1. Referee: [Abstract] The claim of achieving 'parsing accuracy on par with significantly larger models, such as LLaMA 70B and Anthropic's Claude' is stated without any quantitative metrics (e.g., F1 scores, template accuracy), baseline details, error bars, or evaluation protocol, preventing verification of the central accuracy result.

    Authors: We agree that the abstract, being a high-level summary, would benefit from including key quantitative results to support the central claim. The full manuscript reports these details in the Experiments section, including F1 scores, template accuracy, comparisons to LLaMA 70B and Claude, baselines, and evaluation protocols on LogHub datasets. We will revise the abstract to incorporate specific metrics (e.g., average F1 scores and a brief note on the evaluation setup) while maintaining conciseness.
    revision: yes

  2. Referee: [Abstract, production validation paragraph] The utility demonstration on 600 million Frontier logs reports only qualitative pattern discovery ('uncovering critical patterns in temporal dynamics, node-level anomalies, and workload-error log correlations') with no accuracy metrics, ground-truth comparison on a held-out subset, error rates, or baseline against Drain/IPLoM or the untuned 8B model; this leaves the generalization claim from LogHub to real heterogeneous HPC logs unsupported.

    Authors: We acknowledge that the abstract describes the Frontier experiment qualitatively. The manuscript details the scale of the analysis and the discovered patterns, but production logs are unlabeled, precluding direct ground-truth accuracy metrics or held-out comparisons for the full dataset. We will revise the abstract to clarify the validation methods employed (e.g., sampling for manual review, cross-referencing with known system events, and limited comparisons to parsers like Drain on subsets) and to note the generalization evidence from observed patterns. This addresses the concern without overstating what is possible with unlabeled data.
    revision: partial
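
The "sampling for manual review" mentioned in the second response could be done in a single streaming pass, which matters at the scale of hundreds of millions of lines. The Python sketch below keeps a small uniform sample of raw lines per predicted template using reservoir sampling. It is a hedged illustration: the `(raw_line, template)` input shape and the sample size of 3 are assumptions, and the simulated rebuttal does not describe the actual sampling procedure.

```python
import random
from collections import defaultdict

def sample_for_review(parsed, per_template=3, seed=0):
    """Keep up to per_template raw lines for each template, chosen uniformly.

    parsed: iterable of (raw_line, template) pairs; consumed once, so it
    can be a generator over a very large log stream.
    """
    rng = random.Random(seed)
    seen = defaultdict(int)    # lines observed so far, per template
    kept = defaultdict(list)   # current reservoir, per template
    for raw, template in parsed:
        seen[template] += 1
        bucket = kept[template]
        if len(bucket) < per_template:
            bucket.append(raw)
        else:
            # Replace a kept line with probability per_template / seen.
            j = rng.randrange(seen[template])
            if j < per_template:
                bucket[j] = raw
    return dict(kept)

# Invented toy stream: one frequent template and one rare template.
stream = [(f"node {i} down", "node <*> down") for i in range(10)]
stream += [("fan 1 fault", "fan <*> fault"), ("fan 2 fault", "fan <*> fault")]
review = sample_for_review(stream)
```

Sampling per template rather than over the whole stream ensures rare templates, where parsing mistakes are easy to miss, still reach a human reviewer.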

standing simulated objections not resolved
  • Direct quantitative accuracy metrics, ground-truth comparisons, or error rates for the full set of 600 million unlabeled Frontier production logs cannot be provided, as comprehensive labels do not exist for this real-world dataset.

Circularity Check

0 steps flagged

No significant circularity; empirical evaluation on benchmarks and production logs stands independently

full rationale

The paper describes standard fine-tuning of an 8B LLaMA model on domain log templates plus instruction examples, followed by accuracy evaluation on LogHub datasets (compared to larger models) and separate application to 600M Frontier logs for pattern discovery. No equations, self-definitions, or fitted parameters are presented that reduce predictions to inputs by construction. No load-bearing self-citations or uniqueness theorems are invoked. The production-scale step is an independent deployment rather than a renamed fit, so the derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the standard assumption that LLMs can be domain-adapted via instruction tuning without introducing new free parameters or invented entities beyond the base model and existing log data.

axioms (1)
  • Domain assumption: Large language models can be effectively adapted to specialized technical domains such as HPC log parsing through instruction tuning and chain-of-thought examples.
    This premise underpins the hybrid fine-tuning methodology described in the abstract.

pith-pipeline@v0.9.0 · 5588 in / 1201 out tokens · 64850 ms · 2026-05-10T18:48:33.342424+00:00 · methodology

Reference graph

Works this paper leans on

35 extracted references · 4 canonical work pages · 2 internal anchors

  [1] P. Prakash, R. P. Hong Enriquez, S. Serebryakov, D. Grant, W. Brewer, and D. Milojicic, “From exploration to explanation: Ml-driven causal discovery for datacenter reliability at scale,” in Proceedings of the SC’25 Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 997–1002, 2025.
  [2] B. H. Park, S. Hukerikar, R. Adamson, and C. Engelmann, “Big data meets hpc log analytics: Scalable approach to understanding systems at extreme scale,” in 2017 IEEE International Conference on Cluster Computing (CLUSTER), pp. 758–765, IEEE, 2017.
  [3] P. He, J. Zhu, Z. Zheng, and M. R. Lyu, “Drain: An online log parsing approach with fixed depth tree,” in 2017 IEEE international conference on web services (ICWS), pp. 33–40, IEEE, 2017.
  [4] M. Du and F. Li, “Spell: Streaming parsing of system event logs,” in 2016 IEEE 16th International Conference on Data Mining (ICDM), pp. 859–864, IEEE, 2016.
  [5] A. Zhong, D. Mo, G. Liu, J. Liu, Q. Lu, Q. Zhou, J. Wu, Q. Li, and Q. Wen, “Logparser-llm: Advancing efficient log parsing with large language models,” in Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, (New York, NY, USA), pp. 4559–4570, Association for Computing Machinery, 2024.
  [6] J. Xu, R. Yang, Y. Huo, C. Zhang, and P. He, “Divlog: Log parsing with prompt enhanced in-context learning,” in Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, pp. 1–12, 2024.
  [7] P. Gupta, K. Bhukar, H. Kumar, S. Nagar, P. Mohapatra, and D. Kar, “Logan: An llm-based log analytics tool with causal inferencing,” in ICPE ’25: Proceedings of the 16th ACM/SPEC International Conference on Performance Engineering, (New York, NY, USA), pp. 54–56, Association for Computing Machinery, 2025.
  [8] S. Shan, Y. Huo, Y. Su, Y. Li, D. Li, and Z. Zheng, “Face it yourselves: An llm-based two-stage strategy to localize configuration errors via logs,” in Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis, ISSTA 2024, (New York, NY, USA), pp. 13–25, Association for Computing Machinery, 2024.
  [9] J. Wei, M. Bosma, V. Y. Zhao, K. Guu, A. W. Yu, B. Lester, N. Du, A. M. Dai, and Q. V. Le, “Finetuned language models are zero-shot learners,” arXiv preprint arXiv:2109.01652, 2021.
  [10] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al., “Training language models to follow instructions with human feedback,” Advances in neural information processing systems, vol. 35, pp. 27730–27744, 2022.
  [11] J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou, et al., “Chain-of-thought prompting elicits reasoning in large language models,” Advances in neural information processing systems, vol. 35, pp. 24824–24837, 2022.
  [12] A. Khan, H. Sim, S. S. Vazhkudai, A. R. Butt, and Y. Kim, “An analysis of system balance and architectural trends based on top500 supercomputers,” in The International Conference on High Performance Computing in Asia-Pacific Region, HPCAsia ’21, (New York, NY, USA), pp. 11–22, Association for Computing Machinery, 2021.
  [13] A. Khan, J. R. Lange, N. Hagerty, E. F. Posada, J. Holmen, J. B. White, A. Harris, V. M. Vergara, C. Zimmer, and S. Atchley, “An evaluation of the effect of network cost optimization for leadership class supercomputers,” in SC24: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–16, 2024.
  [14] A. A. Makanju, A. N. Zincir-Heywood, and E. E. Milios, “Clustering event logs using iterative partitioning,” in Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 1255–1264, 2009.
  [15] H. Hamooni, B. Debnath, J. Xu, H. Zhang, G. Jiang, and A. Mueen, “Logmine: Fast pattern recognition for log analytics,” in Proceedings of the 25th ACM international on conference on information and knowledge management, pp. 1573–1582, 2016.
  [16] M. Mizutani, “Incremental mining of system log format,” in 2013 IEEE International Conference on Services Computing, pp. 595–602, IEEE, 2013.
  [17] S. Messaoudi, A. Panichella, D. Bianculli, L. Briand, and R. Sasnauskas, “A search-based approach for accurate identification of log message formats,” in Proceedings of the 26th Conference on Program Comprehension, pp. 167–177, 2018.
  [18] H. Guo, S. Yuan, and X. Wu, “Logbert: Log anomaly detection via bert,” in 2021 international joint conference on neural networks (IJCNN), pp. 1–8, IEEE, 2021.
  [19] M. Du, F. Li, G. Zheng, and V. Srikumar, “Deeplog: Anomaly detection and diagnosis from system logs through deep learning,” in Proceedings of the 2017 ACM SIGSAC conference on computer and communications security, pp. 1285–1298, 2017.
  [20] R. Taori, I. Gulrajani, T. Zhang, Y. Dubois, X. Li, C. Guestrin, P. Liang, and T. B. Hashimoto, “Alpaca: A strong, replicable instruction-following model,” Stanford Center for Research on Foundation Models. https://crfm.stanford.edu/2023/03/13/alpaca.html, vol. 3, no. 6, p. 7, 2023.
  [21] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., “Language models are few-shot learners,” Advances in neural information processing systems, vol. 33, pp. 1877–1901, 2020.
  [22] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al., “Llama: Open and efficient foundation language models,” arXiv preprint arXiv:2302.13971, 2023.
  [23] E. Almazrouei, H. Alobeidli, A. Alshamsi, A. Cappelli, R. Cojocaru, M. Debbah, É. Goffinet, D. Hesslow, J. Launay, Q. Malartic, et al., “The falcon series of open language models,” arXiv preprint arXiv:2311.16867, 2023.
  [24] Z. Ma, A. R. Chen, D. J. Kim, T.-H. Chen, and S. Wang, “Llmparser: An exploratory study on using large language models for log parsing,” in Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, pp. 1–13, 2024.
  [25] S. Atchley, C. Zimmer, J. Lange, D. Bernholdt, V. Melesse Vergara, T. Beck, M. Brim, R. Budiardja, S. Chandrasekaran, M. Eisenbach, et al., “Frontier: exploring exascale,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–16, 2023.
  [26] N. Jain, T. Zhang, W.-L. Chiang, J. E. Gonzalez, K. Sen, and I. Stoica, “Llm-assisted code cleaning for training accurate code generators,” arXiv preprint arXiv:2311.14904, 2023.
  [27] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen, et al., “Lora: Low-rank adaptation of large language models,” ICLR, vol. 1, no. 2, p. 3, 2022.
  [28] Anthropic AI, “Claude 3.5 sonnet model card and technical overview.” https://www.anthropic.com/news/claude-3-5-sonnet, June 2024. Accessed: 2025-08-14.
  [29] Meta AI, “Meta llama 3 70b model card.” https://huggingface.co/meta-llama/Meta-Llama-3-70B, 2024. Accessed: 2025-08-14.
  [30] Meta AI, “Meta llama 3 8b model card.” https://huggingface.co/meta-llama/Llama-3.1-8B, 2024. Accessed: 2025-08-14.
  [31] J. Zhu, S. He, P. He, J. Liu, and M. R. Lyu, “Loghub: A large collection of system log datasets for ai-driven log analytics,” in 2023 IEEE 34th International Symposium on Software Reliability Engineering (ISSRE), pp. 355–366, IEEE, 2023.
  [32] Y.-C. Chen, “A tutorial on kernel density estimation and recent advances,” Biostatistics & Epidemiology, vol. 1, no. 1, pp. 161–187, 2017.
  [33] J. H. Ward Jr, “Hierarchical grouping to optimize an objective function,” Journal of the American Statistical Association, vol. 58, no. 301, pp. 236–244, 1963.
  [34] V. I. Levenshtein, “Binary codes capable of correcting deletions, insertions and reversals,” Soviet Physics Doklady, vol. 10, no. 8, pp. 707–710, 1966.
  [35] A. Ali and S. Renals, “Word error rate estimation for speech recognition: e-wer,” in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 20–24, Association for Computational Linguistics, 2018.