arxiv: 2605.16347 · v1 · pith:M4HVZPJJnew · submitted 2026-05-08 · 💻 cs.LG

HPC-LLM: Practical Domain Adaptation and Retrieval-Augmented Generation for HPC Support

Pith reviewed 2026-05-20 22:23 UTC · model grok-4.3

classification 💻 cs.LG
keywords: support, adaptation, cluster, computing, domain, model, operational, adapted

The pith

An 8B LLM adapted for HPC support tasks approaches the performance of a 14B general-purpose model while using less GPU memory and answering with lower latency.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

High-performance computing clusters present operational challenges for many researchers who need help with job schedulers, parallel frameworks, and resource management. General-purpose LLMs often fall short because they lack specialized knowledge of HPC environments. This paper shows how to build an effective support assistant by ingesting public HPC documentation, creating synthetic training examples, and applying lightweight fine-tuning with QLoRA to an 8B model. The resulting system, paired with retrieval, achieves results close to those of 14B-scale models while requiring far fewer computational resources during use.

Core claim

The central claim is that domain adaptation of Llama 3.1 8B via QLoRA on an HPC corpus of 9,000-24,000 examples, combined with retrieval-augmented generation, yields a practical assistant for Slurm scheduling, MPI execution, GPU utilization, filesystem management, and cluster troubleshooting. The adapted model is claimed to approach the performance of larger general-purpose models such as Qwen 2.5 14B while requiring less GPU memory and lower inference latency.

What carries the argument

QLoRA-based lightweight domain adaptation of an 8B Llama model on a curated HPC corpus, integrated with dense retrieval for context-aware responses.
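The retrieval half of this mechanism can be illustrated with a minimal sketch. It assumes documents have already been embedded as vectors; the toy three-dimensional vectors, the document IDs, the `top_k` and `build_prompt` helpers, and the prompt layout are all hypothetical, since the page does not state which embedding model, index, or prompt template the paper uses.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def top_k(query_vec, index, k=2):
    """Return the k highest-scoring (score, doc_id, text) entries."""
    scored = [(cosine(query_vec, vec), doc_id, text) for doc_id, text, vec in index]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]

def build_prompt(question, hits):
    """Place retrieved passages ahead of the question (hypothetical template)."""
    context = "\n".join(f"[{doc_id}] {text}" for _, doc_id, text in hits)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

# Toy index: (doc_id, passage, made-up embedding). Real embeddings would come
# from a dense encoder, which is not specified on this page.
INDEX = [
    ("slurm-01", "Submit batch jobs with sbatch and check them with squeue.", [0.9, 0.1, 0.0]),
    ("mpi-01", "Launch MPI ranks with srun or mpirun inside an allocation.", [0.1, 0.9, 0.0]),
    ("gpu-01", "Request GPUs with the --gres=gpu:N option.", [0.6, 0.0, 0.8]),
]

query_vec = [1.0, 0.0, 0.2]  # made-up embedding of the question below
hits = top_k(query_vec, INDEX, k=2)
prompt = build_prompt("How do I submit a GPU job?", hits)
print(prompt)
```

The adapted model would then generate from `prompt`. The value of `k` and the embedding dimension both affect latency and memory, so they would need to be fixed and reported for any comparison between models.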

Load-bearing premise

That the constructed HPC corpus and the specific evaluation cases on JetStream2 sufficiently represent the diversity of real-world HPC user needs and cluster environments.

What would settle it

Observing a significant performance drop when the model is tested on HPC queries from a previously unseen university cluster or with novel troubleshooting scenarios not covered in the training corpus.

Figures

Figures reproduced from arXiv: 2605.16347 by Izzat Alsmadi, Nourin Shahin.

Figure 1: HPC-LLM architecture. Row 1: user-facing API and dashboard. Row 2: Orchestrator, which coordinates the [...]
Original abstract

Modern scientific research increasingly depends on High-Performance Computing (HPC) infrastructures, yet many researchers face significant operational barriers when interacting with cluster environments, job schedulers, GPU resources, and parallel computing frameworks. General-purpose large language models (LLMs) provide useful coding assistance but often lack the domain-specific operational knowledge required for reliable HPC support. This paper presents HPC-LLM, a retrieval augmented and domain-adapted assistant designed to support common HPC workflows including Slurm scheduling, MPI execution, GPU utilization, filesystem management, and cluster troubleshooting. The proposed framework integrates automated documentation ingestion, dense retrieval, lightweight domain adaptation using QLoRA, and local inference within a modular orchestration pipeline. To support domain adaptation, we construct an HPC-oriented corpus from publicly available university HPC documentation, curated operational examples, and synthetic instruction-answer pairs generated from retrieved HPC content. The resulting dataset contains approximately 9,000 to 24,000 HPC-focused training examples spanning job scheduling, GPU computing, distributed training, storage systems, and cluster administration topics. We fine-tune Llama 3.1 8B using QLoRA and evaluate the resulting model against several open weight baselines under retrieval-augmented settings on JetStream2 infrastructure. Experimental results indicate that the adapted 8B model achieves performance comparable to substantially larger general-purpose models while operating under significantly lower GPU memory requirements and inference latency. In particular, the adapted model approaches the performance of Qwen 2.5 14B while requiring substantially fewer computational resources.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents HPC-LLM, a retrieval-augmented and domain-adapted LLM assistant for HPC workflows such as Slurm scheduling, MPI, GPU utilization, and cluster troubleshooting. It builds an HPC corpus of 9,000–24,000 examples from public documentation, operational cases, and synthetic pairs; applies QLoRA adaptation to Llama 3.1 8B; and reports that the resulting model achieves performance comparable to larger general-purpose models (e.g., approaching Qwen 2.5 14B) while using substantially lower GPU memory and latency, evaluated under RAG settings on JetStream2 infrastructure.

Significance. If the central comparability claim holds under properly controlled conditions, the work would offer a practical, resource-efficient path to domain-specific HPC support that lowers barriers for researchers without access to large-scale inference hardware. The modular pipeline combining automated ingestion, dense retrieval, and lightweight adaptation is a concrete contribution to applied LLM deployment in scientific computing.

major comments (2)
  1. [Abstract / Evaluation] Abstract and Evaluation section: the claim that the adapted 8B model 'approaches the performance of Qwen 2.5 14B' is load-bearing for the paper's central contribution, yet the manuscript provides no quantitative metrics, error bars, exact baseline configurations, retrieval-quality measurements, or statistical significance tests. Without these, the comparability result cannot be assessed or reproduced.
  2. [Data construction / Evaluation] Data construction and Evaluation sections: the corpus combines public documentation with synthetic instruction-answer pairs, but the manuscript does not describe train/test split construction, decontamination procedures, or whether evaluation queries on JetStream2 were drawn from an independent external source. This leaves open the possibility that reported gains reflect data overlap rather than genuine domain adaptation, directly undermining the fairness of the comparison to external baselines.
minor comments (2)
  1. [Abstract] The range 'approximately 9,000 to 24,000' for the training corpus size should be replaced by a single precise figure or a clear breakdown by source.
  2. [Framework description] Clarify the exact retrieval model, embedding dimension, and top-k value used in the RAG pipeline, as these parameters directly affect the reported inference latency and memory figures.
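The retrieval-quality measurements requested in major comment 1 could be computed along the following lines. This is a sketch only: the document IDs and relevance labels are invented for illustration, and nothing here reflects numbers from the paper.

```python
import math

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of relevant documents that appear in the top k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant) / len(relevant)

def dcg_at_k(gains, k):
    """Discounted cumulative gain with a log2 position discount."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

def ndcg_at_k(ranked_ids, relevance, k):
    """DCG of the ranking divided by the DCG of an ideal ordering."""
    gains = [relevance.get(doc_id, 0) for doc_id in ranked_ids]
    ideal = sorted(relevance.values(), reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0

# Hypothetical retriever output for one query, best match first.
ranked = ["d3", "d1", "d7", "d2", "d9"]
# Hypothetical human labels: d1 and d2 are the relevant passages.
relevance = {"d1": 1, "d2": 1}

r1 = recall_at_k(ranked, relevance, k=1)
r5 = recall_at_k(ranked, relevance, k=5)
n5 = ndcg_at_k(ranked, relevance, k=5)
print(f"recall@1={r1:.2f} recall@5={r5:.2f} nDCG@5={n5:.4f}")
```

Averaging these per-query scores over the evaluation set, and reporting them for each top-k setting, would let readers separate retrieval quality from generation quality.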

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. The comments highlight important areas for improving the rigor and reproducibility of our claims. We address each major comment point by point below and have made revisions to the manuscript where the concerns are valid.

  1. Referee: [Abstract / Evaluation] Abstract and Evaluation section: the claim that the adapted 8B model 'approaches the performance of Qwen 2.5 14B' is load-bearing for the paper's central contribution, yet the manuscript provides no quantitative metrics, error bars, exact baseline configurations, retrieval-quality measurements, or statistical significance tests. Without these, the comparability result cannot be assessed or reproduced.

    Authors: We agree that the current manuscript lacks sufficient quantitative detail to fully support the comparability claim. In the revised version, we will expand the Evaluation section to report concrete metrics (e.g., accuracy or success rate on HPC task categories), error bars derived from multiple independent runs, exact baseline configurations including retrieval parameters and prompting strategies, retrieval-quality measurements such as recall@5 and nDCG, and results of statistical significance tests comparing the adapted model to Qwen 2.5 14B. These additions will be placed in both the abstract summary and the main evaluation tables. revision: yes

  2. Referee: [Data construction / Evaluation] Data construction and Evaluation sections: the corpus combines public documentation with synthetic instruction-answer pairs, but the manuscript does not describe train/test split construction, decontamination procedures, or whether evaluation queries on JetStream2 were drawn from an independent external source. This leaves open the possibility that reported gains reflect data overlap rather than genuine domain adaptation, directly undermining the fairness of the comparison to external baselines.

    Authors: We acknowledge that the absence of explicit data-handling details creates ambiguity regarding potential contamination. We will revise the Data construction and Evaluation sections to describe the train/test split procedure (including the 80/20 ratio and hold-out criteria), decontamination steps (e.g., embedding-based similarity filtering to remove near-duplicates between training examples and evaluation queries), and confirmation that the JetStream2 evaluation queries were collected from live operational logs and user-submitted tickets that were never used in corpus construction or synthetic pair generation. This will be supported by a new subsection on data provenance and leakage prevention. revision: yes
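The split and decontamination steps described in the second response could look roughly like the sketch below. It deliberately swaps in word-trigram Jaccard overlap as a standard-library stand-in for the embedding-based similarity filtering the response mentions; the threshold, the helper names, and the sample questions are all hypothetical.

```python
import random

def ngrams(text, n=3):
    """Set of lowercase word n-grams; short texts yield one shorter tuple."""
    tokens = text.lower().split()
    if len(tokens) < n:
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def decontaminate(train, eval_queries, threshold=0.5, n=3):
    """Drop training examples that are near-duplicates of any evaluation query."""
    eval_grams = [ngrams(q, n) for q in eval_queries]
    return [ex for ex in train
            if all(jaccard(ngrams(ex, n), g) < threshold for g in eval_grams)]

def split_80_20(examples, seed=0):
    """Shuffle with a fixed seed and cut at 80 percent."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    cut = int(0.8 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

# Hypothetical training questions and one held-out evaluation query.
train = [
    "How do I submit a batch job with sbatch on the cluster",
    "How do I check my disk quota on the scratch filesystem",
    "Why does my MPI job hang at startup",
]
eval_queries = ["how do i submit a batch job with sbatch on the cluster?"]

clean = decontaminate(train, eval_queries)
train_part, test_part = split_80_20([f"example {i}" for i in range(10)])
print(len(clean), len(train_part), len(test_part))
```

An embedding-based filter would follow the same shape, with `jaccard` replaced by cosine similarity between sentence embeddings. Either way, the threshold is a free choice that would need to be reported.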

Circularity Check

0 steps flagged

No circularity: empirical adaptation and external benchmarking


The paper constructs an HPC corpus from public documentation, operational examples, and synthetic pairs, applies QLoRA fine-tuning to Llama 3.1 8B, and reports performance on JetStream2 against independent open-weight baselines such as Qwen 2.5 14B. No equations, predictions, or first-principles derivations are present that reduce reported gains to quantities defined by the paper's own fitted parameters or self-citations. Evaluation uses external infrastructure and models, satisfying the criterion for self-contained results against external benchmarks. No load-bearing steps match any enumerated circularity pattern.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the assumption that public HPC documentation plus synthetic pairs yield high-quality training data and that retrieval-augmented inference on JetStream2 reflects real operational value; no new physical entities or mathematical axioms are introduced.

free parameters (1)
  • QLoRA adaptation hyperparameters
    Rank, alpha, and dropout values for the lightweight fine-tuning step are not specified in the abstract and must be chosen to achieve the reported performance.
axioms (1)
  • domain assumption: Publicly available university HPC documentation combined with synthetic instruction pairs is sufficient to capture the operational knowledge needed for reliable cluster support.
    This premise underpins the construction of the 9,000-24,000 example training set and the claim of effective domain adaptation.
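To show why the unspecified QLoRA rank matters, the sketch below counts the trainable parameters a LoRA adapter adds at a given rank. The layer dimensions are assumptions taken from the publicly released Llama 3.1 8B configuration (hidden size 4096, key/value projection width 1024, MLP width 14336, 32 layers), not from the paper, and the ranks and target-module choices are hypothetical.

```python
def lora_params(d_in, d_out, rank):
    """LoRA adds A (rank x d_in) and B (d_out x rank) beside a frozen weight."""
    return rank * (d_in + d_out)

# Assumed Llama 3.1 8B shapes (from the public model config, not the paper).
HIDDEN, KV_DIM, MLP, LAYERS = 4096, 1024, 14336, 32

ATTENTION = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
}
FEED_FORWARD = {
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}

def total_trainable(rank, modules):
    """Sum adapter parameters over the chosen modules in every layer."""
    per_layer = sum(lora_params(d_in, d_out, rank) for d_in, d_out in modules.values())
    return LAYERS * per_layer

attn_r8 = total_trainable(8, ATTENTION)
attn_r16 = total_trainable(16, ATTENTION)
all_r16 = total_trainable(16, {**ATTENTION, **FEED_FORWARD})
print(attn_r8, attn_r16, all_r16)
```

Under these assumptions the adapter size scales linearly with rank and roughly triples when the feed-forward projections are included. Alpha and dropout do not change the parameter count: in standard LoRA, alpha scales the update by alpha divided by rank, and dropout regularizes the adapter input during training.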

pith-pipeline@v0.9.0 · 5805 in / 1554 out tokens · 44466 ms · 2026-05-20T22:23:35.403833+00:00 · methodology

