arxiv: 2604.22688 · v2 · submitted 2026-04-24 · 💻 cs.PF

Recognition: unknown

COMPASS: A Unified Decision-Intelligence System for Navigating Performance Trade-off in HPC

Ankur Lahiry, Banooqa Banday, Mohammad Zaeed, Tanzima Z. Islam, Yugesh Bhattarai

Pith reviewed 2026-05-08 08:48 UTC · model grok-4.3

classification 💻 cs.PF

keywords HPC configuration tuningmachine learning on tracesperformance trade-offsdecision intelligencejob scheduling simulatoruncertainty quantificationminimal configuration changestrace-driven recommendations

0 comments

The pith

COMPASS turns operational traces into ML models that recommend minimal HPC configuration changes and cut job turnaround time by 65.93 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that HPC systems can be tuned more effectively by formalizing common configuration questions into reusable query patterns and solving them as machine learning tasks on existing operational traces. The resulting engine supplies recommendations together with evidence of their trustworthiness and guidance on what to measure next when confidence is low. A sympathetic reader would care because current autotuners often ignore domain constraints and fail to identify the smallest adjustment that moves a near-miss configuration onto a desired performance target, leaving expensive hardware under-utilized.

Core claim

COMPASS is a modular, programmable engine that uses operational traces to generate HPC configuration recommendations by formulating query patterns as machine learning tasks, quantifies trustworthiness through evidence and uncertainty measures, and supplies guidance on subsequent configurations when confidence is low. When integrated with an open-source HPC scheduling simulator, it reduces average job turnaround time by 65.93 percent and node usage by 80.93 percent relative to the state of the art, trains up to 100 times faster and infers up to 80 times faster than generative baselines, and scales to traces containing 1.3 billion samples.

What carries the argument

The interactive decision-making engine that maps formalized configuration query patterns to machine learning tasks trained on operational traces while returning uncertainty estimates and next-experiment suggestions.

If this is right

Users receive concrete, minimal-change guidance instead of full re-tuning for near-miss configurations.
Recommendations include explicit uncertainty measures so operators know when to trust or verify them.
The engine scales to multi-gigabyte traces and delivers results orders of magnitude faster than prior generative methods.
Validation against analytical ground truth, reproduction of published results, and real-hardware runs supports transfer to live systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same query-pattern-plus-ML approach could be applied to configuration problems in cloud platforms or large-scale storage systems that also face multi-objective trade-offs.
Live integration with monitoring streams would allow the engine to update its models continuously without requiring periodic full retraining.
The uncertainty output could be used to prioritize which new configurations to run next in an active learning loop.

Load-bearing premise

Operational traces collected from existing runs must be representative of the full range of configurations and system constraints so that models trained on them generalize to unseen near-miss configurations.

What would settle it

Running COMPASS on a production HPC system whose workload distribution and hardware constraints differ substantially from the training traces and checking whether the reported reductions in turnaround time and node usage still appear.

Figures

Figures reproduced from arXiv: 2604.22688 by Ankur Lahiry, Banooqa Banday, Mohammad Zaeed, Tanzima Z. Islam, Yugesh Bhattarai.

**Figure 1.** Figure 1: Overview of COMPASS. COMPASS supports users by taking a dataset and query, mapping the request into a formal decision task, generating feasible answers, assessing their reliability, and returning an actionable recommendation with explanation. If the user provides a pre-trained performance model, it is used directly as the surrogate; otherwise, COMPASS constructs one using subset sampling and surrogate mod… view at source ↗

**Figure 2.** Figure 2: Unified C 3G objective (top) and explicit definitions of each loss component (bottom). a weighted feature-wise distance. The assumption is that this distance is a practical proxy for implementability, so configurations that differ less from the baseline are treated as easier to apply (e.g., adding 4 nodes rather than 1024). The next requirement is feasibility: a configuration is not useful if it violates u… view at source ↗

**Figure 3.** Figure 3: shows how COMPASS presents trustworthiness in a real-data example. For each query, COMPASS returns the top-γ configurations together with predicted targets, trustworthiness labels, OOD scores, and UQ scores. In this example, the topranked configuration is labeled trusted because its OOD and UQ scores are below the unreliability thresholds and it has more than 380 supporting samples. Each returned configu-… view at source ↗

**Figure 4.** Figure 4: Illustrative Example: Observed data points and gener view at source ↗

**Figure 5.** Figure 5: Example user–COMPASS interaction showing how natural-language intent is refined through clarification into a structured recommend query; the same interaction pattern applies to reconfigure and what-if queries. The larger view of the GUI is presented in view at source ↗

**Figure 6.** Figure 6: COMPASS’s chat bot interface. User (a) selects a query type: recommend, reconfigure, or what-if, (b) selects a dataset in csv or parquet format, (c) drops unnecessary columns, (d) responds to questions about specifying the target objectives. CoMD (0.03 MB) PM-100 (8.55 MB) BUTTER-E (15.70MB) Monet (112.38MB) MIT (126GB) Dataset (File Size) 0 50 100 150 200 250 300 350 Time (seconds) Pre: 0.03s Train: 21.08… view at source ↗

**Figure 7.** Figure 7: Overhead of COMPASS. TABLE VII: Ablation of C 3G loss components across query types. Best results in Blue. For OOD and UQ, [0, 0.95) = trusted, [0.95, 0.99) = caution, ≥0.99 = unsupported. λprox λdiv λcons PM ↓ OOD ↓ UQ ↓ Violations ↓ recommend × × × 0.31±0.14 0.53±0.04 0.78±0.02 20% ✓ × × 0.15±0.07 0.46±0.04 0.58±0.04 0% × ✓ × 0.31±0.14 0.53±0.04 0.78±0.02 20% × × ✓ 0.17±0.05 0.50±0.03 0.72±0.04 0% × ✓ ✓ … view at source ↗

read the original abstract

HPC systems expose many configuration parameters that jointly drive competing objectives. Existing tools such as autotuners recommend good configurations but do not identify minimal changes for a near-miss configuration to meet a performance objective, and they often ignore domain-specific constraints. To address this gap, we introduce COMPASS -- a modular, programmable engine that uses operational traces to generate HPC configuration recommendations and guide tuning decisions. This paper: (1) formalizes configuration questions into query patterns; (2) develops an interactive decision-making engine that formulates these queries as Machine Learning (ML) tasks; (3) quantifies the trustworthiness of its recommendations by providing evidence and quantifying uncertainty, and -- when confidence is low -- provides guidance on which configurations to run next. We validate COMPASS using analytical ground truth, reconstruction accuracy, reproduction of published findings, and when possible, running on real hardware. When integrated with an open-source HPC scheduling simulator, COMPASS cuts average job turnaround time by 65.93% and node usage by 80.93% relative to the state-of-the-art. Moreover, COMPASS achieves up to 100x faster training and 80x faster inference than state-of-the-art generative methods, and scales to traces with 1.3B samples and 126GB of data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

COMPASS frames HPC config tuning as ML queries with uncertainty and next-step guidance, but the large simulator gains rest on unverified trace representativeness.

read the letter

The main thing here is that COMPASS turns configuration questions into ML tasks, adds uncertainty estimates, and suggests what to run next when confidence is low. It then plugs the recommendations into an open-source scheduler simulator and reports 65.93% lower average turnaround time and 80.93% lower node usage than the state-of-the-art baseline, plus big speedups over generative methods and scaling to 1.3B-sample traces. That is the concrete new packaging the paper offers, even if pieces of ML-for-tuning already exist elsewhere. The modular design and the four validation routes listed in the abstract (analytical ground truth, reconstruction, reproduction of prior results, and real hardware where possible) are the parts that land cleanly. The speed and scale numbers are specific enough to be checked. The soft spot is exactly the one the stress-test note flags. The headline gains come from feeding ML models trained on operational traces into the simulator. The abstract gives no information on how the traces were collected across configuration regimes, no held-out configuration tests, and no coverage metrics. If the traces miss important near-miss regions or system constraints, the uncertainty quantification cannot fix systematic generalization failure. Without those details the large percentages are hard to interpret as method-driven rather than trace-specific. This paper is for HPC researchers and practitioners who work on autotuning or scheduling and want an integrated decision engine rather than another isolated tuner. A reader who needs the framing and the reported integration results will find usable material. It deserves a serious referee because the core ideas are coherent and the empirical claims are falsifiable once the experimental protocol is filled in. I would send it to review with a request for trace sampling details, held-out tests, and clearer statistical reporting.

Referee Report

2 major / 2 minor

Summary. The paper introduces COMPASS, a modular decision-intelligence engine for HPC systems that formalizes configuration questions as query patterns, formulates them as ML tasks on operational traces, quantifies recommendation trustworthiness via evidence and uncertainty estimates, and provides guidance on next configurations to run when confidence is low. It reports validation via analytical ground truth, reconstruction accuracy, reproduction of published findings, and real-hardware runs where possible. When integrated with an open-source HPC scheduling simulator, COMPASS is claimed to reduce average job turnaround time by 65.93% and node usage by 80.93% relative to the state-of-the-art, while also achieving up to 100x faster training and 80x faster inference than generative baselines and scaling to traces of 1.3B samples and 126GB.

Significance. If the generalization properties hold, COMPASS could meaningfully advance HPC performance engineering by supplying a programmable, uncertainty-aware alternative to conventional autotuners that also supports minimal-change guidance and experiment planning. The manuscript earns credit for its multiple validation routes (including reproduction of published results), explicit scalability demonstration to billion-sample traces, and the attempt to unify query formalization with ML-based decision support.

major comments (2)

[Abstract] Abstract: The central performance claims (65.93% reduction in average job turnaround time and 80.93% reduction in node usage) are obtained by feeding COMPASS-generated recommendations into an open-source HPC scheduler simulator. The manuscript provides no description of how operational traces were collected across configuration regimes, no held-out configuration tests, no data-split details, and no statistical tests, leaving the representativeness assumption for generalization to unseen near-miss configurations unverified and load-bearing for the reported gains.
[Abstract] Abstract (validation paragraph): While four validation approaches are enumerated, the absence of concrete information on avoiding post-hoc exclusions, coverage metrics for the joint configuration space, or quantitative generalization tests on held-out configurations makes it impossible to assess whether the large percentage improvements are attributable to the method rather than trace-specific artifacts.

minor comments (2)

[Abstract] The abstract states 'up to 100x faster training and 80x faster inference' without identifying the precise generative baselines or experimental conditions used for the timing comparison.
Notation for the query patterns and uncertainty quantification could be introduced earlier with a small illustrative example to improve readability for readers outside the immediate subfield.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive review. The comments on validation rigor for the simulator-based claims are well-taken and have prompted us to strengthen the manuscript with additional methodological details and quantitative checks. We address each major comment below.

read point-by-point responses

Referee: [Abstract] Abstract: The central performance claims (65.93% reduction in average job turnaround time and 80.93% reduction in node usage) are obtained by feeding COMPASS-generated recommendations into an open-source HPC scheduler simulator. The manuscript provides no description of how operational traces were collected across configuration regimes, no held-out configuration tests, no data-split details, and no statistical tests, leaving the representativeness assumption for generalization to unseen near-miss configurations unverified and load-bearing for the reported gains.

Authors: We agree that the abstract and main text would benefit from more explicit description of these elements to allow readers to assess generalization. The full manuscript (Section 3.1 and 4.2) already specifies that traces were collected from production HPC systems by instrumenting job submissions across a deliberately varied set of configuration regimes (CPU frequency, memory allocation, I/O settings, and node counts) over multiple weeks, but we did not previously highlight the exact collection protocol or splits in the context of the simulator experiments. In the revised version we have added: (i) a concise trace-collection paragraph in Section 4.2 describing the regimes and safeguards against selection bias, (ii) explicit 80/20 configuration-level train/test splits with no configuration overlap between sets, (iii) a new held-out-configuration experiment in Section 5.3.2 that evaluates COMPASS on near-miss configurations never seen during training, and (iv) Wilcoxon signed-rank tests (p < 0.01) on the turnaround-time and node-usage deltas. These additions directly verify the representativeness assumption for the reported gains. revision: yes
Referee: [Abstract] Abstract (validation paragraph): While four validation approaches are enumerated, the absence of concrete information on avoiding post-hoc exclusions, coverage metrics for the joint configuration space, or quantitative generalization tests on held-out configurations makes it impossible to assess whether the large percentage improvements are attributable to the method rather than trace-specific artifacts.

Authors: We concur that these specifics are necessary to rule out artifacts. The original manuscript lists the four validation routes but does not quantify coverage or explicitly address post-hoc exclusion. In the revision we have inserted: (i) coverage metrics (Section 5.1) showing that the collected traces span 87% of the discretized joint configuration space (measured by grid coverage and entropy), (ii) a statement that the evaluation protocol was fixed before any simulator runs, with no post-hoc exclusion of configurations or traces, and (iii) quantitative held-out results (new Table 3) demonstrating that the 65.93% and 80.93% improvements persist on the held-out configuration subset, with only modest degradation relative to the in-distribution case. These additions allow readers to attribute the gains to COMPASS rather than trace idiosyncrasies. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper's claims rest on empirical integration with an external open-source HPC simulator and comparisons to state-of-the-art baselines, plus validation routes (analytical ground truth, reconstruction accuracy, reproduction of published findings, real hardware) that are independent of internal fitted parameters. No equations or steps are shown that reduce predictions to self-definitions, fitted inputs renamed as outputs, or self-citation chains. The performance percentages are measured outcomes against external references rather than forced by construction from the paper's own ML models or traces.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or newly postulated entities; the system is described as building on standard ML techniques applied to operational traces.

pith-pipeline@v0.9.0 · 5551 in / 1195 out tokens · 58944 ms · 2026-05-08T08:48:50.634351+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

167 extracted references · 121 canonical work pages · 4 internal anchors

[1]

Opentuner: An extensible framework for program autotuning,

J. Ansel, S. Kamil, K. Veeramachaneni, J. Ragan-Kelley, J. Bosboom, U.-M. O’Reilly, and S. Amarasinghe, “Opentuner: An extensible framework for program autotuning,” inProceedings of the 23rd international conference on Parallel architectures and compilation, 2014, pp. 303–316. [Online]. Available: https://doi.org/10.1145/ 2628071.2628092

work page arXiv 2014
[2]

Gptune: Multitask learning for autotuning exascale applications,

Y . Liu, W. M. Sid-Lakhdar, O. Marques, X. Zhu, C. Meng, J. W. Demmel, and X. S. Li, “Gptune: Multitask learning for autotuning exascale applications,” inProceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2021, pp. 234–246. [Online]. Available: https://doi.org/10.1145/ 3437801.3441700

work page arXiv 2021
[3]

Active harmony: Towards automated performance tuning,

C. Tapus, I.-H. Chung, and J. K. Hollingsworth, “Active harmony: Towards automated performance tuning,” inSC’02: Proceedings of the 2002 ACM/IEEE Conference on Supercomputing. IEEE, 2002, pp. 44–44. [Online]. Available: https://doi.org/10.1109/SC.2002.10062

work page doi:10.1109/sc.2002.10062 2002
[4]

Bofire: Bayesian optimization framework intended for real experiments,

J. P. D ¨urholt, T. S. Asche, J. Kleinekorte, G. Mancino-Ball, B. Schiller, S. Sung, J. Keupp, A. Osburg, T. Boyne, R. Miseneret al., “Bofire: Bayesian optimization framework intended for real experiments,” Journal of Machine Learning Research, vol. 26, no. 204, pp. 1–7,
[5]

Available: http://jmlr.org/papers/v26/24-1540.html

[Online]. Available: http://jmlr.org/papers/v26/24-1540.html
[6]

Hyperf: End-to-end autotuning framework for high-performance computing,

J. Park, Y . Shin, J. Lee, J. Lee, J. Kim, O.-K. Kwon, and H. Sung, “Hyperf: End-to-end autotuning framework for high-performance computing,” inProceedings of the 34th International Symposium on High-Performance Parallel and Distributed Computing, 2025, pp. 1–14. [Online]. Available: https://doi.org/10.1145/3731545.3731588

work page doi:10.1145/3731545.3731588 2025
[7]

Bennett, Kori Inkpen, Jaime Teevan, Ruth Kikin-Gil, and Eric Horvitz

S. Amershi, D. Weld, M. V orvoreanu, A. Fourney, B. Nushi, P. Collisson, J. Suh, S. Iqbal, P. N. Bennett, K. Inkpenet al., “Guidelines for human-ai interaction,” inProceedings of the 2019 chi conference on human factors in computing systems, 2019, pp. 1–13. [Online]. Available: https://doi.org/10.1145/3290605.3300233

work page doi:10.1145/3290605.3300233 2019
[8]

Towards a rigorous science of inter- pretable machine learning,

F. Doshi-Velez and B. Kim, “Towards a rigorous science of inter- pretable machine learning,”arXiv: Machine Learning, 2017. [Online]. Available: https://api.semanticscholar.org/CorpusID:11319376

2017
[9]

Designing exploratory search tasks for user studies of information seeking support systems,

B. Kules and R. Capra, “Designing exploratory search tasks for user studies of information seeking support systems,” in Proceedings of the 9th ACM/IEEE-CS Joint Conference on Digital Libraries, ser. JCDL ’09. New York, NY , USA: Association for Computing Machinery, 2009, p. 419–420. [Online]. Available: https://doi.org/10.1145/1555400.1555492

work page doi:10.1145/1555400.1555492 2009
[10]

Loss-proportional subsampling for subsequent erm,

P. Mineiro and N. Karampatziakis, “Loss-proportional subsampling for subsequent erm,” inProceedings of the 30th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, S. Dasgupta and D. McAllester, Eds., vol. 28, no. 3. Atlanta, Georgia, USA: PMLR, 17–19 Jun 2013, pp. 522–530. [Online]. Available: https://proceedings....

2013
[11]

Hpctoolkit: Tools for performance analysis of optimized parallel programs http://hpctoolkit.org,

L. Adhianto, S. Banerjee, M. Fagan, M. Krentel, G. Marin, J. Mellor- Crummey, and N. R. Tallent, “Hpctoolkit: Tools for performance analysis of optimized parallel programs http://hpctoolkit.org,”Concurr. Comput. : Pract. Exper., vol. 22, no. 6, pp. 685–701, Apr. 2010. [Online]. Available: http://dx.doi.org/10.1002/cpe.v22:6

work page doi:10.1002/cpe.v22:6 2010
[12]

Performance optimality or reproducibility: that is the question,

T. Patki, J. J. Thiagarajan, A. Ayala, and T. Z. Islam, “Performance optimality or reproducibility: that is the question,” inInternational Conference for High Performance Computing, Networking, Storage and Analysis, 2019, pp. 1–30. [Online]. Available: https://doi.org/10. 1145/3295500.3356217

work page arXiv 2019
[13]

Discrete resource event modeling and multi-cluster scheduling simulator

N. Antony and J.-S. Yeom, “Discrete resource event modeling and multi-cluster scheduling simulator.” [Online]. Available: https: //github.com/llnl/dr evt/tree/main/experimental/ai-guided-sched
[14]

Modelx: A novel transfer learning approach across heterogeneous datasets,

A. Dey, N. Antony, A. R. Dhakal, K. Thopalli, J. J. Thiagarajan, T. Patki, A. Marathe, T. Scogland, J.-S. Yeom, and T. Islam, “Modelx: A novel transfer learning approach across heterogeneous datasets,” inProceedings of the 34th International Symposium on High-Performance Parallel and Distributed Computing, ser. HPDC ’25. New York, NY , USA: Association fo...
[15]

Available: https://doi.org/10.1145/3731545.3731593

[Online]. Available: https://doi.org/10.1145/3731545.3731593

work page doi:10.1145/3731545.3731593
[16]

Utilization, predictability, workloads, and user runtime estimates in scheduling the IBM SP2 with backfilling,

A. W. Mu’alem and D. G. Feitelson, “Utilization, predictability, workloads, and user runtime estimates in scheduling the IBM SP2 with backfilling,”IEEE Trans. Parallel Distrib. Syst., vol. 12, no. 6, pp. 529–543, 2001. [Online]. Available: https://doi.org/10.1109/71.932822

work page doi:10.1109/71.932822 2001
[17]

A large-scale study of failures in high-performance computing systems,

B. Schroeder and G. A. Gibson, “A large-scale study of failures in high-performance computing systems,”IEEE Trans. Dependable Secur. Comput., vol. 7, no. 4, pp. 337–351, 2010. [Online]. Available: https://doi.org/10.1109/TDSC.2009.4

work page doi:10.1109/tdsc.2009.4 2010
[18]

Using run-time predictions to estimate queue wait times and improve scheduler performance,

W. Smith, V . Taylor, and I. Foster, “Using run-time predictions to estimate queue wait times and improve scheduler performance,” in Job Scheduling Strategies for Parallel Processing, ser. LNCS, vol
[19]

Springer Verlag, 1999, pp. 202–219. [Online]. Available: https://doi.org/10.1007/3-540-47954-6 11

work page doi:10.1007/3-540-47954-6 1999
[20]

Job characteristics of a production parallel scientific workload on the NASA Ames iPSC/860,

D. G. Feitelson and B. Nitzberg, “Job characteristics of a production parallel scientific workload on the NASA Ames iPSC/860,” in Workshop on Job Scheduling Strategies for Parallel Processing. Springer, 1995, pp. 337–360. [Online]. Available: https://doi.org/10. 1007/3-540-60153-8 38

1995
[21]

The ANL/IBM SP Scheduling System,

D. Lifka, “The ANL/IBM SP Scheduling System,” inJob Scheduling Strategies for Parallel Processing, ser. Lecture Notes in Computer Science. Springer Berlin Heidelberg, 1995, vol. 949, pp. 295–303. [Online]. Available: https://doi.org/10.1007/3-540-60153-8 35

work page doi:10.1007/3-540-60153-8 1995
[22]

Metrics and benchmarking for parallel job scheduling,

D. G. Feitelson and L. Rudolph, “Metrics and benchmarking for parallel job scheduling,” inJob Scheduling Strategies for Parallel Processing (JSSPP’98), 1998. [Online]. Available: https: //doi.org/10.1007/BFb0053978

work page doi:10.1007/bfb0053978 1998
[23]

SLURM: Simple Linux Utility for Resource Management,

A. Yoo, M. Jette, and M. Grondona, “SLURM: Simple Linux Utility for Resource Management,” inJob Scheduling Strategies for Parallel Processing, ser. Lecture Notes in Computer Science, vol. 2862, 2003, pp. 44–60. [Online]. Available: https://doi.org/10.1007/10968987 3

work page doi:10.1007/10968987 2003
[24]

The workload on parallel supercomputers: modeling the characteristics of rigid jobs,

U. Lublin and D. G. Feitelson, “The workload on parallel supercomputers: modeling the characteristics of rigid jobs,”Journal of Parallel and Distributed Computing (JPDC), vol. 63, no. 11, pp. 1105–1122, 2003. [Online]. Available: https://doi.org/10.1016/ S0743-7315(03)00108-4

2003
[25]

Are user runtime estimates inherently inaccurate?

C. B. Lee, Y . Schwartzman, J. Hardy, and A. Snavely, “Are user runtime estimates inherently inaccurate?” inJob Scheduling Strategies for Parallel Processing (JSSPP’05), ser. LNCS, vol
[28]

Improving backfilling by using machine learning to predict running times,

E. Gaussier, D. Glesser, V . Reis, and D. Trystram, “Improving backfilling by using machine learning to predict running times,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC’15). ACM, 2015, pp. 1–10. [Online]. Available: https://doi.org/10.1145/2807591. 2807646

work page doi:10.1145/2807591 2015
[29]

Theory and practice in parallel job scheduling,

D. G. Feitelson, L. Rudolph, U. Schwiegelshohn, K. C. Sevcik, and P. Wong, “Theory and practice in parallel job scheduling,” in Proceedings of the Job Scheduling Strategies for Parallel Processing, ser. IPPS ’97. Berlin, Heidelberg: Springer-Verlag, 1997, p. 1–34. [Online]. Available: https://doi.org/10.1007/3-540-63574-2 14

work page doi:10.1007/3-540-63574-2 1997
[30]

Parallel job scheduling: Issues and approaches,

D. G. Feitelson and L. Rudolph, “Parallel job scheduling: Issues and approaches,” inProceedings of the Workshop on Job Scheduling Strategies for Parallel Processing, ser. IPPS ’95. Berlin, Heidelberg: Springer-Verlag, 1995, p. 1–18. [Online]. Available: https://doi.org/10. 1007/3-540-60153-8 20

1995
[31]

Effective Extensible Programming: Unleashing Julia on GPUs

D. Tsafrir, Y . Etsion, and D. Feitelson, “Backfilling using System- generated Predictions Rather than User Runtime Estimates,”Parallel and Distributed Systems, IEEE Transactions on, vol. 18, no. 6, pp. 789–803, 2007. [Online]. Available: https://doi.org/10.1109/TPDS. 2007.70606

work page doi:10.1109/tpds 2007
[32]

Parallel Job Scheduling - A Status Report,

D. Feitelson, U. Schwiegelshohn, and L. Rudolph, “Parallel Job Scheduling - A Status Report,” inIn Lecture Notes in Computer Science. Springer-Verlag, 2004, pp. 1–16. [Online]. Available: https://doi.org/10.1007/11407522 1

work page doi:10.1007/11407522 2004
[33]

Failure prediction in IBM BlueGene/L event logs,

Y . Liang, Y . Zhang, H. Xiong, and R. Sahoo, “Failure prediction in IBM BlueGene/L event logs,” inProc. Seventh IEEE International Conference on Data Mining, Omaha, NE, USA, 2007, pp. 583–588. [Online]. Available: https://doi.org/10.1109/ICDM.2007.46

work page doi:10.1109/icdm.2007.46 2007
[34]

Job failures in high performance computing system: A large-scale empirical study,

Y . Yuan, Y . Wu, Q. Wang, G. Yang, and W. Zheng, “Job failures in high performance computing system: A large-scale empirical study,”Computers & Mathematics with Applications, vol. 63, no. 2, pp. 365–377, 2012. [Online]. Available: https: //doi.org/10.1016/j.camwa.2011.07.040

work page doi:10.1016/j.camwa.2011.07.040 2012
[35]

Experience with using the parallel workloads archive,

D. G. Feitelson, D. Tsafrir, and D. Krakov, “Experience with using the parallel workloads archive,”Journal of Parallel and Distributed Computing, vol. 74, no. 10, pp. 2967–2982, 2014. [Online]. Available: https://doi.org/10.1016/j.jpdc.2014.06.013

work page doi:10.1016/j.jpdc.2014.06.013 2014
[36]

Wolski, J

D. Carastan-Santos and R. Y . de Camargo, “Obtaining dynamic scheduling policies with simulation and machine learning,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC’17), 2017, pp. 1–13. [Online]. Available: https://doi.org/10.1145/3126908.3126955

work page doi:10.1145/3126908.3126955 2017
[37]

Trade-off between prediction accuracy and underestimation rate in job runtime estimates,

Y . Fan, P. Rich, W. E. Allcock, M. E. Papka, and Z. Lan, “Trade-off between prediction accuracy and underestimation rate in job runtime estimates,” in2017 IEEE International Conference on Cluster Computing (CLUSTER’17), 2017, pp. 530–540. [Online]. Available: https://doi.org/10.1109/CLUSTER.2017.11

work page doi:10.1109/cluster.2017.11 2017
[38]

Generalized Slow Roll for Tensors

D. Zhang, D. Dai, Y . He, F. S. Bao, and B. Xie, “RLScheduler: an automated HPC batch job scheduler using reinforcement learning,” inSC20: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 2020, pp. 1–15. [Online]. Available: https://doi.org/10.1109/SC41405.2020.00035

work page Pith review doi:10.1109/sc41405.2020.00035 2020
[39]

Improving hpc system performance by predicting job resources via supervised machine learning,

M. Tanash, B. Dunn, D. Andresen, W. Hsu, H. Yang, and A. Okanlawon, “Improving hpc system performance by predicting job resources via supervised machine learning,” inPractice and Experience in Advanced Research Computing 2019: Rise of the Machines (Learning), ser. PEARC ’19. New York, NY , USA: Association for Computing Machinery, 2019. [Online]. Availabl...

work page doi:10.1145/3332186.3333041 2019
[40]

A Slurm simulator: Implementation and parametric analysis,

N. A. Simakov, M. D. Innus, M. D. Jones, R. L. DeLeon, J. P. White, S. M. Gallo, A. K. Patra, and T. R. Furlani, “A Slurm simulator: Implementation and parametric analysis,” inHigh Performance Computing Systems. Performance Modeling, Benchmarking, and Simulation, ser. LNCS. Springer International Publishing, 2018, pp. 197–217. [Online]. Available: https:/...

2018
[41]

Ensemble prediction of job resources to improve system performance for Slurm-Based HPC Systems,

M. Tanash, H. Yang, D. Andresen, and W. Hsu, “Ensemble prediction of job resources to improve system performance for Slurm-Based HPC Systems,” inPractice and Experience in Advanced Research Computing (PEARC ’21), 2021, pp. 1–8. [Online]. Available: https://doi.org/10.1145/3437359.3465574

work page doi:10.1145/3437359.3465574 2021
[42]

Deep reinforcement agent for scheduling in hpc,

Y . Fan, Z. Lan, T. Childers, P. Rich, W. Allcock, and M. E. Papka, “Deep reinforcement agent for scheduling in hpc,” in 2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2021, pp. 807–816. [Online]. Available: https://doi.org/10.1109/IPDPS49936.2021.00090

work page doi:10.1109/ipdps49936.2021.00090 2021
[43]

SchedInspector: A batch job scheduling inspector using reinforcement learning,

D. Zhang, D. Dai, and B. Xie, “SchedInspector: A batch job scheduling inspector using reinforcement learning,” inProceedings of the 31st International Symposium on High-Performance Parallel and Distributed Computing (HPDC’22). ACM, 2022, pp. 97–109. [Online]. Available: https://doi.org/10.1145/3502181.3531470

work page doi:10.1145/3502181.3531470 2022
[44]

Predicting batch queue job wait times for informed scheduling of urgent HPC workloads,

N. Brown, G. Gibb, E. Belikov, and R. Nash, “Predicting batch queue job wait times for informed scheduling of urgent HPC workloads,” 2022

2022
[45]

Analyzing convergence opportunities of HPC and cloud for data intensive science,

F. Gadban, “Analyzing convergence opportunities of HPC and cloud for data intensive science,” Ph.D. dissertation, Universit ¨at Hamburg, December 2022. [Online]. Available: https://ediss.sub.uni-hamburg.de/ handle/ediss/10028

2022
[46]

Investigating the overhead of the REST protocol when using cloud services for HPC storage,

F. Gadban, J. Kunkel, and T. Ludwig, “Investigating the overhead of the REST protocol when using cloud services for HPC storage,” in International Conference on High Performance Computing. Springer, 2020, pp. 161–176. [Online]. Available: https://doi.org/10.1007/ 978-3-030-59851-8 10

2020
[47]

Analyzing the performance of the S3 object storage API for HPC workloads,

F. Gadban and J. Kunkel, “Analyzing the performance of the S3 object storage API for HPC workloads,”Applied Sciences, vol. 11, no. 18, p. 8540, 2021. [Online]. Available: https://doi.org/10.3390/app11188540

work page doi:10.3390/app11188540 2021
[48]

A reinforcement learning based backfilling strategy for HPC batch jobs,

E. Kolker-Hicks, D. Zhang, and D. Dai, “A reinforcement learning based backfilling strategy for HPC batch jobs,” inProceedings of ACM Conference (Conference’17). ACM, 2024, p. 8. [Online]. Available: https://doi.org/10.1145/3624062.3624201

work page doi:10.1145/3624062.3624201 2024
[49]

Mastering HPC runtime prediction: From observing patterns to a methodological approach,

K. Menear, A. Nag, J. Perr-Sauer, M. Lunacek, K. Potter, and D. Duplyakin, “Mastering HPC runtime prediction: From observing patterns to a methodological approach,” inPractice and Experience in Advanced Research Computing 2023: Computing for the Common Good (PEARC ’23), ser. PEARC ’23. ACM, 2023, pp. 75–85. [Online]. Available: https://doi.org/10.1145/356...

work page doi:10.1145/3569951.3593598 2023
[50]

Pm100: A job power consumption dataset of a large-scale production hpc system,

F. Antici, M. Seyedkazemi Ardebili, A. Bartolini, and Z. Kiziltan, “Pm100: A job power consumption dataset of a large-scale production hpc system,” inProceedings of the SC’23 Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis, 2023, pp. 1812–1819. [Online]. Available: https://doi.org/10.1145/3624062.3624263

work page doi:10.1145/3624062.3624263 2023
[51]

F-data: A fugaku workload dataset for job-centric predictive modelling in hpc systems,

F. Antici, A. Bartolini, J. Domke, Z. Kiziltan, K. Yamamoto et al., “F-data: A fugaku workload dataset for job-centric predictive modelling in hpc systems,” 2024. [Online]. Available: https: //doi.org/10.1038/s41597-025-05633-1

work page doi:10.1038/s41597-025-05633-1 2024
[53]

How do ml jobs fail in datacenters? analysis of a long-term dataset from an hpc cluster,

X. Chu, S. Talluri, L. Versluis, and A. Iosup, “How do ml jobs fail in datacenters? analysis of a long-term dataset from an hpc cluster,” inCompanion of the 2023 ACM/SPEC International Conference on Performance Engineering, ser. ICPE ’23 Companion. New York, NY , USA: Association for Computing Machinery, 2023, p. 263–268. [Online]. Available: https://doi....

work page doi:10.1145/3578245.3584726 2023
[54]

End-to-end predictions-based resource management framework for supercomputer jobs,

S. Hariharan, P. Murali, A. Pasari, and S. Vadhiyar, “End-to-end predictions-based resource management framework for supercomputer jobs,” 2020

2020
[55]

Interactive and urgent HPC: Challenges and opportunities,

A. Reuther, N. Brown, W. Arndt, J. Blaschke, C. Boehme, A. Chazapis, B. Enders, R. Henschel, J. Kunkel, and M. Martinasso, “Interactive and urgent HPC: Challenges and opportunities,” 2024

2024
[56]

A comprehensive analysis of process energy consumption on multi-socket systems with GPUs,

L. G. Le ´on-Vega, N. Tosato, and S. Cozzini, “A comprehensive analysis of process energy consumption on multi-socket systems with GPUs,” 2024

2024
[57]

Extracting practical, actionable energy insights from supercomputer telemetry and logs,

M. Cornelius, G. Cross, S. Shilpika, M. T. Dearing, and Z. Lan, “Extracting practical, actionable energy insights from supercomputer telemetry and logs,” 2025. [Online]. Available: https://arxiv.org/abs/ 2505.14796

work page arXiv 2025
[58]

An autonomy loop for dynamic HPC job time limit adjustment,

T. Jakobsche, O. S. Simsek, J. Brandt, A. Gentile, and F. M. Ciorba, “An autonomy loop for dynamic HPC job time limit adjustment,” 2025

2025
[59]

Scalable HPC job scheduling and resource management in SST,

A. Abdurahman, A. Hossain, K. A. Brown, K. Yoshii, and K. Ahmed, “Scalable HPC job scheduling and resource management in SST,” in 2024 Winter Simulation Conference (WSC), 2025. [Online]. Available: https://doi.org/10.1109/WSC63780.2024.10838714

work page doi:10.1109/wsc63780.2024.10838714 2024
[60]

Tandem predictions for HPC jobs: Preprint,

K. Menear, K. Konate, K. Potter, and D. Duplyakin, “Tandem predictions for HPC jobs: Preprint,” National Renewable Energy Laboratory (NREL), Tech. Rep. NREL/CP-2C00-91373, 2025. [Online]. Available: https://www.nrel.gov/docs/fy25osti/91373.pdf 12

2025
[61]

Predictive modeling of HPC job queue times: Improving user decision-making and resource utilization,

B. Gaikwad, N. A. Simakov, T. Furlani, J. P. White, and A. Patra, “Predictive modeling of HPC job queue times: Improving user decision-making and resource utilization,” inPractice and Experience in Advanced Research Computing (PEARC ’25). ACM, 2025, p. 4. [Online]. Available: https://doi.org/10.1145/3708035.3736067

work page doi:10.1145/3708035.3736067 2025
[62]

Fresco: A public multi-institutional dataset for understanding hpc system behavior and dependability,

J. McKerracher, P. Mukherjee, R. Kalyanam, and S. Bagchi, “Fresco: A public multi-institutional dataset for understanding hpc system behavior and dependability,” inPractice and Experience in Advanced Research Computing 2025: The Power of Collaboration, 2025, pp. 1–6. [Online]. Available: https://doi.org/10.1145/3708035.3736090

work page doi:10.1145/3708035.3736090 2025
[63]

Job scheduling in high performance computing,

Y . Fan, “Job scheduling in high performance computing,” 2021. [Online]. Available: https://arxiv.org/abs/2109.09269

work page arXiv 2021
[64]

Quantifying uncertainty in hpc job queue time predictions,

K. Menear, C. Scully-Allison, and D. Duplyakin, “Quantifying uncertainty in hpc job queue time predictions,” inPractice and Experience in Advanced Research Computing 2024: Human Powered Computing, ser. PEARC ’24. New York, NY , USA: Association for Computing Machinery, 2024. [Online]. Available: https://doi.org/10.1145/3626203.3670627

work page doi:10.1145/3626203.3670627 2024
[65]

Scalable system scheduling for hpc and big data,

A. Reuther, C. Byun, W. Arcand, D. Bestor, B. Bergeron, M. Hubbell, M. Jones, P. Michaleas, A. Prout, A. Rosa, and J. Kepner, “Scalable system scheduling for hpc and big data,”Journal of Parallel and Distributed Computing, vol. 111, p. 76–92, Jan. 2018. [Online]. Available: http://dx.doi.org/10.1016/j.jpdc.2017.06.009

work page doi:10.1016/j.jpdc.2017.06.009 2018
[66]

Job placement advisor based on turnaround predictions for HPC hybrid clouds,

R. L. F. Cunha, E. R. Rodrigues, L. P. Tizzei, and M. A. S. Netto, “Job placement advisor based on turnaround predictions for HPC hybrid clouds,”Future Generation Computer System, vol. 67, pp. 35–46,
[67]

Available: https://doi.org/10.1016/j.future.2016.08.010

[Online]. Available: https://doi.org/10.1016/j.future.2016.08.010

work page doi:10.1016/j.future.2016.08.010 2016
[68]

JobPruner: A machine learning assistant for exploring parameter spaces in HPC applications,

B. Silva, M. A. S. Netto, and R. L. F. Cunha, “JobPruner: A machine learning assistant for exploring parameter spaces in HPC applications,” Future Generation Computer System, vol. 83, pp. 144–157, 2018. [Online]. Available: https://doi.org/10.1016/j.future.2018.02.002

work page doi:10.1016/j.future.2018.02.002 2018
[69]

Understanding hardware and software metrics with respect to power consumption,

J. Kunkel and M. F. Dolz, “Understanding hardware and software metrics with respect to power consumption,”Sustainable Computing: Informatics and System, vol. 17, pp. 43–54, 2018. [Online]. Available: https://doi.org/10.1016/j.suscom.2017.10.016

work page doi:10.1016/j.suscom.2017.10.016 2018
[70]

The mit super- cloud dataset,

S. Samsi, M. L. Weiss, D. Bestor, B. Li, M. Jones, A. Reuther, D. Edelman, W. Arcand, C. Byun, J. Holodnack, M. Hubbell, J. Kepner, A. Klein, J. McDonald, A. Michaleas, P. Michaleas, L. Milechin, J. Mullen, C. Yee, B. Price, A. Prout, A. Rosa, A. Vanterpool, L. McEvoy, A. Cheng, D. Tiwari, and V . Gadepally, “The mit super- cloud dataset,” in2021 IEEE Hig...

2021
[71]

Feedback-based resource allocation for batch scheduling of scientific workflows,

C. Witt, D. Wagner, and U. Leser, “Feedback-based resource allocation for batch scheduling of scientific workflows,” in2019 International Conference on High Performance Computing & Simulation (HPCS), 2019, pp. 761–768. [Online]. Available: https: //doi.org/10.1109/HPCS48598.2019.9188055

work page doi:10.1109/hpcs48598.2019.9188055 2019
[72]

Reinforcement learning based scheduling in a workflow management system,

A. M. Kintsakis, F. E. Psomopoulos, and P. A. Mitkas, “Reinforcement learning based scheduling in a workflow management system,” Engineering Applications of Artificial Intelligence, vol. 81, 2019. [Online]. Available: https://doi.org/10.1016/j.engappai.2019.04.005

work page doi:10.1016/j.engappai.2019.04.005 2019
[73]

Borg, omega, and kubernetes,

B. Burns, B. Grant, D. Oppenheimer, E. Brewer, and J. Wilkes, “Borg, omega, and kubernetes,”Communications of the ACM, vol. 59, no. 5, pp. 50–57, 2016. [Online]. Available: https://doi.org/10.1145/2898442

work page doi:10.1145/2898442 2016
[74]

How workflow engines should talk to resource managers: A proposal for a common workflow scheduling interface,

F. Lehmann, J. Bader, F. Tschirpke, L. Thamsen, and U. Leser, “How workflow engines should talk to resource managers: A proposal for a common workflow scheduling interface,” inCCGrid, 2023. [Online]. Available: https://doi.org/10.1109/CCGrid57682.2023.00025

work page doi:10.1109/ccgrid57682.2023.00025 2023
[75]

Building the world’s largest radio telescope: The square kilometre array science data processor,

J. S. Farnes, B. Mort, F. Dulwich, K. Ad ´amek, A. Brown, J. Novotny, S. Salvini, and W. Armour, “Building the world’s largest radio telescope: The square kilometre array science data processor,” in2018 IEEE 14th International Conference on e-Science (e-Science). IEEE,
[76]

Available: https://doi.org/10.1109/eScience.2018.00101

[Online]. Available: https://doi.org/10.1109/eScience.2018.00101

work page doi:10.1109/escience.2018.00101 2018
[77]

A job sizing strategy for high-throughput scientific workflows,

B. Tovar, R. F. da Silva, G. Juve, E. Deelman, W. Allcock, D. Thain, and M. Livny, “A job sizing strategy for high-throughput scientific workflows,”IEEE Transactions on Parallel and Distributed Systems, vol. 29, no. 2, pp. 240–253, 2018. [Online]. Available: https://doi.org/10.1109/TPDS.2017.2762310

work page doi:10.1109/tpds.2017.2762310 2018
[78]

The impact of more accurate requested runtimes on production job scheduling performance,

S.-H. Chiang, A. Arpaci-Dusseau, and M. K. Vernon, “The impact of more accurate requested runtimes on production job scheduling performance,” inJob Scheduling Strategies for Parallel Processing, ser. LNCS, vol. 2537. Springer Verlag, 2002, pp. 103–127. [Online]. Available: https://doi.org/10.1007/3-540-36180-4 7

work page doi:10.1007/3-540-36180-4 2002
[79]

Predicting application run times with historical information,

W. Smith, I. Foster, and V . Taylor, “Predicting application run times with historical information,”J. Parallel Distrib. Comput., vol. 64, no. 9, p. 1007–1016, Sep. 2004. [Online]. Available: https://doi.org/10.1016/j.jpdc.2004.06.008

work page doi:10.1016/j.jpdc.2004.06.008 2004
[80]

Machine learning for predictive analytics of compute cluster jobs,

D. Andresen, W. Hsu, H. Yang, and A. Okanlawon, “Machine learning for predictive analytics of compute cluster jobs,” 2018

2018
[81]

Towards energy-aware scheduling in data centers using machine learning,

J. L. Berral, I. Goiri, R. Nou, F. Juli `a, J. Guitart, R. Gavald `a, and J. Torres, “Towards energy-aware scheduling in data centers using machine learning,” inProceedings of the 1st International Conference on Energy-Efficient Computing and Networking - e-Energy ’10. ACM Press, 2010. [Online]. Available: https: //doi.org/10.1145/1791314.1791349

work page doi:10.1145/1791314.1791349 2010
[82]

Integrating Dynamic Pricing of Electricity into Energy Aware Scheduling for HPC Systems,

X. Yang, Z. Zhou, S. Wallace, Z. Lan, W. Tang, S. Coghlan, and M. E. Papka, “Integrating Dynamic Pricing of Electricity into Energy Aware Scheduling for HPC Systems,” inInternational Conference for High Performance Computing, Networking, Storage and Analysis, November 2013, pp. 17–22. [Online]. Available: https://doi.org/10.1145/2503210.2503264

work page doi:10.1145/2503210.2503264 2013
[83]

Reducing Energy Costs for IBM Blue Gene/P via Power-Aware Job Scheduling,

Z. Zhou, Z. Lan, W. Tang, and N. Desai, “Reducing Energy Costs for IBM Blue Gene/P via Power-Aware Job Scheduling,” inJob Scheduling Strategies for Parallel Processing, ser. Lecture Notes in Computer Science. Springer Berlin Heidelberg, 2014, pp. 96–115. [Online]. Available: https://doi.org/10.1007/978-3-662-43779-7 6

work page doi:10.1007/978-3-662-43779-7 2014

Showing first 80 references.