arxiv: 2605.00426 · v1 · submitted 2026-05-01 · 💻 cs.CE

Recognition: unknown

A Study on the Resource Utilization and User Behavior on Titan Supercomputer

Sergio Iserte

Pith reviewed 2026-05-09 18:58 UTC · model grok-4.3

classification 💻 cs.CE

keywords resource utilization, user behavior, Titan supercomputer, HPC workload analysis, predictive model, seasonality, data clustering, neural networks

The pith

Titan supercomputer logs show how projects, jobs, nodes, GPUs and memory relate, plus seasonal patterns and a utilization forecast model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies records from the Titan supercomputer, including resource manager logs, GPU traces, and project information, to map how users request and consume nodes, GPUs, and memory. It applies correlations, clustering, and neural networks to uncover relationships among these elements and to track how usage shifts over time. The work identifies seasonal variations in resource demand and builds a model that predicts future utilization. These results aim to raise productivity on current systems and guide design of future exascale machines, and the same methods can be reused on other HPC clusters.

Core claim

Examination of Titan logs reveals patterns in workload distribution and resource usage. Data science methods demonstrate connections among projects, jobs, nodes, GPUs, and memory, expose seasonality in usage, and produce a predictive model for forecasting supercomputer utilization.

What carries the argument

Correlations, clustering, and neural networks applied to system logs, GPU traces, and project data to extract inter-resource relationships and generate utilization forecasts.

Load-bearing premise

The Titan logs contain enough unbiased information to reveal generalizable relationships through correlations, clustering, and neural networks without overfitting or unstated selection steps.

What would settle it

Running the same correlation, clustering, and neural network pipeline on a fresh set of Titan logs or on logs from another supercomputer and checking whether the identified relationships and forecast accuracy remain consistent.

Figures

Figures reproduced from arXiv: 2605.00426 by Sergio Iserte.

**Figure 1.** Correlation matrix after the data preprocessing.
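A correlation matrix of the kind shown in Figure 1 can be sketched in a few lines. The example below is a minimal, standard-library illustration of pairwise Pearson correlation. The column names (`nodes`, `gpus`, `max_memory_gb`) and all values are hypothetical, not taken from the Titan logs, and the paper may have used a different coefficient or library.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # correlation is undefined for a constant column
    return cov / math.sqrt(var_x * var_y)

def correlation_matrix(columns):
    """Return {(a, b): r} for every ordered pair of named columns."""
    names = list(columns)
    return {(a, b): pearson(columns[a], columns[b]) for a in names for b in names}

# Hypothetical per-job features; the numbers are made up for illustration.
jobs = {
    "nodes": [1, 2, 4, 8, 16],
    "gpus": [1, 2, 4, 8, 16],
    "max_memory_gb": [30, 8, 22, 5, 12],
}
matrix = correlation_matrix(jobs)
```

Entries close to 1 or -1 point to features that carry nearly the same information, which is one common reason to drop columns before clustering.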

**Figure 2.** Number of jobs assigned to each area. Among the 30 areas, four of them stand out with a higher number of jobs. These four predominant areas correspond to Chemistry (10), Lattice Gauge Theory (25), Materials Science (28), and Biophysics (8), respectively in the figure from left to right. Coincidentally, when using the elbow method to get an insight into the number of possible clusters in which the data cou…
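The elbow method mentioned in the caption compares the within-cluster sum of squares (inertia) for increasing numbers of clusters and looks for the point where the curve flattens. The following is a minimal sketch using plain Lloyd's k-means on made-up 2-D points. The helper names, the seed, and the toy data are assumptions for illustration and do not reproduce the paper's pipeline.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_point(cluster):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(cluster)
    return tuple(sum(coord) / n for coord in zip(*cluster))

def kmeans_inertia(points, k, iterations=50, seed=0):
    """Run Lloyd's k-means and return the within-cluster sum of squares."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        new_centroids = [
            mean_point(cluster) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return sum(min(sq_dist(p, c) for c in centroids) for p in points)

# Toy data: four tight groups of five points each (hypothetical).
centres = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0), (10.0, 10.0)]
offsets = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (-0.5, 0.0), (0.0, -0.5)]
points = [(cx + dx, cy + dy) for cx, cy in centres for dx, dy in offsets]

inertias = {k: kmeans_inertia(points, k) for k in range(1, 7)}
for k, value in inertias.items():
    print(k, round(value, 2))
```

Plotting `inertias` against `k` and reading off where the drop levels out gives the candidate number of clusters. Because Lloyd's algorithm depends on its starting centroids, several seeds are usually tried in practice.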

**Figure 3.** Count of run jobs (y-axis) per month (x-axis).

**Figure 4.** Seasonality per month aggregated by sum.

**Figure 5.** Seasonality per month aggregated by mean.

**Figure 6.** Seasonality per month aggregated by maximum.
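Figures 4 to 6 aggregate a monthly signal by sum, mean, and maximum. The sketch below shows the three aggregations on a plain group-by-calendar-month. It is only an illustration: the record layout and values are hypothetical, and it does not perform the STL seasonal-trend decomposition that the paper cites [8], so whether the figures aggregate raw values or a decomposed seasonal component is not assumed here.

```python
from collections import defaultdict
from datetime import date

def monthly_profile(records):
    """Group (date, value) records by calendar month and aggregate three ways."""
    by_month = defaultdict(list)
    for day, value in records:
        by_month[day.month].append(value)
    return {
        month: {
            "sum": sum(values),
            "mean": sum(values) / len(values),
            "max": max(values),
        }
        for month, values in sorted(by_month.items())
    }

# Hypothetical (job start date, node hours) pairs spanning two years.
records = [
    (date(2015, 1, 10), 100.0),
    (date(2016, 1, 20), 300.0),
    (date(2015, 7, 5), 50.0),
]
profile = monthly_profile(records)
```

The three aggregations answer different questions: the sum tracks total demand, the mean tracks typical job size, and the maximum tracks peak requests.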

**Figure 7.** Correlation matrix for aggregated data.

**Figure 8.** Evolution in time of the resources.

**Figure 9.** Neural network architecture. The presented model is compiled with the Adam optimizer [9] to update weights and biases within the network, and relies on the mean square error loss function $\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - y'_i)^2$ (Eq. 1). After normalizing, preparing the data for supervised learning, shuffling, and making partitions of the data (80% train and 20% test), the model is evaluated with a 0.4% error. In othe…
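The preparation steps listed in the Figure 9 caption (normalizing, framing the series for supervised learning, shuffling, and an 80/20 split) can be sketched without any framework. Everything specific below is an assumption for illustration: min-max scaling, a window of two past values, the seed, and the toy series. The persistence baseline at the end is not from the paper; it only gives the MSE function something to score.

```python
import random

def min_max_normalize(series):
    """Scale values to [0, 1]; a constant series maps to all zeros."""
    lo, hi = min(series), max(series)
    if hi == lo:
        return [0.0 for _ in series]
    return [(v - lo) / (hi - lo) for v in series]

def to_supervised(series, window):
    """Turn a series into (previous `window` values, next value) pairs."""
    return [
        (series[i:i + window], series[i + window])
        for i in range(len(series) - window)
    ]

def train_test_split(samples, train_fraction=0.8, seed=0):
    """Shuffle a copy of the samples, then cut it into train and test parts."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def mse(y_true, y_pred):
    """Mean square error, matching Eq. 1 in the caption."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical daily utilization values.
utilization = [10, 12, 15, 11, 9, 14, 18, 16, 13, 12, 17, 20]
scaled = min_max_normalize(utilization)
samples = to_supervised(scaled, window=2)
train, test = train_test_split(samples)

# Naive persistence baseline: predict the most recent value in each window.
targets = [y for _, y in test]
baseline = [x[-1] for x, _ in test]
baseline_mse = mse(targets, baseline)
```

Shuffling before the split follows the order given in the caption. For time series, a chronological split is a common alternative because it keeps test windows strictly after the training windows.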

Original abstract

Understanding HPC facilities users' behaviors and how computational resources are requested and utilized is not only crucial for the cluster productivity but also essential for designing and constructing future exascale HPC systems. This paper tackles Challenge 4, 'Analyzing Resource Utilization and User Behavior on Titan Supercomputer', of the 2021 Smoky Mountains Conference Data Challenge. Specifically, we dig deeper inside the records of Titan to discover patterns and extract relationships. This paper explores the workload distribution and usage patterns from resource manager system logs, GPU traces, and scientific areas information collected from the Titan supercomputer. Furthermore, we want to know how resource utilization and user behaviors change over time. Using data science methods, such as correlations, clustering, or neural networks, our findings allow us to investigate how projects, jobs, nodes, GPUs and memory are related. We provide insights about seasonality usage of resources and a predictive model for forecasting utilization of Titan Supercomputer. In addition, the described methodology can be easily adopted in other HPC clusters.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This is a competent data-challenge paper that applies standard clustering and a basic neural net to Titan logs for utilization patterns and forecasts, without new methods or broad theoretical claims.

The core of this work is an analysis of Titan supercomputer logs that pulls out relationships among projects, jobs, nodes, GPUs, and memory, plus seasonality trends and a simple utilization forecast. It does this by processing resource manager records and GPU traces, then applying correlations, k-means clustering with silhouette scores, and a feed-forward neural network trained on daily aggregates. The paper is clear on feature extraction and reports basic hold-out numbers for the forecaster, which makes the results traceable. The claim that the approach can transfer to other clusters is reasonable given the explicit steps.

What stands out is the concrete seasonality observations and the way they tie resource use to scientific areas, which could help operators spot recurring load patterns on similar machines. The data challenge framing keeps the scope practical rather than claiming fundamental advances.

The soft spots are proportionate to the contribution. The techniques are established ones from prior workload studies, so the value is in the Titan-specific application and the public dataset reuse rather than any algorithmic novelty. Validation stays at the level of silhouette scores and hold-out splits without deeper baselines or sensitivity checks on data exclusions. Generalizability is flagged as limited, which is accurate but caps how far the predictive model travels beyond this facility. No internal contradictions or hidden fitting issues appear in the methods.

This paper suits readers who manage or study large-scale HPC systems and want worked examples of log-driven insights. Someone building scheduling tools or exascale planning models could extract usable patterns from the figures and replication details. It is coherent on its own terms and engages the literature through the data challenge context. I would bring it to a reading group for discussion of practical HPC analytics. I would not cite it in my own work unless I needed the specific Titan seasonality numbers. It deserves peer review because the explicit methodology and public data make it a useful addition to the workload characterization record, even if revisions would mainly strengthen the validation sections.

Referee Report

1 major / 2 minor

Summary. The paper analyzes job logs, GPU traces, and scientific area data from the Titan supercomputer to explore workload distribution, relationships among projects/jobs/nodes/GPUs/memory, seasonality in resource usage, and a predictive model for utilization forecasting. It applies standard data-science techniques (correlations, k-means clustering with silhouette scores, and a simple feed-forward neural network) and positions the methodology as adaptable to other HPC systems. This is framed as a contribution to the 2021 Smoky Mountains Conference Data Challenge.

Significance. If the reported patterns and model hold, the work supplies concrete, dataset-specific insights into real-world HPC utilization that could aid schedulers and exascale design. The explicit feature extraction from traces and provision of basic hold-out numbers for the forecaster are positive; however, as an observational data-challenge study without parameter-free derivations or falsifiable predictions beyond the specific logs, its broader impact is incremental rather than transformative.

major comments (1)

[predictive model section] The predictive-model section provides only a high-level description of the feed-forward network and aggregated daily utilization; no architecture details (layers, hidden units), training hyperparameters, or exact hold-out metrics (e.g., MAE, RMSE on which time window) are stated. This directly limits assessment of whether the forecasting claim is supported beyond a baseline fit.

minor comments (2)

[data description] Clarify the exact time span and any filtering rules applied to the Titan logs before feature extraction; this would strengthen reproducibility.
[seasonality analysis] The abstract claims 'insights about seasonality' but the main text should explicitly link the clustering or correlation results to specific seasonal patterns with quantitative support.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comment below and agree that additional details are needed to strengthen the predictive model section.

Referee: [predictive model section] The predictive-model section provides only a high-level description of the feed-forward network and aggregated daily utilization; no architecture details (layers, hidden units), training hyperparameters, or exact hold-out metrics (e.g., MAE, RMSE on which time window) are stated. This directly limits assessment of whether the forecasting claim is supported beyond a baseline fit.

Authors: We agree with the referee that the predictive-model section in the current manuscript is described at a high level only. The text does not specify the feed-forward network architecture (number of layers or hidden units), training hyperparameters, or exact hold-out metrics such as MAE and RMSE for a defined time window. In the revised manuscript we will expand this section to include these details: the network structure, activation functions, optimizer and learning rate, number of epochs, and the precise performance numbers (MAE, RMSE) on the hold-out set together with the exact time window used for evaluation. This will allow readers to assess the forecasting results more rigorously.

revision: yes
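The hold-out metrics requested in this exchange, MAE and RMSE over a defined time window, are simple to compute once the window is fixed. The sketch below is a hedged illustration with made-up numbers. The function names and the 20% chronological hold-out are assumptions, not the manuscript's evaluation protocol.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean square error."""
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true))

def chronological_holdout(series, test_fraction=0.2):
    """Keep the last `test_fraction` of a time-ordered series for evaluation."""
    cut = len(series) - int(len(series) * test_fraction)
    return series[:cut], series[cut:]

# Hypothetical observed and forecast utilization over a hold-out window.
observed = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 8.0, 11.0]
window_mae = mae(observed, predicted)
window_rmse = rmse(observed, predicted)

# Splitting ten time-ordered values leaves the final two for testing.
history, holdout = chronological_holdout(list(range(10)))
```

Reporting both metrics together with the dates covered by `holdout` is what lets a reader compare the forecaster against a simple baseline.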

Circularity Check

0 steps flagged

No significant circularity; observational analysis only

The manuscript is an empirical study of Titan logs using standard data-science pipelines (feature extraction from job/GPU traces, k-means clustering with silhouette validation, and a simple feed-forward neural network trained on daily aggregates with hold-out evaluation). No equations, ansatzes, or uniqueness theorems are invoked. The reported correlations, seasonality patterns, and utilization forecasts are direct outputs of the applied algorithms on the input traces; they do not reduce to the inputs by construction or via self-citation chains. The work is therefore self-contained as observational analysis.
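The rationale above refers to silhouette validation of the k-means clusters. As a minimal sketch of what that score measures, the example below computes the mean silhouette coefficient from scratch on four made-up points. The function names and data are hypothetical, and the paper's own implementation is not specified here.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (needs at least 2 clusters)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        # (the point's zero distance to itself adds nothing to the sum).
        a = sum(euclidean(point, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(euclidean(point, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)

# Two well-separated pairs of points (hypothetical).
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_score(points, [0, 0, 1, 1])  # neighbours grouped together
bad = silhouette_score(points, [0, 1, 0, 1])   # neighbours split apart
```

Scores near 1 indicate compact, well-separated clusters, while negative scores suggest points sit closer to another cluster than to their own.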

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No explicit free parameters, axioms, or invented entities are described in the abstract; the analysis relies on standard statistical and machine-learning assumptions that are not enumerated.

Reference graph

Works this paper leans on

9 extracted references · 2 canonical work pages · 1 internal anchor

[1] T. Patel, Z. Liu, R. Kettimuthu, P. Rich, W. Allcock, D. Tiwari, Job Characteristics on Large-Scale Systems: Long-Term Analysis, Quantification, and Implications, in: SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, 2020, pp. 1–17.
[2] S. Dash, A. K. Paul, F. Wang, S. Oral, T. Integration, SMC Data Challenge 2021: Analyzing Resource Utilization and User Behavior on Titan Supercomputer (2021). URL https://smc-datachallenge.ornl.gov/wp-content/uploads/2021/05/C4-SMC DataChallenge 2021.pdf
[3] Top500 the list, https://www.top500.org, accessed: 2021-08-04.
[4] Oak Ridge National Laboratory, ORNL Debuts Titan Supercomputer (2012). URL https://www.olcf.ornl.gov/wp-content/themes/olcf/titan/Titan Debuts.pdf
[5] F. Wang, S. Oral, S. Sen, N. Imam, Learning from Five-year Resource-Utilization Data of Titan System, Proceedings - IEEE International Conference on Cluster Computing, ICCC 2019-Septe (2019). doi:10.1109/CLUSTER.2019.8891001
[6] G. Ostrouchov, D. Maxwell, R. A. Ashraf, C. Engelmann, M. Shankar, J. H. Rogers, GPU Lifetimes on Titan Supercomputer: Survival Analysis and Reliability, International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2020-November (2020).
[7] X. Jin, J. Han, K-Means Clustering, Springer US, Boston, MA, 2010, pp. 563–564.
[8] R. B. Cleveland, W. S. Cleveland, J. E. McRae, I. Terpenning, STL: A seasonal-trend decomposition, J. Off. Stat. 6 (1) (1990) 3–73.
[9] D. P. Kingma, J. L. Ba, Adam: A Method for Stochastic Optimization, 3rd International Conference on Learning Representations, ICLR 2015 - Conference Track Proceedings (2015) 1–15. arXiv:1412.6980.