pith. machine review for the scientific record.

arxiv: 2605.00426 · v1 · submitted 2026-05-01 · 💻 cs.CE

Recognition: unknown

A Study on the Resource Utilization and User Behavior on Titan Supercomputer

Sergio Iserte

Pith reviewed 2026-05-09 18:58 UTC · model grok-4.3

classification 💻 cs.CE
keywords resource utilization · user behavior · Titan supercomputer · HPC workload analysis · predictive model · seasonality · data clustering · neural networks

The pith

Titan supercomputer logs show how projects, jobs, nodes, GPUs and memory relate, plus seasonal patterns and a utilization forecast model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies records from the Titan supercomputer, including resource manager logs, GPU traces, and project information, to map how users request and consume nodes, GPUs, and memory. It applies correlations, clustering, and neural networks to uncover relationships among these elements and to track how usage shifts over time. The work identifies seasonal variations in resource demand and builds a model that predicts future utilization. These results aim to raise productivity on current systems and guide design of future exascale machines, and the same methods can be reused on other HPC clusters.

Core claim

Examination of Titan logs reveals patterns in workload distribution and resource usage. Data science methods demonstrate connections among projects, jobs, nodes, GPUs, and memory, expose seasonality in usage, and produce a predictive model for forecasting supercomputer utilization.

What carries the argument

Correlations, clustering, and neural networks applied to system logs, GPU traces, and project data to extract inter-resource relationships and generate utilization forecasts.
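As a minimal sketch of the correlation step, the snippet below computes a Pearson correlation matrix over per-job features with NumPy. The feature names and the synthetic data are assumptions made for illustration; they are not the fields or values of the Titan logs, and the paper's preprocessing is not reproduced here.

```python
import numpy as np

# Hypothetical per-job features; illustrative names, not the Titan log schema.
FEATURES = ["nodes", "gpu_seconds", "max_memory_gb"]


def correlation_matrix(records: np.ndarray) -> np.ndarray:
    """Pearson correlation between the columns of a (jobs x features) array."""
    return np.corrcoef(records, rowvar=False)


# Synthetic data: GPU time is tied to node count, memory is independent.
rng = np.random.default_rng(0)
nodes = rng.integers(1, 512, size=1000).astype(float)
gpu_seconds = 3600.0 * nodes + rng.normal(0.0, 1e4, size=1000)
max_memory_gb = rng.uniform(1.0, 32.0, size=1000)
records = np.column_stack([nodes, gpu_seconds, max_memory_gb])

corr = correlation_matrix(records)
for name, row in zip(FEATURES, corr):
    print(name, np.round(row, 2))
```

On this synthetic input the nodes/GPU-time entry comes out close to 1 and the memory entries close to 0, which is the kind of structure a correlation matrix like Figure 1 is meant to expose.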

Load-bearing premise

The Titan logs contain enough unbiased information to reveal generalizable relationships through correlations, clustering, and neural networks without overfitting or unstated selection steps.

What would settle it

Running the same correlation, clustering, and neural network pipeline on a fresh set of Titan logs or on logs from another supercomputer and checking whether the identified relationships and forecast accuracy remain consistent.

Figures

Figures reproduced from arXiv: 2605.00426 by Sergio Iserte.

Figure 1: Correlation matrix after the data preprocessing.
Figure 2: Number of jobs assigned to each area. Among the 30 areas, four of them stand out with a higher number of jobs. These four predominant areas correspond to Chemistry (10), Lattice Gauge Theory (25), Materials Science (28), and Biophysics (8), respectively in the figure from left to right. Coincidentally, when using the elbow method to get an insight into the number of possible clusters in which the data cou…
Figure 3: Count of run jobs (y-axis) per month (x-axis).
Figure 4: Seasonality per month aggregated by sum.
Figure 5: Seasonality per month aggregated by mean.
Figure 6: Seasonality per month aggregated by maximum.
Figure 7: Correlation matrix for aggregated data.
Figure 8: Evolution in time of the resources.
Figure 9: Neural network architecture. The presented model is compiled with the Adam optimizer [9] to update weights and biases within the network, and relies on the mean square error loss function: $\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - y'_i)^2$ (1). After normalizing, preparing the data for supervised learning, shuffling, and making partitions of the data (80% train and 20% test), the model is evaluated with a 0.4% error. In othe…
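The Figure 9 caption mentions a mean square error loss and a shuffled 80%/20% train/test partition. The sketch below illustrates only those two pieces. The helper names, the synthetic data, and the ordinary least-squares stand-in model are assumptions for illustration; the paper's neural network and Adam training are not reproduced.

```python
import numpy as np


def mse(y_true, y_pred):
    """Mean square error: average of the squared differences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))


def shuffled_split(x, y, train_fraction=0.8, seed=0):
    """Shuffle the sample indices, then cut them into train and test parts."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(x))
    cut = int(train_fraction * len(x))
    train, test = order[:cut], order[cut:]
    return x[train], y[train], x[test], y[test]


# Synthetic, already normalized data (illustrative only).
x = np.linspace(0.0, 1.0, 100).reshape(-1, 1)
y = 2.0 * x[:, 0] + 0.5
x_train, y_train, x_test, y_test = shuffled_split(x, y)

# Stand-in model: ordinary least squares instead of the paper's network.
design = np.column_stack([x_train[:, 0], np.ones(len(x_train))])
coef, *_ = np.linalg.lstsq(design, y_train, rcond=None)
y_pred = coef[0] * x_test[:, 0] + coef[1]
print(len(x_train), len(x_test), mse(y_test, y_pred))
```

Because the synthetic target is exactly linear, the stand-in model fits it almost perfectly; real utilization data would not behave this way.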
original abstract

Understanding HPC facilities users' behaviors and how computational resources are requested and utilized is not only crucial for the cluster productivity but also essential for designing and constructing future exascale HPC systems. This paper tackles Challenge 4, 'Analyzing Resource Utilization and User Behavior on Titan Supercomputer', of the 2021 Smoky Mountains Conference Data Challenge. Specifically, we dig deeper inside the records of Titan to discover patterns and extract relationships. This paper explores the workload distribution and usage patterns from resource manager system logs, GPU traces, and scientific areas information collected from the Titan supercomputer. Furthermore, we want to know how resource utilization and user behaviors change over time. Using data science methods, such as correlations, clustering, or neural networks, our findings allow us to investigate how projects, jobs, nodes, GPUs and memory are related. We provide insights about seasonality usage of resources and a predictive model for forecasting utilization of Titan Supercomputer. In addition, the described methodology can be easily adopted in other HPC clusters.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper analyzes job logs, GPU traces, and scientific area data from the Titan supercomputer to explore workload distribution, relationships among projects/jobs/nodes/GPUs/memory, seasonality in resource usage, and a predictive model for utilization forecasting. It applies standard data-science techniques (correlations, k-means clustering with silhouette scores, and a simple feed-forward neural network) and positions the methodology as adaptable to other HPC systems. This is framed as a contribution to the 2021 Smoky Mountains Conference Data Challenge.
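To make the clustering step in the summary concrete, here is a minimal NumPy sketch of k-means followed by silhouette-based selection of the number of clusters. Everything in it is an assumption for illustration: the data are four synthetic blobs, and the deterministic farthest-first seeding is a simplification chosen so the example is reproducible, not the initialization used in the paper.

```python
import numpy as np


def farthest_first_init(points, k):
    """Deterministic seeding: start at points[0], then repeatedly add the
    point farthest from the centroids chosen so far."""
    chosen = [0]
    nearest = np.linalg.norm(points - points[0], axis=1)
    for _ in range(1, k):
        idx = int(nearest.argmax())
        chosen.append(idx)
        nearest = np.minimum(nearest, np.linalg.norm(points - points[idx], axis=1))
    return points[chosen].copy()


def kmeans(points, k, iterations=100):
    """Lloyd's algorithm; returns one cluster label per point."""
    centroids = farthest_first_init(points, k)
    for _ in range(iterations):
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        updated = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(updated, centroids):
            break
        centroids = updated
    return labels


def silhouette(points, labels):
    """Mean silhouette coefficient; singleton clusters score 0."""
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    clusters = np.unique(labels)
    scores = []
    for i, own in enumerate(labels):
        same = labels == own
        if same.sum() == 1:
            scores.append(0.0)
            continue
        a = dists[i, same].sum() / (same.sum() - 1)  # mean distance within own cluster
        b = min(dists[i, labels == c].mean() for c in clusters if c != own)
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))


# Synthetic data: four well-separated blobs (illustrative only).
rng = np.random.default_rng(0)
centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0), (10.0, 10.0)]
points = np.vstack([rng.normal(c, 0.5, size=(50, 2)) for c in centers])

scores = {k: silhouette(points, kmeans(points, k)) for k in range(2, 7)}
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

The silhouette score peaks at the number of clusters that best separates the data, which is the role it plays as a validation criterion alongside the elbow method mentioned in the Figure 2 caption.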

Significance. If the reported patterns and model hold, the work supplies concrete, dataset-specific insights into real-world HPC utilization that could aid schedulers and exascale design. The explicit feature extraction from traces and provision of basic hold-out numbers for the forecaster are positive; however, as an observational data-challenge study without parameter-free derivations or falsifiable predictions beyond the specific logs, its broader impact is incremental rather than transformative.

major comments (1)
  1. [predictive model section] The predictive-model section provides only a high-level description of the feed-forward network and aggregated daily utilization; no architecture details (layers, hidden units), training hyperparameters, or exact hold-out metrics (e.g., MAE, RMSE on which time window) are stated. This directly limits assessment of whether the forecasting claim is supported beyond a baseline fit.
minor comments (2)
  1. [data description] Clarify the exact time span and any filtering rules applied to the Titan logs before feature extraction; this would strengthen reproducibility.
  2. [seasonality analysis] The abstract claims 'insights about seasonality' but the main text should explicitly link the clustering or correlation results to specific seasonal patterns with quantitative support.
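One way to give seasonality the quantitative support requested above is a per-month profile aggregated by sum, mean, and maximum, mirroring the aggregations named in Figures 4 to 6. The sketch below is a hypothetical illustration on made-up samples; whether the paper groups across years in this way or relies on a decomposition method is not stated in this review.

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def monthly_seasonality(samples):
    """Group (date, value) samples by calendar month across all years.

    Returns {month: {"sum": ..., "mean": ..., "max": ...}}.
    """
    by_month = defaultdict(list)
    for day, value in samples:
        by_month[day.month].append(value)
    return {
        month: {"sum": sum(values), "mean": mean(values), "max": max(values)}
        for month, values in sorted(by_month.items())
    }


# Synthetic daily node counts, purely illustrative.
samples = [
    (date(2015, 1, 10), 100),
    (date(2016, 1, 12), 300),
    (date(2015, 7, 1), 50),
    (date(2016, 7, 3), 70),
    (date(2016, 7, 9), 90),
]
profile = monthly_seasonality(samples)
for month, stats in profile.items():
    print(month, stats)
```

Comparing the three aggregates per month shows whether a seasonal peak comes from more jobs (sum), larger typical jobs (mean), or a few extreme jobs (max).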

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comment below and agree that additional details are needed to strengthen the predictive model section.

point-by-point responses
  1. Referee: [predictive model section] The predictive-model section provides only a high-level description of the feed-forward network and aggregated daily utilization; no architecture details (layers, hidden units), training hyperparameters, or exact hold-out metrics (e.g., MAE, RMSE on which time window) are stated. This directly limits assessment of whether the forecasting claim is supported beyond a baseline fit.

    Authors: We agree with the referee that the predictive-model section in the current manuscript is described at a high level only. The text does not specify the feed-forward network architecture (number of layers or hidden units), training hyperparameters, or exact hold-out metrics such as MAE and RMSE for a defined time window. In the revised manuscript we will expand this section to include these details: the network structure, activation functions, optimizer and learning rate, number of epochs, and the precise performance numbers (MAE, RMSE) on the hold-out set together with the exact time window used for evaluation. This will allow readers to assess the forecasting results more rigorously. revision: yes
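The hold-out metrics discussed in this exchange can be sketched as follows. The snippet computes MAE and RMSE on a chronological hold-out window and compares against a naive persistence baseline. The chronological split, the baseline, and the synthetic series are assumptions for illustration; the Figure 9 caption describes a shuffled 80%/20% partition instead, and no numbers from the paper are reproduced.

```python
import math


def chronological_holdout(series, test_fraction=0.2):
    """Split a time-ordered sequence so the last part is the test window."""
    cut = len(series) - max(1, int(round(test_fraction * len(series))))
    return series[:cut], series[cut:]


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean square error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Synthetic daily utilization (fraction of nodes busy), illustrative only.
utilization = [0.70, 0.72, 0.75, 0.71, 0.78, 0.80, 0.77, 0.79, 0.82, 0.76]
train, test = chronological_holdout(utilization)

# Naive persistence baseline: every forecast repeats the last training value.
baseline = [train[-1]] * len(test)
print("test window: indices", len(train), "to", len(utilization) - 1)
print("MAE:", mae(test, baseline), "RMSE:", rmse(test, baseline))
```

Reporting the same two metrics for a trained forecaster next to such a baseline would show whether the model improves on simply repeating the most recent value.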

Circularity Check

0 steps flagged

No significant circularity; observational analysis only

full rationale

The manuscript is an empirical study of Titan logs using standard data-science pipelines (feature extraction from job/GPU traces, k-means clustering with silhouette validation, and a simple feed-forward neural network trained on daily aggregates with hold-out evaluation). No equations, ansatzes, or uniqueness theorems are invoked. The reported correlations, seasonality patterns, and utilization forecasts are direct outputs of the applied algorithms on the input traces; they do not reduce to the inputs by construction or via self-citation chains. The work is therefore self-contained as observational analysis.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No explicit free parameters, axioms, or invented entities are described in the abstract; the analysis relies on standard statistical and machine-learning assumptions that are not enumerated.

pith-pipeline@v0.9.0 · 5465 in / 1113 out tokens · 27855 ms · 2026-05-09T18:58:59.020029+00:00 · methodology

Reference graph

Works this paper leans on

9 extracted references · 2 canonical work pages · 1 internal anchor

  1. [1]

    T. Patel, Z. Liu, R. Kettimuthu, P. Rich, W. Allcock, D. Tiwari, Job Characteristics on Large-Scale Systems: Long-Term Analysis, Quantification, and Implications, in: SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, 2020, pp. 1–17

  2. [2]

    S. Dash, A. K. Paul, F. Wang, S. Oral, T. Integration, SMC Data Challenge 2021: Analyzing Resource Utilization and User Behavior on Titan Supercomputer (2021). URL https://smc-datachallenge.ornl.gov/wp-content/uploads/2021/05/C4-SMC DataChallenge 2021.pdf

  3. [3]

    Top500 the list, https://www.top500.org, accessed: 2021-08-04

  4. [4]

    Oak Ridge National Laboratory, ORNL Debuts Titan Supercomputer (2012). URL https://www.olcf.ornl.gov/wp-content/themes/olcf/titan/Titan Debuts.pdf

  5. [5]

    F. Wang, S. Oral, S. Sen, N. Imam, Learning from Five-year Resource-Utilization Data of Titan System, Proceedings - IEEE International Conference on Cluster Computing, ICCC 2019-Septe (2019). doi:10.1109/CLUSTER.2019.8891001

  6. [6]

    G. Ostrouchov, D. Maxwell, R. A. Ashraf, C. Engelmann, M. Shankar, J. H. Rogers, GPU Lifetimes on Titan Supercomputer: Survival Analysis and Reliability, International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2020-November (2020)

  7. [7]

    X. Jin, J. Han, K-Means Clustering, Springer US, Boston, MA, 2010, pp. 563–564

  8. [8]

    R. B. Cleveland, W. S. Cleveland, J. E. McRae, I. Terpenning, STL: A seasonal-trend decomposition procedure based on loess, J. Off. Stat. 6 (1) (1990) 3–73

  9. [9]

    D. P. Kingma, J. L. Ba, Adam: A Method for Stochastic Optimization, 3rd International Conference on Learning Representations, ICLR 2015 - Conference Track Proceedings (2015) 1–15. arXiv:1412.6980