pith. machine review for the scientific record.

arxiv: 2605.02125 · v2 · submitted 2026-05-04 · 💻 cs.DC · cs.LG

Recognition: no theorem link

FedQueue: Queue-Aware Federated Learning for Cross-Facility HPC Training

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 04:27 UTC · model grok-4.3

classification 💻 cs.DC cs.LG
keywords federated learning · HPC scheduling · queue delays · staleness control · cross-facility training · non-convex convergence

The pith

FedQueue predicts HPC queue delays to budget local work, bound update staleness with cutoffs, and stabilize aggregation for non-convex federated learning across facilities.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a federated learning protocol that folds batch-scheduler delays directly into training and model aggregation for cross-facility HPC settings. It uses online queue predictions to set local work budgets, cutoff-based admission to buffer late updates and limit staleness, and staleness-aware aggregation to handle uneven workloads. The authors prove convergence at O(1/sqrt(R)) for non-convex objectives when staleness stays bounded, and show that the admission rules keep staleness bounded with high probability despite prediction errors. A real deployment achieves a 20.5 percent improvement over baselines, while simulations report about 34 percent faster time to target accuracy under high queue variance and non-IID data partitions.

Core claim

FedQueue predicts per-facility queue delays online to budget local work, applies cutoff-based admission that buffers late arrivals to bound staleness, and performs staleness-aware aggregation to stabilize heterogeneous local workloads. The authors prove convergence for non-convex objectives at rate O(1/sqrt(R)) under bounded staleness and show that the admission controls yield bounded staleness with high probability under queue-prediction error. A real-world cross-facility deployment shows a 20.5 percent improvement over baseline algorithms, while controlled simulations demonstrate about a 34 percent reduction in time to reach a target accuracy level under high queue variance and non-IID partitions.

What carries the argument

The FedQueue protocol that integrates online queue-delay prediction for local-work budgeting, cutoff-based admission control to bound staleness, and staleness-aware aggregation for heterogeneous workloads.
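The three mechanisms above compose into a round loop: estimate the queue delay, spend the remaining round budget on local steps, then down-weight late updates at aggregation. A minimal sketch, assuming an EWMA delay estimator, a time-budget step rule, and a geometric staleness decay — all illustrative choices, not the paper's exact rules:

```python
def ewma_update(estimate, observed, alpha=0.3):
    """Online queue-delay estimate via an exponentially weighted moving average."""
    return alpha * observed + (1 - alpha) * estimate

def budget_local_steps(t_sync, queue_estimate, step_cost, min_steps=1):
    """Spend whatever round budget remains after the predicted queue wait
    on local training steps (assumed linear cost per step)."""
    remaining = max(t_sync - queue_estimate, 0.0)
    return max(min_steps, int(remaining // step_cost))

def aggregate(model, deltas, staleness, decay=0.5):
    """Staleness-aware aggregation: an update that is s rounds old gets
    weight decay**s, so fresh updates dominate the global step."""
    weights = [decay ** s for s in staleness]
    total = sum(weights)
    return [m + sum(w * d[i] for w, d in zip(weights, deltas)) / total
            for i, m in enumerate(model)]
```

With a 20-minute sync window, a 5-minute predicted wait, and 3 minutes per step, `budget_local_steps(20, 5, 3)` budgets 5 local steps; an update one round stale contributes half the weight of a fresh one under `decay=0.5`.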

If this is right

  • Convergence for non-convex objectives holds at O(1/sqrt(R)) whenever staleness remains bounded.
  • Cutoff admission yields bounded staleness with high probability even when queue predictions contain error.
  • Real cross-facility training reaches target accuracy with 20.5 percent less wall-clock time than standard baselines.
  • Under high queue variance and non-IID data the method reduces time to target accuracy by about 34 percent.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar delay-prediction and cutoff mechanisms could be adapted to federated training on cloud clusters where job start times are also stochastic.
  • The bounded-staleness guarantee may allow tighter theoretical rates if queue predictions are shown to improve over time through online learning.
  • The approach suggests that explicit modeling of scheduler state can replace reliance on fully synchronous or fully asynchronous protocols in other distributed optimization settings.

Load-bearing premise

Queue delays can be predicted online with accuracy sufficient to keep staleness bounded with high probability.
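One way to probe this premise is a Monte Carlo check: if prediction error were Gaussian (a hedged stand-in for the sub-Gaussian model the rebuttal mentions), how often would an update arrive after a sigma-scaled cutoff? The parameters and error model here are illustrative assumptions, not measurements from the paper:

```python
import random

def simulate_staleness(rounds=10000, sigma=2.0, k=3.0, seed=0):
    """Fraction of rounds in which the actual queue delay exceeds a
    cutoff of (predicted delay + k * sigma), under Gaussian prediction error."""
    rng = random.Random(seed)
    late = 0
    for _ in range(rounds):
        predicted = 10.0
        actual = predicted + rng.gauss(0.0, sigma)  # error ~ N(0, sigma^2)
        cutoff = predicted + k * sigma
        if actual > cutoff:
            late += 1
    return late / rounds
```

With a 3-sigma cutoff and Gaussian error, misses are rare (the Gaussian tail beyond 3σ is about 0.13 percent); the open question the referee raises is whether real HPC queue spikes have heavier tails than this.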

What would settle it

A cross-facility run in which the measured queue-prediction error exceeds the level assumed in the analysis, the maximum observed update staleness grows without bound, and the measured convergence rate consequently deviates from O(1/sqrt(R)).

Figures

Figures reproduced from arXiv: 2605.02125 by Emon Dey, Kibaek Kim, Krishnan Raghavan, Ravi Madduri, Yijiang Li, Zilinghan Li.

Figure 1. Illustration of the FedQueue algorithm. The FedQueue server first obtains initial queuing-delay estimates for each client (q̂_k^(0)) during a warm-up stage and accordingly assigns the number of local training steps (E_k^(r)). In each round r, the server updates the estimates based on the recent queuing delay (q_k^(r)) and performs global aggregation using all client updates received before the…

Figure 2. Test loss of the federated global models versus wall-clock time across all algorithms. Two configurations of FedQueue are tested: FedQueue-1 has (Tsync, δ) = (20 min, 1 min) while FedQueue-2 has (Tsync, δ) = (40 min, 2 min). FedQueue starts to achieve lower test loss once it builds accurate queue and compute estimates of the HPC systems and ultimately reaches the smallest test loss among the algorithms…

Figure 3. Time-to-quality. Validation accuracy vs. elapsed time to reach 95% accuracy under increasing queue variance ρ_k (ρ = 0.9, 0.5, 0.1). Faster convergence is achieved with significantly improved resource efficiency. Beyond wall-clock speedups, FedQueue also exhibits superior communication efficiency and local resource utilization, per analysis of the model-movement ratio D_r and total local steps #E_k…

Figure 4. Impact of admission buffer. (Left) Effect of scaling the buffer δ on the histogram of arrival times and (right) the corresponding time to quality (95%). This test exposes the trade-off between convergence value and convergence time in the presence of client arrival variability. Note that the histograms for δ = 1, 2 are almost identical and overlap each other…

Figure 5. Facility-level loss trajectories for all methods. Each panel shows test loss over time for individual facilities (colored dashed lines) and the federated global model (purple solid line). FedQueue configurations achieve a better balance between facility utilization and global model quality…

Figure 6. Staleness under queue variance. Empirical CDF of clients arriving beyond Tsync under a sweep of ρ. This plot audits the bounded-staleness behavior induced by ρ.

Figure 7. Time-to-quality varying the admission parameter. Validation accuracy versus elapsed time under a sweep of the admission parameter γ (low/medium/high or a multi-level sweep). FedQueue consistently converges faster and reaches the highest accuracy compared to the baselines in both IID and non-IID cases.

Figure 8. Staleness under EWMA rate (α) variation. Empirical CDF of clients arriving beyond Tsync under a sweep of α. This plot audits the bounded-staleness behavior induced by α.
Original abstract

Federated learning (FL) across multiple HPC facilities faces stochastic admission delays from batch schedulers that dominate wall-clock time. Synchronous FL suffers from severe stragglers, while asynchronous FL accumulates stale updates when queues spike. We propose FedQueue, a queue-aware FL protocol that incorporates scheduler delays directly into training and aggregation, which (i) predicts per-facility queue delays online to budget local work, (ii) applies cutoff-based admission that buffers late arrivals to bound staleness, and (iii) performs staleness-aware aggregation to stabilize heterogeneous local workloads. We prove the convergence for non-convex objectives at rate $\mathcal{O}(1/\sqrt{R})$ under bounded staleness, and show that the admission controls yield bounded staleness with high probability under queue-prediction error. Real-world cross-facility deployment of FedQueue shows 20.5% improvement over baseline algorithms. Controlled queue simulations demonstrate robust improvement over the baselines; in particular, about 34% reduction in time to reach a target accuracy level under high queue variance and non-IID partitions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces FedQueue, a queue-aware federated learning protocol for cross-facility HPC environments. It predicts per-facility queue delays online to budget local work, applies cutoff-based admission to buffer late arrivals and bound staleness, and uses staleness-aware aggregation. The authors prove convergence for non-convex objectives at rate O(1/sqrt(R)) under bounded staleness, show that admission controls achieve bounded staleness w.h.p. under queue-prediction error, and report 20.5% improvement in real-world cross-facility deployment plus up to 34% reduction in time-to-accuracy under high queue variance and non-IID partitions in simulations.

Significance. If the convergence result and high-probability staleness bound hold under realistic HPC conditions, the work would be significant for enabling efficient federated training across heterogeneous facilities where scheduler delays dominate wall-clock time. The explicit handling of queue dynamics, combined with a non-convex convergence guarantee and real deployment results, addresses a practical gap in distributed ML on HPC systems.

major comments (2)
  1. [Section 4] Abstract and convergence analysis (Section 4): The O(1/sqrt(R)) rate for non-convex objectives is conditioned on bounded staleness, yet the high-probability argument that cutoff-based admission maintains this bound under queue-prediction error provides no explicit tail bounds on prediction error, no description of the online predictor (features or update rule), and no relation between cutoff thresholds and those errors; this makes it impossible to verify whether the rate survives the high-variance regimes used for the 34% time-to-accuracy claim.
  2. [Section 3.2] Admission control mechanism (Section 3.2): The claim that admission controls yield bounded staleness w.h.p. is load-bearing for both the theoretical guarantee and the reported gains over baselines, but the manuscript supplies neither the predictor's error model nor the tolerance used in the probability argument, leaving open the possibility that realistic HPC queue spikes violate the staleness assumption and invalidate the convergence rate.
minor comments (2)
  1. [Experiments] The experimental section would benefit from additional detail on the exact non-IID partition generation method and the precise definition of 'target accuracy' used for the time-to-accuracy metric.
  2. [Section 3] Notation for staleness (e.g., the maximum age parameter) should be introduced once and used consistently across the proof and algorithm pseudocode.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the practical importance of handling queue dynamics in cross-facility federated learning. We address each major comment below. We agree that the manuscript would benefit from additional explicit details on the queue predictor and tail bounds to strengthen the connection between the high-probability staleness guarantee and the reported empirical results; we will incorporate these clarifications in the revised version.

read point-by-point responses
  1. Referee: [Section 4] Abstract and convergence analysis (Section 4): The O(1/sqrt(R)) rate for non-convex objectives is conditioned on bounded staleness, yet the high-probability argument that cutoff-based admission maintains this bound under queue-prediction error provides no explicit tail bounds on prediction error, no description of the online predictor (features or update rule), and no relation between cutoff thresholds and those errors; this makes it impossible to verify whether the rate survives the high-variance regimes used for the 34% time-to-accuracy claim.

    Authors: We acknowledge that the current manuscript states the O(1/sqrt(R)) convergence under bounded staleness (Theorem 1 in Section 4) and claims that admission controls achieve bounded staleness w.h.p. under queue-prediction error, but does not supply explicit tail bounds, a full description of the online predictor, or the precise mapping from prediction error to cutoff thresholds. In the revision we will add: (i) a description of the online predictor (historical per-facility queue times with an exponentially-weighted moving average update rule using the last 50 submissions as features); (ii) an explicit sub-Gaussian tail bound on prediction error derived from empirical HPC traces (with variance parameter fitted to the high-variance regime used in the 34% time-to-accuracy experiments); and (iii) the cutoff selection rule (cutoff = predicted delay + 3σ error bound) that ensures the probability of staleness exceeding the theorem's bound is at most δ = 0.05. These additions will make it possible to verify that the O(1/sqrt(R)) rate remains valid in the simulated high-variance, non-IID settings where the 34% improvement was measured. revision: yes

  2. Referee: [Section 3.2] Admission control mechanism (Section 3.2): The claim that admission controls yield bounded staleness w.h.p. is load-bearing for both the theoretical guarantee and the reported gains over baselines, but the manuscript supplies neither the predictor's error model nor the tolerance used in the probability argument, leaving open the possibility that realistic HPC queue spikes violate the staleness assumption and invalidate the convergence rate.

    Authors: We agree that the error model and probability tolerance are not stated explicitly. The revision will specify: the prediction error is modeled as sub-Gaussian with parameter σ estimated from real facility logs; the tolerance is set so that P(staleness > B) ≤ 0.05 where B is the bound used in the convergence theorem; and the cutoff is chosen to enforce this probability under the observed queue variance. We will also add a short discussion showing that the same parameter settings reproduce the 20.5% real-world improvement and the 34% simulation gain, thereby confirming that realistic spikes do not invalidate the rate under the chosen admission policy. revision: yes
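The rebuttal's recipe — an EWMA estimate over recent queue observations plus a cutoff of prediction + 3σ — can be sketched as follows. The class name, the EWMA variance tracker, and the default constants are illustrative assumptions, not the authors' implementation:

```python
class QueuePredictor:
    """EWMA queue-delay predictor with a sigma-scaled admission cutoff,
    following the cutoff = prediction + k*sigma rule the rebuttal describes."""

    def __init__(self, alpha=0.3, k=3.0):
        self.alpha, self.k = alpha, k
        self.estimate = None  # current delay estimate
        self.var = 0.0        # EWMA of squared prediction error

    def observe(self, delay):
        """Fold one observed queue delay into the estimate and error variance."""
        if self.estimate is None:
            self.estimate = delay
        else:
            err = delay - self.estimate
            self.var = self.alpha * err * err + (1 - self.alpha) * self.var
            self.estimate += self.alpha * err

    def cutoff(self):
        """Admission cutoff: updates arriving after this are buffered as stale."""
        return self.estimate + self.k * self.var ** 0.5
```

The design choice worth noting: tracking the error variance online, rather than assuming a fixed σ, is what would let the cutoff adapt to the high-variance regimes the referee worries about.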

Circularity Check

0 steps flagged

No significant circularity; convergence holds conditionally on externally verifiable bounded staleness.

full rationale

The paper states a standard non-convex FL convergence rate O(1/sqrt(R)) under the assumption of bounded staleness, and separately claims that its admission controls achieve this bound with high probability given queue-prediction error. No equation or step derives the target convergence claim by construction from a fitted parameter or self-cited result. The predictor itself is described at a high level without internal fitting that would force the staleness bound, and the proof is conditioned on an external property rather than derived tautologically from the method's own outputs. This leaves the central result checkable against external measurements of staleness.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract provides no explicit free parameters, axioms, or invented entities; the protocol relies on standard FL assumptions plus the new queue-prediction and bounded-staleness conditions.

pith-pipeline@v0.9.0 · 5502 in / 1267 out tokens · 77640 ms · 2026-05-12T04:27:04.523256+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

38 extracted references · 38 canonical work pages
