StreamGuard: Low-Overhead Resilience for Real-time HPC Data Streams

Amal Gueroudji; Bogdan Nicolae; Hai Duc Nguyen; Ian Foster; Kyle Chard; Matthieu Dorier; Tekin Bicer

arxiv: 2606.30848 · v1 · pith:LY4RGFILnew · submitted 2026-06-29 · 💻 cs.DC

StreamGuard: Low-Overhead Resilience for Real-time HPC Data Streams

Hai Duc Nguyen , Bogdan Nicolae , Tekin Bicer , Amal Gueroudji , Matthieu Dorier , Kyle Chard , Ian Foster This is my paper

Pith reviewed 2026-07-01 01:20 UTC · model grok-4.3

classification 💻 cs.DC

keywords resiliencecheckpointingload redistributionHPC data streamsreal-time workflowsfault toleranceproducer-consumer patternperformance anomalies

0 comments

The pith

Two mechanisms keep real-time HPC data streams progressing through failures with under 1% normal overhead.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to protect producer-consumer streaming workflows, a core part of scientific data processing, from hardware faults, network issues, and performance slowdowns that break real-time deadlines. It does this by adding a checkpointing method that saves state without pausing work and a redistribution method that spots lagging tasks and moves them to faster workers. A sympathetic reader would care because these steps let the workflow keep producing results on time even when the underlying machines are unreliable. If the techniques work as described, they deliver up to six times less slowdown during problems while adding almost no cost when the system runs normally.

Core claim

The paper claims that a dynamic asynchronous non-blocking checkpointing mechanism together with a progress-aware load redistribution strategy maintains forward progress and balanced execution for the producer-consumer streaming pattern even in highly error-prone environments, reducing the impact of failures and performance anomalies by up to 6x while adding less than 1% overhead during failure-free runs.

What carries the argument

The pair of dynamic asynchronous non-blocking checkpointing that preserves progress without interrupting computation, plus progress-aware load redistribution that detects slow workers and rebalances tasks.

If this is right

Real-time constraints remain satisfied because forward progress continues despite faults.
The producer-consumer pattern stays balanced, preventing single slow workers from stalling the entire stream.
Overall workflow output quality stays high with minimal extra resource use when no failures happen.
The same mechanisms apply directly to other continuous data-stream scientific workflows built on the same pattern.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach could be tested on workflows that use multiple streaming patterns at once to see whether the two techniques still compose cleanly.
In very large clusters the redistribution logic might need extra coordination to avoid creating new bottlenecks during rebalancing.
The low overhead opens the possibility of always-on resilience even on systems where failures are rare but expensive when they occur.

Load-bearing premise

That the checkpointing can save state without any interruption to computation and that the redistribution can reliably detect and correct slow workers even when many failures occur at once.

What would settle it

Measure the observed slowdown factor under injected failures and the overhead percentage in clean runs; if the slowdown reduction stays below 6x or the clean-run overhead exceeds 1%, the central performance claim does not hold.

Figures

Figures reproduced from arXiv: 2606.30848 by Amal Gueroudji, Bogdan Nicolae, Hai Duc Nguyen, Ian Foster, Kyle Chard, Matthieu Dorier, Tekin Bicer.

**Figure 2.** Figure 2: Resilience solutions: (i) Checkpoint and retry for [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗

**Figure 4.** Figure 4: StreamGuard implementation for an individual [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

**Figure 5.** Figure 5: The Tomographic Reconstruction Workflow 7 Evaluation 7.1 Methodology Objectives. The evaluation assesses whether StreamGuard improves reliability in scientific streaming workflows without compromising performance. Our methodology is designed to isolate the impact of resilience mechanisms from application behavior and system noise, allowing us to evaluate both effectiveness and efficiency under controlle… view at source ↗

**Figure 6.** Figure 6: Dynamic checkpointing efficiency. StreamGuard maintains processing time close to ideal execution (black dashed [PITH_FULL_IMAGE:figures/full_fig_p009_6.png] view at source ↗

**Figure 7.** Figure 7: Non-blocking asynchronous checkpoint efficiency. StreamGuard allows each consumer (reconstruction worker) to [PITH_FULL_IMAGE:figures/full_fig_p009_7.png] view at source ↗

**Figure 8.** Figure 8: Dynamic load-balancing efficiency. Load rebalancing based on worker speed and per-partition progress significantly [PITH_FULL_IMAGE:figures/full_fig_p010_8.png] view at source ↗

**Figure 9.** Figure 9: Resilience robustness. StreamGuard maintains acceptable processing time and satisfies all real-time deadlines across [PITH_FULL_IMAGE:figures/full_fig_p010_9.png] view at source ↗

read the original abstract

Real-time scientific workflows operate on continuous data streams and must produce timely, high-quality results despite executing on complex, failure-prone infrastructure. Hardware faults, network disruptions, and performance anomalies caused by resource contention or system heterogeneity can severely degrade performance and violate real-time constraints. We focus on strengthening the resilience of the producer-consumer streaming pattern, a fundamental building block of scientific streaming workflows. We present two complementary techniques: (i) a dynamic, asynchronous, non-blocking checkpointing mechanism that preserves progress without interrupting computation, and (ii) a progress-aware load redistribution strategy that detects slow workers and proactively rebalances tasks. Together, these mechanisms maintain forward progress and balanced execution even in highly error-prone environments. Experimental results show that our approach reduces the impact of failures and performance anomalies by up to 6x, while introducing less than 1% overhead in failure-free execution.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

StreamGuard applies familiar checkpointing and load-balancing ideas to producer-consumer streams in real-time HPC and reports up to 6x resilience gains with under 1% overhead, but the abstract supplies no experimental details so the numbers cannot be checked.

read the letter

The main takeaway is that this paper takes two established distributed-systems techniques—dynamic asynchronous non-blocking checkpointing and progress-aware task redistribution—and targets them at the producer-consumer pattern common in scientific streaming workflows. The abstract positions the combination as a way to keep forward progress on failure-prone hardware without stopping computation or adding much cost in the normal case.

What the work does is identify a concrete setting where real-time constraints meet heterogeneous, error-prone resources and then describe mechanisms that try to handle both hardware faults and performance anomalies. The non-blocking checkpoint aims to preserve state without interrupting the stream, while the load redistribution watches progress to move work away from slow nodes. If those mechanisms actually deliver the claimed behavior, the result would be a practical engineering contribution for people running continuous data pipelines on HPC clusters.

The clear limitation is that the abstract states quantitative outcomes—6x reduction in failure impact and less than 1% overhead—yet gives no information on workloads, failure models, baselines, measurement methods, or statistical treatment. Without those details it is impossible to know whether the numbers are robust or whether the techniques introduce hidden costs under realistic contention. The assumption that the mechanisms remain effective in highly error-prone environments therefore rests on unshown evidence.

This paper is aimed at distributed-systems and HPC practitioners who already manage streaming workflows and need resilience options that do not break real-time guarantees. A reader in that group could extract implementation sketches, but the lack of visible experimental grounding limits how much weight the claims can carry right now.

I would send the manuscript to peer review because the problem is relevant and the approach is straightforward; a referee could check whether the full experimental section actually supports the headline numbers. On the current text alone the evidence is too thin to form a firm opinion.

Referee Report

1 major / 0 minor

Summary. The manuscript proposes StreamGuard for resilience in real-time HPC data streams. It introduces two techniques: (i) dynamic asynchronous non-blocking checkpointing that preserves progress without interrupting computation, and (ii) progress-aware load redistribution that detects slow workers and rebalances tasks. The central empirical claim is that the combined approach reduces the impact of failures and performance anomalies by up to 6x while introducing less than 1% overhead in failure-free execution.

Significance. Resilience for continuous data streams on failure-prone HPC infrastructure addresses a practical need in scientific workflows. If the low-overhead claims are substantiated by rigorous experiments, the work could provide a useful building block for producer-consumer streaming patterns.

major comments (1)

[Abstract] Abstract: The manuscript asserts specific quantitative experimental results (up to 6x reduction in failure impact, <1% overhead) but supplies no details on benchmarks, workloads, failure models, or statistical methods. This prevents any determination of whether the data support the central claim.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their review and for identifying the need for clearer linkage between the abstract claims and the supporting experimental details. We address the single major comment below.

read point-by-point responses

Referee: [Abstract] Abstract: The manuscript asserts specific quantitative experimental results (up to 6x reduction in failure impact, <1% overhead) but supplies no details on benchmarks, workloads, failure models, or statistical methods. This prevents any determination of whether the data support the central claim.

Authors: The abstract is a concise summary by design. The full manuscript supplies the requested details in the body: Section 4 specifies the benchmarks (synthetic and real scientific streaming workloads with given data rates and task granularities), Section 5 describes the failure models (transient node crashes, network delays, and contention-induced slowdowns) together with the measurement protocol (repeated trials, mean and standard deviation, 95% confidence intervals). These sections directly support the quantitative claims. We therefore see no need to alter the manuscript on this point, though we can add a one-sentence pointer from the abstract to Section 4 if the editor prefers. revision: no

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper describes two resilience techniques for streaming workflows and supports its claims solely with experimental measurements (up to 6x impact reduction, <1% overhead). No equations, derivations, fitted parameters, or load-bearing self-citations appear in the abstract or described content. The central results are empirical outcomes rather than any claimed first-principles reduction, so the derivation chain is empty and the work is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no equations, parameters, or background assumptions that can be audited; all ledger fields remain empty.

pith-pipeline@v0.9.1-grok · 5694 in / 1074 out tokens · 42175 ms · 2026-07-01T01:20:40.885493+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

61 extracted references · 3 canonical work pages · 1 internal anchor

[1]

Ganesh Ananthanarayanan, Srikanth Kandula, Albert Greenberg, Ion Stoica, Yi Lu, Bikas Saha, and Edward Harris. 2010. Reining in the outliers in{Map-Reduce} clusters using mantri. InOSDI’10: The 9th USENIX Symposium on Operating Systems Design and Implementation. Vancouver, Canada, 265–278

2010
[2]

Apache Flink. [n. d.]. Checkpoints. https://nightlies.apache.org/flink/flink-docs- stable/docs/ops/state/checkpoints/. Apache Flink Documentation, accessed: 2026-02-09

2026
[3]

Michael Armbrust, Tathagata Das, Joseph Torres, Burak Yavuz, Shixiong Zhu, Reynold Xin, Ali Ghodsi, Ion Stoica, and Matei Zaharia. 2018. Structured stream- ing: A declarative api for real-time applications in apache spark. InSIGMOD’18: The 2018 ACM SIGMOD International Conference on Management of Data. Houston, USA, 601–613

2018
[4]

Anakha V Babu, Tao Zhou, Saugat Kandel, Tekin Bicer, Zhengchun Liu, William Judge, Daniel J Ching, Yi Jiang, Sinisa Veseli, Steven Henke, et al. 2023. Deep learning at the edge enables real-time streaming ptychographic imaging.Nature Communications14, 1 (2023), 7059

2023
[5]

Yadu Babuji, Anna Woodard, Zhuozhao Li, Daniel S Katz, Ben Clifford, Rohan Kumar, Lukasz Lacinski, Ryan Chard, Justin M Wozniak, Ian Foster, et al. 2019. Parsl: Pervasive parallel programming in python. InHPDC’19: The 28th Inter- national Symposium on High-Performance Parallel and Distributed Computing. Phoenix, USA, 25–36

2019
[6]

GAUDI& Barrand, I Belyaev, P Binko, M Cattaneo, R Chytracek, G Corti, M Frank, G Gracia, J Harvey, Eric Van Herwijnen, et al. 2001. GAUDI—A software archi- tecture and framework for building HEP data processing applications.Computer physics communications140, 1-2 (2001), 45–55

2001
[7]

Leonardo Bautista-Gomez, Seiji Tsuboi, Dimitri Komatitsch, Franck Cappello, Naoya Maruyama, and Satoshi Matsuoka. 2011. FTI: High Performance Fault Tolerance Interface for Hybrid Systems. InSC’11: The International Conference for High Performance Computing, Networking, Storage and Analysis. Seattle, USA, 32:1–32:32

2011
[8]

Anne Benoit, Yishu Du, Thomas Herault, Loris Marchal, Guillaume Pallez, Lucas Perotin, Yves Robert, Hongyang Sun, and Frédéric Vivien. 2022. Checkpointing à la Young/Daly: An overview. InIC3’22: The 14th International Conference on Contemporary Computing. Noida, India, 701–710

2022
[9]

Anne Benoit, Lucas Perotin, Yves Robert, and Frédéric Vivien. 2024. Checkpoint- ing Strategies to Tolerate Non-Memoryless Failures on HPC Platforms.ACM Trans. Parallel Comput.11, 1 (2024), 1:1–1:26

2024
[10]

Tekin Bicer, Doğa Gürsoy, Vincent De Andrade, Rajkumar Kettimuthu, William Scullin, Francesco De Carlo, and Ian T Foster. 2017. Trace: A high-throughput tomographic reconstruction engine for large-scale datasets.Advanced Structural and Chemical Imaging3 (2017), 1–10

2017
[11]

Tekin Bicer, Wei Jiang, and Gagan Agrawal. 2010. Supporting fault tolerance in a data-intensive computing middleware. InIPDPS’10: The 24th IEEE International Symposium on Parallel and Distributed Processing. Atlanta, USA, 1–12

2010
[12]

Tekin Bicer, Viktor Nikitin, Selin Aslan, Doga Gürsoy, Rajkumar Kettimuthu, and Ian T Foster. 2020. Tomographic Reconstruction of Dynamic Features with Streaming Sliding Subsets. InXLOOP’20: The 2nd IEEE/ACM Annual Workshop on Extreme-Scale Experiment-in-the-Loop Computing. Virtual Event, 8–15

2020
[13]

Franck Cappello, Al Geist, William Gropp, Sanjay Kale, Bill Kramer, and Marc Snir. 2014. Toward exascale resilience: 2014 update.Supercomputing Frontiers and Innovations: An International Journal1, 1 (2014), 5–28

2014
[14]

Paris Carbone, Asterios Katsifodimos, Stephan Ewen, Volker Markl, Seif Haridi, and Kostas Tzoumas. 2015. Apache flink: Stream and batch processing in a single engine.The Bulletin of the Technical Committee on Data Engineering38, 4 (2015), 28–38

2015
[15]

K Mani Chandy and Leslie Lamport. 1985. Distributed snapshots: Determining global states of distributed systems.ACM Transactions on Computer Systems (TOCS)3, 1 (1985), 63–75

1985
[16]

John T Daly. 2006. A higher order estimate of the optimum checkpoint interval for restart dumps.Future Generation Computer Systems22, 3 (2006), 303–312

2006
[17]

Jeffrey Dean and Luiz André Barroso. 2013. The tail at scale.Commun. ACM56, 2 (2013), 74–80

2013
[18]

Jeffrey Dean and Sanjay Ghemawat. 2008. MapReduce: simplified data processing on large clusters.Commun. ACM51, 1 (2008), 107–113

2008
[19]

Martin Dierolf, Andreas Menzel, Pierre Thibault, Philipp Schneider, Cameron M Kewish, Roger Wepf, Oliver Bunk, and Franz Pfeiffer. 2010. Ptychographic X-ray computed tomography at the nanoscale.Nature467, 7314 (2010), 436–439

2010
[20]

2015.Fault tolerance techniques for high-performance computing

Jack Dongarra, Thomas Herault, and Yves Robert. 2015.Fault tolerance techniques for high-performance computing. Springer

2015
[21]

Matthieu Dorier, Amal Gueroudji, Valérie Hayot-Sasson, Hai Duc Nguyen, Seth Ockerman, Renan Souza, Tekin Bicer, Haochen Pan, Philip Carns, Kyle Chard, et al
[22]

Toward a persistent event-streaming system for high-performance com- puting applications.Frontiers in High Performance Computing3 (2025), 1638203

2025
[23]

Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. 2024. The Llama 3 herd of models.arXiv preprint arXiv:2407.21783(2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[24]

Kurt Ferreira, Jon Stearley, James H Laros III, Ron Oldfield, Kevin Pedretti, Ron Brightwell, Rolf Riesen, Patrick G Bridges, and Dorian Arnold. 2011. Evaluating the viability of process replication reliability for exascale systems. InSC’11: The International Conference for High Performance Computing, Networking, Storage and Analysis. Seattle, USA, 1–12

2011
[25]

Michael Goodrich, Carl Timmer, Vardan Gyurjyan, David Lawrence, Graham Heyes, Yatish Kumar, and Stacey Sheldon. 2023. ESnet/JLab FPGA Accelerated Transport.IEEE Transactions on Nuclear Science70, 6 (2023), 1096–1101

2023
[26]

Google Cloud. 2026. Google Cloud Dataflow. https://cloud.google.com/products/ dataflow. Accessed: 2026-02-09

2026
[27]

Gossman, Bogdan Nicolae, and Jon C

Mikaila J. Gossman, Bogdan Nicolae, and Jon C. Calhoun. 2024. Scalable I/O aggregation for asynchronous multi-level checkpointing.Future Generation Computer Systems160 (2024), 420–432

2024
[28]

Amal Gueroudji, Matthieu Dorier, Philip Carns, Parth Patel, Tekin Bicer, Robert Latham, Robert Ross, Kyle Chard, and Ian Foster. 2025. RESILIO: A Scalable and Composable Architecture for Tomographic Reconstruction Workflows. InSC- W’25: Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis. St. Louis, U...

2025
[29]

Salman Habib, Vitali Morozov, Nicholas Frontiere, Hal Finkel, Adrian Pope, and Katrin Heitmann. 2013. HACC: Extreme scaling and performance across diverse architectures. InSC’13: The International Conference for High Performance Computing, Networking, Storage and Analysis. Denver, USA, 1–10

2013
[30]

Peng Huang, Chuanxiong Guo, Jacob R Lorch, Lidong Zhou, and Yingneng Dang
[31]

InHotOS’17: The 16th Workshop on Hot Topics in Operating Systems

Gray failure: The Achilles’ heel of cloud-scale systems. InHotOS’17: The 16th Workshop on Hot Topics in Operating Systems. Whistler, Canada, 150–155
[32]

Eliu Antonio Huerta, Gabrielle Allen, Igor Andreoni, Javier M Antelis, Eti- enne Bachelet, G Bruce Berriman, Federica B Bianco, Rahul Biswas, Matias Carrasco Kind, Kyle Chard, et al . 2019. Enabling real-time multi-messenger astrophysics discoveries with deep learning.Nature Reviews Physics1, 10 (2019), 600–608

2019
[33]

Yongho Kim, Seongha Park, Sean Shahkarami, Rajesh Sankaran, Nicola Ferrier, and Pete Beckman. 2022. Goal-driven scheduling model in edge computing for smart city applications.J. Parallel and Distrib. Comput.167 (2022), 97–108

2022
[34]

Sanjeev Kulkarni, Nikunj Bhagat, Maosong Fu, Vikas Kedigehalli, Christopher Kellogg, Sailesh Mittal, Jignesh M Patel, Karthik Ramasamy, and Siddarth Taneja
[35]

InSIGMOD’15: The 2015 ACM SIGMOD International Conference on Management of Data

Twitter heron: Stream processing at scale. InSIGMOD’15: The 2015 ACM SIGMOD International Conference on Management of Data. Melbourne, Australia, 239–250

2015
[36]

Yifei Li, Ryan Chard, Yadu Babuji, Kyle Chard, Ian Foster, and Zhuozhao Li. 2024. UniFaaS: Programming across Distributed Cyberinfrastructure with Federated Function Serving. InIPDPS’24: The 38th IEEE International Parallel and Distributed Processing Symposium. San Francisco, USA, 217–229

2024
[37]

Gang Liu, Zeting Wang, Amelie Chi Zhou, and Rui Mao. 2024. Adaptive key par- titioning in distributed stream processing.CCF Transactions on High Performance Computing6, 2 (2024), 164–178

2024
[38]

Zhengchun Liu, Tekin Bicer, Rajkumar Kettimuthu, Doga Gursoy, Francesco De Carlo, and Ian Foster. 2020. TomoGAN: Low-dose synchrotron x-ray tomog- raphy with generative adversarial networks.JOSA A37, 3 (2020), 422–434

2020
[39]

Pengqi Lu, Yue Yue, Liang Yuan, and Yunquan Zhang. 2021. AutoFlow: hotspot- aware, dynamic load balancing for distributed stream processing. InICA3PP’21: The 21st International Conference on Algorithms and Architectures for Parallel Processing. Virtual Event, 133–151

2021
[40]

Smart, Emanuele Danovaro, Tiago Quintino, Dean Hildebrand, and Adrian Jackson

Nicolau Manubens, Johann Lombardi, Simon D. Smart, Emanuele Danovaro, Tiago Quintino, Dean Hildebrand, and Adrian Jackson. 2024. Exploring DAOS Interfaces and Performance. arXiv:2409.18682 [cs.DC]

work page arXiv 2024
[41]

Mustafa Rafique, Franck Cappello, and Bogdan Nicolae

Avinash Maurya, M. Mustafa Rafique, Franck Cappello, and Bogdan Nicolae. 2023. Towards Efficient I/O Pipelines using Accumulated Compression. InHIPC’23: 30th IEEE International Conference on High Performance Computing, Data, and Analytics. Goa, India, 256–265

2023
[42]

Mohror, A

K. Mohror, A. Moody, G. Bronevetsky, and B. R. de Supinski. 2014. Detailed Modeling and Evaluation of a Scalable Multilevel Checkpointing System.IEEE Transactions on Parallel and Distributed Systems25, 09 (2014), 2255–2263

2014
[43]

Hai Duc Nguyen, Tekin Bicer, Bogdan Nicolae, Rajkumar Kettimuthu, Eliu A Huerta, and Ian T Foster. 2025. Resilient execution of distributed X-ray image analysis workflows.Frontiers in High Performance Computing3 (2025), 1550855

2025
[44]

Hai Duc Nguyen, Zhifei Yang, and Andrew A Chien. 2021. Motivating high per- formance serverless workloads. InHiPS’21: The 1st Workshop on High Performance Serverless Computing. Virtual Event, 25–32

2021
[45]

Bogdan Nicolae, Adam Moody, Elsa Gonsiorowski, Kathryn Mohror, and Franck Cappello. 2019. VeloC: Towards High Performance Adaptive Asynchronous Checkpointing at Large Scale. InIPDPS’19: The 2019 IEEE International Parallel and Distributed Processing Symposium. Rio de Janeiro, Brazil, 911–920

2019
[46]

Bogdan Nicolae, Adam Moody, Greg Kosinovsky, Kathryn Mohror, and Franck Cappello. 2021. VELOC: VEry Low Overhead Checkpointing in the Age of Exascale. InSuperCheck’21: The First International Symposium on Checkpointing for Supercomputing. Virtual Event

2021
[47]

Bogdan Nicolae, Justin M Wozniak, Tekin Bicer, Hai Nguyen, Parth Patel, Haochen Pan, Amal Gueroudji, Maxime Gonthier, Valerie Hayot-Sasson, Eliu Huerta, et al. 12
[48]

Ine-Science’24: The 20th IEEE International Conference on e-Science

Diaspora: Resilience-enabling services for real-time distributed workflows. Ine-Science’24: The 20th IEEE International Conference on e-Science. Osaka, Japan, 1–9
[49]

Shadi A Noghabi, Kartik Paramasivam, Yi Pan, Navina Ramesh, Jon Bringhurst, Indranil Gupta, and Roy H Campbell. 2017. Samza: stateful scalable stream processing at LinkedIn.Proceedings of the VLDB Endowment10, 12 (2017), 1634– 1645

2017
[50]

Norbert Podhorszki et al. 2020. ADIOS 2: The Adaptable Input Output System. A framework for high-performance data management.SoftwareX12 (2020)

2020
[51]

Michael Prince, Doğa Gürsoy, Dina Sheyfer, Ryan Chard, Benoit Côté, Hannah Parraga, Barbara Frosik, Jon Tischler, and Nicholas Schwarz. 2023. Demonstrating Cross-Facility Data Processing at Scale With Laue Microdiffraction. InSC-W’23: Workshops of the International Conference for High Performance Computing, Net- work, Storage, and Analysis. Denver, USA, 2133–2139

2023
[52]

Charles Reiss, Alexey Tumanov, Gregory R Ganger, Randy H Katz, and Michael A Kozuch. 2012. Heterogeneity and dynamicity of clouds at scale: Google trace analysis. InSoCC’12: The 3rd ACM Symposium on Cloud Computing. San Jose, USA, 1–13

2012
[53]

Richard D Schlichting and Fred B Schneider. 1983. Fail-stop processors: An approach to designing fault-tolerant computing systems.ACM Transactions on Computer Systems1, 3 (1983), 222–238

1983
[54]

Won Wook Song, Taegeon Um, Sameh Elnikety, Myeongjae Jeon, and Byung-Gon Chun. 2023. Sponge: Fast reactive scaling for stream processing with serverless frameworks. InUSENIX ATC’23: The 2023 USENIX Annual Technical Conference. Boston, USA, 301–314

2023
[55]

Sebastian Steiner, Jakob Wolf, Stefan Glatzel, Anna Andreou, Jarosław M Granda, Graham Keenan, Trevor Hinkley, Gerardo Aragon-Camarasa, Philip J Kitson, Davide Angelone, et al. 2019. Organic synthesis in a modular robotic system driven by a chemical programming language.Science363, 6423 (2019), eaav2211

2019
[56]

Dawei Sun, Chunlin Zhang, Shang Gao, and Rajkumar Buyya. 2024. An adap- tive load balancing strategy for stateful join operator in skewed data stream environments.Future generation computer systems152 (2024), 138–151

2024
[57]

P Thibault, M Dierolf, A Menzel, O Bunk, C David, and F Pfeiffer. 2008. High- resolution scanning x-ray diffraction microscopy.Science321, 5887 (2008), 379– 382

2008
[58]

Ankit Toshniwal, Siddarth Taneja, Amit Shukla, Karthik Ramasamy, Jignesh M Patel, Sanjeev Kulkarni, Jason Jackson, Krishna Gade, Maosong Fu, Jake Donham, et al. 2014. Storm@ twitter. InSIGMOD’14: The 2014 ACM SIGMOD International Conference on Management of Data. Snowbird, USA, 147–156

2014
[59]

Xiaoli Yan, Nathaniel Hudson, Hyun Park, Daniel Grzenda, J Gregory Pauloski, Marcus Schwarting, Haochen Pan, Hassan Harb, Samuel Foreman, Chris Knight, et al. 2025. MOFA: Discovering Materials for Carbon Capture with a GenAI-and Simulation-Based Workflow.arXiv preprint arXiv:2501.10651(2025)

work page arXiv 2025
[60]

John W Young. 1974. A first order approximation to the optimum checkpoint interval.Commun. ACM17, 9 (1974), 530–531

1974
[61]

Matei Zaharia, Tathagata Das, Haoyuan Li, Timothy Hunter, Scott Shenker, and Ion Stoica. 2013. Discretized streams: Fault-tolerant streaming computation at scale. InSOSP’13: The 24th ACM Symposium on Operating Systems Principles. Farmington, USA, 423–438. 13

2013

[1] [1]

Ganesh Ananthanarayanan, Srikanth Kandula, Albert Greenberg, Ion Stoica, Yi Lu, Bikas Saha, and Edward Harris. 2010. Reining in the outliers in{Map-Reduce} clusters using mantri. InOSDI’10: The 9th USENIX Symposium on Operating Systems Design and Implementation. Vancouver, Canada, 265–278

2010

[2] [2]

Apache Flink. [n. d.]. Checkpoints. https://nightlies.apache.org/flink/flink-docs- stable/docs/ops/state/checkpoints/. Apache Flink Documentation, accessed: 2026-02-09

2026

[3] [3]

Michael Armbrust, Tathagata Das, Joseph Torres, Burak Yavuz, Shixiong Zhu, Reynold Xin, Ali Ghodsi, Ion Stoica, and Matei Zaharia. 2018. Structured stream- ing: A declarative api for real-time applications in apache spark. InSIGMOD’18: The 2018 ACM SIGMOD International Conference on Management of Data. Houston, USA, 601–613

2018

[4] [4]

Anakha V Babu, Tao Zhou, Saugat Kandel, Tekin Bicer, Zhengchun Liu, William Judge, Daniel J Ching, Yi Jiang, Sinisa Veseli, Steven Henke, et al. 2023. Deep learning at the edge enables real-time streaming ptychographic imaging.Nature Communications14, 1 (2023), 7059

2023

[5] [5]

Yadu Babuji, Anna Woodard, Zhuozhao Li, Daniel S Katz, Ben Clifford, Rohan Kumar, Lukasz Lacinski, Ryan Chard, Justin M Wozniak, Ian Foster, et al. 2019. Parsl: Pervasive parallel programming in python. InHPDC’19: The 28th Inter- national Symposium on High-Performance Parallel and Distributed Computing. Phoenix, USA, 25–36

2019

[6] [6]

GAUDI& Barrand, I Belyaev, P Binko, M Cattaneo, R Chytracek, G Corti, M Frank, G Gracia, J Harvey, Eric Van Herwijnen, et al. 2001. GAUDI—A software archi- tecture and framework for building HEP data processing applications.Computer physics communications140, 1-2 (2001), 45–55

2001

[7] [7]

Leonardo Bautista-Gomez, Seiji Tsuboi, Dimitri Komatitsch, Franck Cappello, Naoya Maruyama, and Satoshi Matsuoka. 2011. FTI: High Performance Fault Tolerance Interface for Hybrid Systems. InSC’11: The International Conference for High Performance Computing, Networking, Storage and Analysis. Seattle, USA, 32:1–32:32

2011

[8] [8]

Anne Benoit, Yishu Du, Thomas Herault, Loris Marchal, Guillaume Pallez, Lucas Perotin, Yves Robert, Hongyang Sun, and Frédéric Vivien. 2022. Checkpointing à la Young/Daly: An overview. InIC3’22: The 14th International Conference on Contemporary Computing. Noida, India, 701–710

2022

[9] [9]

Anne Benoit, Lucas Perotin, Yves Robert, and Frédéric Vivien. 2024. Checkpoint- ing Strategies to Tolerate Non-Memoryless Failures on HPC Platforms.ACM Trans. Parallel Comput.11, 1 (2024), 1:1–1:26

2024

[10] [10]

Tekin Bicer, Doğa Gürsoy, Vincent De Andrade, Rajkumar Kettimuthu, William Scullin, Francesco De Carlo, and Ian T Foster. 2017. Trace: A high-throughput tomographic reconstruction engine for large-scale datasets.Advanced Structural and Chemical Imaging3 (2017), 1–10

2017

[11] [11]

Tekin Bicer, Wei Jiang, and Gagan Agrawal. 2010. Supporting fault tolerance in a data-intensive computing middleware. InIPDPS’10: The 24th IEEE International Symposium on Parallel and Distributed Processing. Atlanta, USA, 1–12

2010

[12] [12]

Tekin Bicer, Viktor Nikitin, Selin Aslan, Doga Gürsoy, Rajkumar Kettimuthu, and Ian T Foster. 2020. Tomographic Reconstruction of Dynamic Features with Streaming Sliding Subsets. InXLOOP’20: The 2nd IEEE/ACM Annual Workshop on Extreme-Scale Experiment-in-the-Loop Computing. Virtual Event, 8–15

2020

[13] [13]

Franck Cappello, Al Geist, William Gropp, Sanjay Kale, Bill Kramer, and Marc Snir. 2014. Toward exascale resilience: 2014 update.Supercomputing Frontiers and Innovations: An International Journal1, 1 (2014), 5–28

2014

[14] [14]

Paris Carbone, Asterios Katsifodimos, Stephan Ewen, Volker Markl, Seif Haridi, and Kostas Tzoumas. 2015. Apache flink: Stream and batch processing in a single engine.The Bulletin of the Technical Committee on Data Engineering38, 4 (2015), 28–38

2015

[15] [15]

K Mani Chandy and Leslie Lamport. 1985. Distributed snapshots: Determining global states of distributed systems.ACM Transactions on Computer Systems (TOCS)3, 1 (1985), 63–75

1985

[16] [16]

John T Daly. 2006. A higher order estimate of the optimum checkpoint interval for restart dumps.Future Generation Computer Systems22, 3 (2006), 303–312

2006

[17] [17]

Jeffrey Dean and Luiz André Barroso. 2013. The tail at scale.Commun. ACM56, 2 (2013), 74–80

2013

[18] [18]

Jeffrey Dean and Sanjay Ghemawat. 2008. MapReduce: simplified data processing on large clusters.Commun. ACM51, 1 (2008), 107–113

2008

[19] [19]

Martin Dierolf, Andreas Menzel, Pierre Thibault, Philipp Schneider, Cameron M Kewish, Roger Wepf, Oliver Bunk, and Franz Pfeiffer. 2010. Ptychographic X-ray computed tomography at the nanoscale.Nature467, 7314 (2010), 436–439

2010

[20] [20]

2015.Fault tolerance techniques for high-performance computing

Jack Dongarra, Thomas Herault, and Yves Robert. 2015.Fault tolerance techniques for high-performance computing. Springer

2015

[21] [21]

Matthieu Dorier, Amal Gueroudji, Valérie Hayot-Sasson, Hai Duc Nguyen, Seth Ockerman, Renan Souza, Tekin Bicer, Haochen Pan, Philip Carns, Kyle Chard, et al

[22] [22]

Toward a persistent event-streaming system for high-performance com- puting applications.Frontiers in High Performance Computing3 (2025), 1638203

2025

[23] [23]

Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. 2024. The Llama 3 herd of models.arXiv preprint arXiv:2407.21783(2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[24] [24]

Kurt Ferreira, Jon Stearley, James H Laros III, Ron Oldfield, Kevin Pedretti, Ron Brightwell, Rolf Riesen, Patrick G Bridges, and Dorian Arnold. 2011. Evaluating the viability of process replication reliability for exascale systems. InSC’11: The International Conference for High Performance Computing, Networking, Storage and Analysis. Seattle, USA, 1–12

2011

[25] [25]

Michael Goodrich, Carl Timmer, Vardan Gyurjyan, David Lawrence, Graham Heyes, Yatish Kumar, and Stacey Sheldon. 2023. ESnet/JLab FPGA Accelerated Transport.IEEE Transactions on Nuclear Science70, 6 (2023), 1096–1101

2023

[26] [26]

Google Cloud. 2026. Google Cloud Dataflow. https://cloud.google.com/products/ dataflow. Accessed: 2026-02-09

2026

[27] [27]

Gossman, Bogdan Nicolae, and Jon C

Mikaila J. Gossman, Bogdan Nicolae, and Jon C. Calhoun. 2024. Scalable I/O aggregation for asynchronous multi-level checkpointing.Future Generation Computer Systems160 (2024), 420–432

2024

[28] [28]

Amal Gueroudji, Matthieu Dorier, Philip Carns, Parth Patel, Tekin Bicer, Robert Latham, Robert Ross, Kyle Chard, and Ian Foster. 2025. RESILIO: A Scalable and Composable Architecture for Tomographic Reconstruction Workflows. InSC- W’25: Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis. St. Louis, U...

2025

[29] [29]

Salman Habib, Vitali Morozov, Nicholas Frontiere, Hal Finkel, Adrian Pope, and Katrin Heitmann. 2013. HACC: Extreme scaling and performance across diverse architectures. InSC’13: The International Conference for High Performance Computing, Networking, Storage and Analysis. Denver, USA, 1–10

2013

[30] [30]

Peng Huang, Chuanxiong Guo, Jacob R Lorch, Lidong Zhou, and Yingneng Dang

[31] [31]

InHotOS’17: The 16th Workshop on Hot Topics in Operating Systems

Gray failure: The Achilles’ heel of cloud-scale systems. InHotOS’17: The 16th Workshop on Hot Topics in Operating Systems. Whistler, Canada, 150–155

[32] [32]

Eliu Antonio Huerta, Gabrielle Allen, Igor Andreoni, Javier M Antelis, Eti- enne Bachelet, G Bruce Berriman, Federica B Bianco, Rahul Biswas, Matias Carrasco Kind, Kyle Chard, et al . 2019. Enabling real-time multi-messenger astrophysics discoveries with deep learning.Nature Reviews Physics1, 10 (2019), 600–608

2019

[33] [33]

Yongho Kim, Seongha Park, Sean Shahkarami, Rajesh Sankaran, Nicola Ferrier, and Pete Beckman. 2022. Goal-driven scheduling model in edge computing for smart city applications.J. Parallel and Distrib. Comput.167 (2022), 97–108

2022

[34] [34]

Sanjeev Kulkarni, Nikunj Bhagat, Maosong Fu, Vikas Kedigehalli, Christopher Kellogg, Sailesh Mittal, Jignesh M Patel, Karthik Ramasamy, and Siddarth Taneja

[35] [35]

InSIGMOD’15: The 2015 ACM SIGMOD International Conference on Management of Data

Twitter heron: Stream processing at scale. InSIGMOD’15: The 2015 ACM SIGMOD International Conference on Management of Data. Melbourne, Australia, 239–250

2015

[36] [36]

Yifei Li, Ryan Chard, Yadu Babuji, Kyle Chard, Ian Foster, and Zhuozhao Li. 2024. UniFaaS: Programming across Distributed Cyberinfrastructure with Federated Function Serving. InIPDPS’24: The 38th IEEE International Parallel and Distributed Processing Symposium. San Francisco, USA, 217–229

2024

[37] [37]

Gang Liu, Zeting Wang, Amelie Chi Zhou, and Rui Mao. 2024. Adaptive key par- titioning in distributed stream processing.CCF Transactions on High Performance Computing6, 2 (2024), 164–178

2024

[38] [38]

Zhengchun Liu, Tekin Bicer, Rajkumar Kettimuthu, Doga Gursoy, Francesco De Carlo, and Ian Foster. 2020. TomoGAN: Low-dose synchrotron x-ray tomog- raphy with generative adversarial networks.JOSA A37, 3 (2020), 422–434

2020

[39] [39]

Pengqi Lu, Yue Yue, Liang Yuan, and Yunquan Zhang. 2021. AutoFlow: hotspot- aware, dynamic load balancing for distributed stream processing. InICA3PP’21: The 21st International Conference on Algorithms and Architectures for Parallel Processing. Virtual Event, 133–151

2021

[40] [40]

Smart, Emanuele Danovaro, Tiago Quintino, Dean Hildebrand, and Adrian Jackson

Nicolau Manubens, Johann Lombardi, Simon D. Smart, Emanuele Danovaro, Tiago Quintino, Dean Hildebrand, and Adrian Jackson. 2024. Exploring DAOS Interfaces and Performance. arXiv:2409.18682 [cs.DC]

work page arXiv 2024

[41] [41]

Mustafa Rafique, Franck Cappello, and Bogdan Nicolae

Avinash Maurya, M. Mustafa Rafique, Franck Cappello, and Bogdan Nicolae. 2023. Towards Efficient I/O Pipelines using Accumulated Compression. InHIPC’23: 30th IEEE International Conference on High Performance Computing, Data, and Analytics. Goa, India, 256–265

2023

[42] [42]

Mohror, A

K. Mohror, A. Moody, G. Bronevetsky, and B. R. de Supinski. 2014. Detailed Modeling and Evaluation of a Scalable Multilevel Checkpointing System.IEEE Transactions on Parallel and Distributed Systems25, 09 (2014), 2255–2263

2014

[43] [43]

Hai Duc Nguyen, Tekin Bicer, Bogdan Nicolae, Rajkumar Kettimuthu, Eliu A Huerta, and Ian T Foster. 2025. Resilient execution of distributed X-ray image analysis workflows.Frontiers in High Performance Computing3 (2025), 1550855

2025

[44] [44]

Hai Duc Nguyen, Zhifei Yang, and Andrew A Chien. 2021. Motivating high per- formance serverless workloads. InHiPS’21: The 1st Workshop on High Performance Serverless Computing. Virtual Event, 25–32

2021

[45] [45]

Bogdan Nicolae, Adam Moody, Elsa Gonsiorowski, Kathryn Mohror, and Franck Cappello. 2019. VeloC: Towards High Performance Adaptive Asynchronous Checkpointing at Large Scale. InIPDPS’19: The 2019 IEEE International Parallel and Distributed Processing Symposium. Rio de Janeiro, Brazil, 911–920

2019

[46] [46]

Bogdan Nicolae, Adam Moody, Greg Kosinovsky, Kathryn Mohror, and Franck Cappello. 2021. VELOC: VEry Low Overhead Checkpointing in the Age of Exascale. InSuperCheck’21: The First International Symposium on Checkpointing for Supercomputing. Virtual Event

2021

[47] [47]

Bogdan Nicolae, Justin M Wozniak, Tekin Bicer, Hai Nguyen, Parth Patel, Haochen Pan, Amal Gueroudji, Maxime Gonthier, Valerie Hayot-Sasson, Eliu Huerta, et al. 12

[48] [48]

Ine-Science’24: The 20th IEEE International Conference on e-Science

Diaspora: Resilience-enabling services for real-time distributed workflows. Ine-Science’24: The 20th IEEE International Conference on e-Science. Osaka, Japan, 1–9

[49] [49]

Shadi A Noghabi, Kartik Paramasivam, Yi Pan, Navina Ramesh, Jon Bringhurst, Indranil Gupta, and Roy H Campbell. 2017. Samza: stateful scalable stream processing at LinkedIn.Proceedings of the VLDB Endowment10, 12 (2017), 1634– 1645

2017

[50] [50]

Norbert Podhorszki et al. 2020. ADIOS 2: The Adaptable Input Output System. A framework for high-performance data management.SoftwareX12 (2020)

2020

[51] [51]

Michael Prince, Doğa Gürsoy, Dina Sheyfer, Ryan Chard, Benoit Côté, Hannah Parraga, Barbara Frosik, Jon Tischler, and Nicholas Schwarz. 2023. Demonstrating Cross-Facility Data Processing at Scale With Laue Microdiffraction. InSC-W’23: Workshops of the International Conference for High Performance Computing, Net- work, Storage, and Analysis. Denver, USA, 2133–2139

2023

[52] [52]

Charles Reiss, Alexey Tumanov, Gregory R Ganger, Randy H Katz, and Michael A Kozuch. 2012. Heterogeneity and dynamicity of clouds at scale: Google trace analysis. InSoCC’12: The 3rd ACM Symposium on Cloud Computing. San Jose, USA, 1–13

2012

[53] [53]

Richard D Schlichting and Fred B Schneider. 1983. Fail-stop processors: An approach to designing fault-tolerant computing systems.ACM Transactions on Computer Systems1, 3 (1983), 222–238

1983

[54] [54]

Won Wook Song, Taegeon Um, Sameh Elnikety, Myeongjae Jeon, and Byung-Gon Chun. 2023. Sponge: Fast reactive scaling for stream processing with serverless frameworks. InUSENIX ATC’23: The 2023 USENIX Annual Technical Conference. Boston, USA, 301–314

2023

[55] [55]

Sebastian Steiner, Jakob Wolf, Stefan Glatzel, Anna Andreou, Jarosław M Granda, Graham Keenan, Trevor Hinkley, Gerardo Aragon-Camarasa, Philip J Kitson, Davide Angelone, et al. 2019. Organic synthesis in a modular robotic system driven by a chemical programming language.Science363, 6423 (2019), eaav2211

2019

[56] [56]

Dawei Sun, Chunlin Zhang, Shang Gao, and Rajkumar Buyya. 2024. An adap- tive load balancing strategy for stateful join operator in skewed data stream environments.Future generation computer systems152 (2024), 138–151

2024

[57] [57]

P Thibault, M Dierolf, A Menzel, O Bunk, C David, and F Pfeiffer. 2008. High- resolution scanning x-ray diffraction microscopy.Science321, 5887 (2008), 379– 382

2008

[58] [58]

Ankit Toshniwal, Siddarth Taneja, Amit Shukla, Karthik Ramasamy, Jignesh M Patel, Sanjeev Kulkarni, Jason Jackson, Krishna Gade, Maosong Fu, Jake Donham, et al. 2014. Storm@ twitter. InSIGMOD’14: The 2014 ACM SIGMOD International Conference on Management of Data. Snowbird, USA, 147–156

2014

[59] [59]

Xiaoli Yan, Nathaniel Hudson, Hyun Park, Daniel Grzenda, J Gregory Pauloski, Marcus Schwarting, Haochen Pan, Hassan Harb, Samuel Foreman, Chris Knight, et al. 2025. MOFA: Discovering Materials for Carbon Capture with a GenAI-and Simulation-Based Workflow.arXiv preprint arXiv:2501.10651(2025)

work page arXiv 2025

[60] [60]

John W Young. 1974. A first order approximation to the optimum checkpoint interval.Commun. ACM17, 9 (1974), 530–531

1974

[61] [61]

Matei Zaharia, Tathagata Das, Haoyuan Li, Timothy Hunter, Scott Shenker, and Ion Stoica. 2013. Discretized streams: Fault-tolerant streaming computation at scale. InSOSP’13: The 24th ACM Symposium on Operating Systems Principles. Farmington, USA, 423–438. 13

2013