Orbax: Distributed Checkpointing with JAX

Abhishek Agrawal; Adam Cogdell; Anastasia Petrushkina; Angel Mau; Colin Gaffney; Daniel Ng; Justin Pan; Kiranbir Sodhia; Marco Berlot; Mridul Sahu

arxiv: 2605.23066 · v2 · pith:O6E5V4USnew · submitted 2026-05-21 · 💻 cs.DC · cs.LG

Orbax: Distributed Checkpointing with JAX

Colin Gaffney , Shutong Li , Daniel Ng , Anastasia Petrushkina , Niket Kumar , Adam Cogdell , Mridul Sahu , Yaning Liang

show 8 more authors

Nikhil Bansal Justin Pan Angel Mau Abhishek Agrawal Marco Berlot Ruoxin Sang Kiranbir Sodhia Rakesh Iyer

This is my paper

Pith reviewed 2026-05-25 05:02 UTC · model grok-4.3

classification 💻 cs.DC cs.LG

keywords OrbaxJAXcheckpointingdistributed trainingmachine learningaccelerator systemsperformance comparisonPyTorch

0 comments

The pith

Orbax supplies a modular JAX-native checkpointing library that abstracts distributed accelerator complexities and exceeds PyTorch speeds.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

JAX's modular design has left distributed ML users without a standard way to handle checkpointing across accelerators. Orbax fills this gap by offering a library that manages the underlying hardware details while allowing flexible operations on checkpoints at any stage of model development. The design supports both saving and loading in ways that integrate directly with JAX workflows. Reported results show these operations complete up to 3.5 times faster for saves and 2 times faster for loads than comparable PyTorch tools. If accurate, the approach reduces the engineering effort needed to make large-scale JAX training resilient.

Core claim

Orbax is presented as a JAX-native library that modularizes checkpointing for distributed systems, hiding accelerator-specific details while exposing user-friendly interfaces for manipulating saved states throughout the training lifecycle, with measured performance advantages over PyTorch baselines on the evaluated workloads.

What carries the argument

Orbax, the modular checkpointing library that abstracts distributed accelerator complexities for JAX.

If this is right

JAX practitioners gain a ready-made solution for distributed checkpointing without building custom code for each accelerator setup.
Checkpoint operations can occur more frequently during training because the reduced time overhead lowers the cost of each save.
Users obtain consistent interfaces for manipulating checkpoints at multiple points in the model lifecycle rather than ad-hoc scripts.
Adoption would shift engineering effort away from low-level distributed I/O toward higher-level model logic.
Direct comparisons position Orbax as a faster alternative whenever teams consider switching between JAX and PyTorch ecosystems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Widespread use could encourage JAX framework maintainers to treat checkpointing as a first-class concern in future releases.
The modularity might support plugging in new storage backends or compression schemes not covered in the initial benchmarks.
Lower checkpoint latency could make it practical to save model states after every few steps in very large training runs.
Teams running mixed JAX and PyTorch code might standardize on Orbax-style interfaces even outside pure JAX environments.

Load-bearing premise

The speedups depend on the chosen benchmarks and hardware setups being representative of typical JAX distributed training and on the PyTorch baselines having been implemented and measured equivalently.

What would settle it

A side-by-side measurement on a different workload or hardware configuration that fails to show the reported factors of speedup in save or load time.

Figures

Figures reproduced from arXiv: 2605.23066 by Abhishek Agrawal, Adam Cogdell, Anastasia Petrushkina, Angel Mau, Colin Gaffney, Daniel Ng, Justin Pan, Kiranbir Sodhia, Marco Berlot, Mridul Sahu, Niket Kumar, Nikhil Bansal, Rakesh Iyer, Ruoxin Sang, Shutong Li, Yaning Liang.

**Figure 1.** Figure 1: Orbax Checkpoint API structure Appendix B. Ecosystem Compatibility As we have noted above, fragmentation in the JAX ecosystem was an important motivator in the development of Orbax. Incompatibility of checkpoints across different codebases acted as a brake on development and experimentation. While Orbax provides a shared standard format and interface for JAX users, fragmentation between the JAX and PyTorch… view at source ↗

**Figure 2.** Figure 2: Asynchronous Saving Metadata Operations: For a given checkpoint path, data is written to a temporary directory corresponding to the requested path. The temporary directory may or may not be the same as the final path, depending on whether the underlying filesystem supports atomic directory renames. This allows Orbax to ensure atomicity of the written checkpoint, which guards against interruptions while wri… view at source ↗

**Figure 3.** Figure 3: Distributed Saving Appendix D. Loading Logic Unlike saving, checkpoint loading is a much less frequent operation, and is thus less of a performance bottleneck for most applications (assuming asynchronous saving is used). In most training codebases, loading occurs synchronously after model compilation; the compiled 7 [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

**Figure 4.** Figure 4: Generating the Loading Plan. When a user provides an abstract state, Orbax uses it as a strict contract to validate the nested structure and apply transformations. Otherwise, it relies on checkpoint metadata to validate the active accelerator topology. implementation details from the user. As long as the user has the necessary Pathways dependencies linked, all logic automatically switches over to single-co… view at source ↗

**Figure 5.** Figure 5: Architectural comparison of checkpointing execution under multi-controller (mcJAX) and single-controller (Pathways) environments. Pathways delegates highbandwidth data transfers to worker hosts via the colocated Python API, eliminating the central controller as an I/O bottleneck. 10 [PITH_FULL_IMAGE:figures/full_fig_p010_5.png] view at source ↗

**Figure 6.** Figure 6: Visualizing aligned vs. unaligned read requests during checkpoint restoration. [PITH_FULL_IMAGE:figures/full_fig_p013_6.png] view at source ↗

read the original abstract

In a landscape of high-performance distributed ML systems, JAX has emerged as a framework of choice. However, JAX's modular design philosophy leaves it without a standardized checkpointing solution. In this paper, we introduce Orbax, a modular, JAX-native checkpointing library that abstracts the complexities of distributed accelerator systems while also providing flexibility for user-friendly checkpoint manipulations throughout the ML model lifecycle. We demonstrate performance exceeding comparable PyTorch competitors by up to 3.5$\times$ for saving and 2$\times$ for loading. The library is available at https://github.com/google/orbax.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper introduces Orbax, a modular JAX-native checkpointing library for distributed ML systems that abstracts complexities of accelerator hardware while supporting flexible checkpoint manipulations. It claims performance exceeding comparable PyTorch competitors by up to 3.5× for saving and 2× for loading, with the library released at https://github.com/google/orbax.

Significance. If the performance claims are substantiated, Orbax would address a clear gap in standardized checkpointing for JAX's modular ecosystem and could become a practical tool for large-scale distributed training workflows.

major comments (1)

[Abstract] Abstract: The central empirical claim of up to 3.5× faster saving and 2× faster loading versus PyTorch competitors is presented without any workload descriptions, hardware specifications, benchmark methodology, error bars, or details on PyTorch baseline implementation and optimization. This prevents verification that the comparisons use equivalent conditions and representative JAX distributed training scenarios.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the detailed feedback. We agree that the abstract requires additional context to allow readers to evaluate the performance claims and will revise the manuscript to address this.

read point-by-point responses

Referee: [Abstract] Abstract: The central empirical claim of up to 3.5× faster saving and 2× faster loading versus PyTorch competitors is presented without any workload descriptions, hardware specifications, benchmark methodology, error bars, or details on PyTorch baseline implementation and optimization. This prevents verification that the comparisons use equivalent conditions and representative JAX distributed training scenarios.

Authors: We agree that the abstract should not present the performance numbers without supporting context. In the revised version we will expand the abstract to briefly note the workloads (large-scale distributed training of transformer models), hardware (multi-host TPU v4 clusters), and high-level benchmark methodology, while directing readers to the Experiments section for full details including error bars, PyTorch baseline configurations, and optimization steps. The comparisons were performed under matched conditions on representative JAX workloads; we will make this explicit. revision: yes

Circularity Check

0 steps flagged

No circularity: software artifact with empirical benchmarks, no derivation chain

full rationale

The paper describes the design and implementation of the Orbax checkpointing library for JAX, along with reported performance numbers from benchmarks. No mathematical derivations, equations, predictions from first principles, fitted parameters, or uniqueness theorems appear in the text. The central claims are engineering and empirical rather than analytic, so none of the enumerated circularity patterns (self-definitional, fitted-input-called-prediction, self-citation load-bearing, etc.) can be instantiated. The work is therefore self-contained against external benchmarks with a circularity score of 0.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a systems and software paper; no free parameters, mathematical axioms, or invented physical entities are present.

pith-pipeline@v0.9.0 · 5680 in / 996 out tokens · 35307 ms · 2026-05-25T05:02:27.508958+00:00 · methodology

Orbax: Distributed Checkpointing with JAX

Core claim

What carries the argument

If this is right

Where Pith is reading between the lines

Load-bearing premise

What would settle it

discussion (0)