A Lightweight Library for Energy-Based Joint-Embedding Predictive Architectures

Amir Bar; Basile Terver; David Fan; Koustuv Sinha; Megi Dervishi; Mike Rabbat; Quentin Garrido; Randall Balestriero; Tushar Nagarajan; Wancong Zhang

arxiv: 2602.03604 · v3 · submitted 2026-02-03 · 💻 cs.CV · cs.AI

Basile Terver, Randall Balestriero, Megi Dervishi, David Fan, Quentin Garrido, Tushar Nagarajan, Koustuv Sinha, Wancong Zhang, Mike Rabbat, Yann LeCun, Amir Bar

Pith reviewed 2026-05-16 08:02 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords joint-embedding predictive architecture · self-supervised learning · representation learning · energy-based models · world models · video prediction · navigation tasks · open-source library

The pith

The EB-JEPA library provides single-GPU implementations of joint-embedding predictive architectures that transfer from images to video and to action-conditioned world models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces EB-JEPA as an open-source library that implements Joint-Embedding Predictive Architectures for self-supervised representation learning. These architectures train models to predict future states directly in embedding space rather than generating pixels, which sidesteps common issues in generative approaches. The library supplies modular code examples that begin with static image tasks on CIFAR-10, extend to multi-step temporal prediction on Moving MNIST, and reach action-conditioned planning on the Two Rooms navigation task. Reported results include 91 percent probing accuracy on image representations and 97 percent planning success when the full set of regularization terms is used. Comprehensive ablations across these examples establish that each regularization component plays an essential role in keeping the learned representations from collapsing.

Core claim

What carries the argument

The Joint-Embedding Predictive Architecture (JEPA), which encodes inputs and predicts future embeddings under an energy-based loss with multiple regularization terms that together prevent representation collapse.
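
As a rough illustration of the kind of objective described above, the sketch below combines a prediction error in embedding space with variance and covariance regularizers in the style of VICReg. It is a pure-Python toy and not EB-JEPA code: the function names, the hinge threshold `gamma`, and the weight `lam` are assumptions made for this example, and the library's actual terms may differ.

```python
import math


def mse(pred, target):
    """Mean squared error between two batches of embeddings (lists of vectors)."""
    n, d = len(pred), len(pred[0])
    total = sum((p - t) ** 2 for pv, tv in zip(pred, target) for p, t in zip(pv, tv))
    return total / (n * d)


def variance_term(z, gamma=1.0, eps=1e-4):
    """Hinge on the per-dimension std: penalizes dimensions whose std is below gamma."""
    n, d = len(z), len(z[0])
    total = 0.0
    for j in range(d):
        col = [row[j] for row in z]
        mean = sum(col) / n
        var = sum((v - mean) ** 2 for v in col) / (n - 1)
        total += max(0.0, gamma - math.sqrt(var + eps))
    return total / d


def covariance_term(z):
    """Sum of squared off-diagonal covariance entries, divided by the dimension."""
    n, d = len(z), len(z[0])
    means = [sum(row[j] for row in z) / n for j in range(d)]
    total = 0.0
    for i in range(d):
        for j in range(d):
            if i != j:
                cov = sum((r[i] - means[i]) * (r[j] - means[j]) for r in z) / (n - 1)
                total += cov ** 2
    return total / d


def jepa_loss(pred, target, z, lam=1.0):
    """Prediction error plus a weighted anti-collapse regularizer (hypothetical form)."""
    return mse(pred, target) + lam * (variance_term(z) + covariance_term(z))


# Collapsed batch: every embedding is identical, so prediction is trivially perfect.
collapsed = [[1.0, 1.0]] * 4
# Spread batch: unit-scale variance in each dimension, no cross-dimension covariance.
spread = [[2.0, 0.0], [-2.0, 0.0], [0.0, 2.0], [0.0, -2.0]]
print(jepa_loss(collapsed, collapsed, collapsed))
print(jepa_loss(spread, spread, spread))
```

With identical, collapsed embeddings the prediction error is zero, yet the variance hinge stays close to 1. That is how a regularizer of this kind makes the trivial constant solution costly.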

If this is right

Image-level JEPA training extends to video by adding temporal multi-step prediction without changing the core architecture.
Representations learned this way support action-conditioned world models that achieve high planning success rates in navigation tasks.
Each listed regularization term must be present; removing any one produces collapse and sharply lower downstream accuracy.
The single-GPU, few-hour runtime design makes the same methods immediately usable for rapid experimentation on new tasks.
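
To make the multi-step prediction and planning points above more concrete, here is a toy sketch of an autoregressive rollout in embedding space followed by a random-shooting planner. Everything in it is an assumption for illustration: the additive `predict` dynamics stand in for a learned predictor, and random shooting is a generic stand-in, since the text here does not say which planner the library uses.

```python
import random


def predict(z, u):
    """Hypothetical one-step latent predictor: next embedding = z + u (toy dynamics)."""
    return [zi + ui for zi, ui in zip(z, u)]


def rollout(z0, actions):
    """Multi-step prediction: feed each predicted embedding back into the predictor."""
    traj = [z0]
    for u in actions:
        traj.append(predict(traj[-1], u))
    return traj


def plan(z0, z_goal, horizon=5, n_candidates=200, seed=0):
    """Random shooting: sample action sequences and keep the one whose final
    predicted embedding is closest (squared distance) to the goal embedding."""
    rng = random.Random(seed)
    best_cost, best_actions = float("inf"), None
    for _ in range(n_candidates):
        actions = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(horizon)]
        z_final = rollout(z0, actions)[-1]
        cost = sum((a - b) ** 2 for a, b in zip(z_final, z_goal))
        if cost < best_cost:
            best_cost, best_actions = cost, actions
    return best_actions, best_cost


# Two-step rollout from the origin with unit moves along each axis.
print(rollout([0, 0], [[1, 0], [0, 1]]))
# Plan toward a goal embedding at (2, 2) starting from the origin.
actions, cost = plan([0.0, 0.0], [2.0, 2.0])
print(len(actions), cost)
```

The planner never decodes to pixels: candidate action sequences are scored purely by distance between predicted and goal embeddings, which is the property that makes planning in a learned latent space cheap.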

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The modular structure could support direct substitution of larger backbone networks or datasets once single-GPU constraints are relaxed.
Similar prediction-in-embedding recipes may apply to other sequential domains such as audio or robotics sensor streams.
Independent verification on held-out environments would clarify how far the 97 percent planning result generalizes beyond the Two Rooms setup.

Load-bearing premise

The single-GPU examples and ablations on small datasets like CIFAR-10, Moving MNIST, and Two Rooms are assumed to demonstrate that the code is correct and that image-level techniques transfer reliably to video and action models.

What would settle it

Retraining the CIFAR-10 example exactly as provided and obtaining probing accuracy near random chance would show that the reported performance does not hold.
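
Probing accuracy is usually measured by training a linear classifier on frozen embeddings. The sketch below shows that procedure on hand-made toy features, assuming a plain softmax-regression probe trained with SGD; the function names and hyperparameters are hypothetical and it does not reproduce the library's evaluation code.

```python
import math


def _logits(W, b, x):
    """Linear scores for one feature vector: W x + b."""
    return [sum(w * xi for w, xi in zip(W[c], x)) + b[c] for c in range(len(W))]


def train_linear_probe(feats, labels, n_classes, epochs=50, lr=0.1):
    """Fit softmax regression on frozen features; the encoder is never updated."""
    d = len(feats[0])
    W = [[0.0] * d for _ in range(n_classes)]
    b = [0.0] * n_classes
    for _ in range(epochs):
        for x, y in zip(feats, labels):
            logits = _logits(W, b, x)
            m = max(logits)  # subtract the max for numerical stability
            exps = [math.exp(l - m) for l in logits]
            s = sum(exps)
            for c in range(n_classes):
                grad = exps[c] / s - (1.0 if c == y else 0.0)
                b[c] -= lr * grad
                for j in range(d):
                    W[c][j] -= lr * grad * x[j]
    return W, b


def accuracy(W, b, feats, labels):
    """Fraction of samples whose highest-scoring class matches the label."""
    correct = 0
    for x, y in zip(feats, labels):
        logits = _logits(W, b, x)
        correct += logits.index(max(logits)) == y
    return correct / len(labels)


labels = [0, 0, 1, 1]
# Informative features: the first coordinate separates the two classes.
informative = [[2.0, 0.1], [2.0, -0.1], [-2.0, 0.1], [-2.0, -0.1]]
# Collapsed features: every input maps to the same embedding.
collapsed = [[1.0, 1.0]] * 4

W, b = train_linear_probe(informative, labels, n_classes=2)
print(accuracy(W, b, informative, labels))
W, b = train_linear_probe(collapsed, labels, n_classes=2)
print(accuracy(W, b, collapsed, labels))
```

On collapsed features the probe can only predict a single class for every input, so its accuracy sits at chance. This is the signature the falsification test above would look for.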

Original abstract

We present EB-JEPA, an open-source library for learning representations and world models using Joint-Embedding Predictive Architectures (JEPAs). JEPAs learn to predict in representation space rather than pixel space, avoiding the pitfalls of generative modeling while capturing semantically meaningful features suitable for downstream tasks. Our library provides modular, self-contained implementations that illustrate how representation learning techniques developed for image-level self-supervised learning can transfer to video, where temporal dynamics add complexity, and ultimately to action-conditioned world models, where the model must additionally learn to predict the effects of control inputs. Each example is designed for single-GPU training within a few hours, making energy-based self-supervised learning accessible for research and education. We provide ablations of JEA components on CIFAR-10. Probing these representations yields 91% accuracy, indicating that the model learns useful features. Extending to video, we include a multi-step prediction example on Moving MNIST that demonstrates how the same principles scale to temporal modeling. Finally, we show how these representations can drive action-conditioned world models, achieving a 97% planning success rate on the Two Rooms navigation task. Comprehensive ablations reveal the critical importance of each regularization component for preventing representation collapse. Code is available at https://github.com/facebookresearch/eb_jepa.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This is a library release paper packaging existing JEPA ideas into single-GPU examples, with solid CIFAR ablations but incomplete checks on transfer to video and control.

The main takeaway is that this paper releases EB-JEPA, a lightweight open-source library with modular code for joint-embedding predictive architectures. It does not introduce new algorithms or theory; the core approach is from prior work by the same group. What it offers instead are runnable, self-contained examples that start with image-level SSL on CIFAR-10, move to multi-step prediction on Moving MNIST, and end with an action-conditioned world model on the Two Rooms navigation task, all designed to train on one GPU in a few hours. The GitHub link and the reported numbers (91% probing accuracy on CIFAR, 97% planning success) make the implementations concrete and easy to try.

Referee Report

2 major / 2 minor

Summary. The manuscript presents EB-JEPA, an open-source library providing modular implementations of energy-based Joint-Embedding Predictive Architectures (JEPAs). It illustrates transfer of image-level self-supervised learning techniques to video (multi-step prediction on Moving MNIST) and action-conditioned world models (97% planning success on Two Rooms navigation), with single-GPU examples. Ablations on CIFAR-10 report 91% linear probing accuracy and demonstrate that regularization components are critical for preventing representation collapse.

Significance. If the library implementations are correct and the regularization effects generalize beyond static images, the work could lower barriers to experimenting with energy-based predictive architectures for temporal and control tasks. The emphasis on lightweight, reproducible examples and open-source code supports accessibility and education, though the empirical claims rest primarily on end-to-end results for the more complex domains.

major comments (2)

[video and navigation examples] Video and navigation examples: only end-to-end results (multi-step prediction on Moving MNIST; 97% planning success on Two Rooms) are reported. No ablation tables, collapse metrics, or component-removal experiments are provided to verify that the same regularization components prevent representation collapse once temporal dynamics and action inputs are introduced, unlike the CIFAR-10 case.
[abstract and results sections] Transferability claim: the central assertion that image-level JEPA regularization transfers to video and action-conditioned models depends on the untested assumption that collapse-prevention effects observed on CIFAR-10 continue to hold. Without component-wise checks on the temporal and control tasks, the claim lacks direct empirical support.

minor comments (2)

[abstract] The abstract states 'comprehensive ablations' but the provided details indicate they are limited to CIFAR-10; clarify the scope of ablations in the text.
[results] Include error bars, number of runs, or statistical details for the reported 91% probing accuracy and 97% planning success rate to strengthen the empirical claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript describing the EB-JEPA library. We address the major comments below and will revise the manuscript accordingly to strengthen the presentation of results and claims.

Referee: Video and navigation examples: only end-to-end results (multi-step prediction on Moving MNIST; 97% planning success on Two Rooms) are reported. No ablation tables, collapse metrics, or component-removal experiments are provided to verify that the same regularization components prevent representation collapse once temporal dynamics and action inputs are introduced, unlike the CIFAR-10 case.

Authors: We agree that the current manuscript provides detailed component ablations and collapse metrics only for the CIFAR-10 case. The video and navigation examples report end-to-end performance to demonstrate that the library implementations function correctly on temporal and control tasks. In the revision we will add collapse-related metrics (e.g., representation variance and norm statistics during training) for the Moving MNIST and Two Rooms examples to provide direct evidence that the same regularization terms remain effective. Full component-removal tables for these tasks will be included where computationally feasible within the single-GPU constraint emphasized in the paper.
revision: yes

Referee: Transferability claim: the central assertion that image-level JEPA regularization transfers to video and action-conditioned models depends on the untested assumption that collapse-prevention effects observed on CIFAR-10 continue to hold. Without component-wise checks on the temporal and control tasks, the claim lacks direct empirical support.

Authors: We acknowledge that the transferability statement in the abstract and results sections relies partly on the successful end-to-end outcomes rather than identical component-wise ablations. We will revise the relevant sections to explicitly distinguish between the CIFAR-10 ablations (which directly verify regularization effects) and the video/control results (which serve as functional validation). The revised text will note that the library is structured to enable users to run the same ablations on new tasks, and we will add a short discussion of expected generalization based on the shared architecture. We will also moderate the wording of the transfer claim to avoid overstating the current empirical support.
revision: partial
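
The collapse-related metrics mentioned in the first response (representation variance and norm statistics) can be logged with very little code. The sketch below is one plausible way to compute them over a batch of embeddings; the function name and the particular statistics returned are assumptions for illustration, not the authors' implementation.

```python
import math


def collapse_metrics(z):
    """Per-dimension std and L2-norm statistics for a batch of embeddings.

    A mean or minimum std near zero suggests that some or all dimensions
    have collapsed to a constant across the batch.
    """
    n, d = len(z), len(z[0])
    stds = []
    for j in range(d):
        col = [row[j] for row in z]
        mean = sum(col) / n
        stds.append(math.sqrt(sum((v - mean) ** 2 for v in col) / (n - 1)))
    norms = [math.sqrt(sum(v * v for v in row)) for row in z]
    return {
        "mean_std": sum(stds) / d,
        "min_std": min(stds),
        "mean_norm": sum(norms) / n,
    }


# Collapsed batch: identical embeddings, so every per-dimension std is zero.
collapsed = [[3.0, 4.0]] * 4
# Healthy batch: embeddings spread along both dimensions.
healthy = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
print(collapse_metrics(collapsed))
print(collapse_metrics(healthy))
```

Note that the norm alone does not reveal collapse: the collapsed batch here has a perfectly reasonable mean norm, which is why the variance statistic needs to be tracked alongside it.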

Circularity Check

0 steps flagged

No circularity: empirical library results with no derivation chain

The manuscript is a library presentation paper whose central claims are the availability of modular single-GPU code examples and the empirical performance numbers obtained by running them (91% linear probing on CIFAR-10, 97% planning success on Two Rooms). No equations, uniqueness theorems, fitted-parameter predictions, or ansatzes are introduced that could reduce to their own inputs. Ablations are reported only for the CIFAR-10 case; the video and navigation examples are end-to-end demonstrations rather than derivations. Because the work contains no load-bearing mathematical steps, self-citations, or self-definitional constructions, the circularity score is zero.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

As a library release paper, the abstract introduces no new free parameters, axioms, or invented entities; it relies on standard self-supervised learning practices.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel echoes

ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

L = L_pred(g_ϕ(z, u), z′) + λR(z) ... preventing representation collapse ... variance term ... covariance term ... SIGReg ... optimal embedding distribution
IndisputableMonolith/Cost.lean Jcost echoes

ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

energy function E(x, y) measuring compatibility ... low energy indicates high compatibility

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Render, Don't Decode: Weight-Space World Models with Latent Structural Disentanglement
cs.CV 2026-05 unverdicted novelty 7.0

NOVA represents world states as INR weights for decoder-free rendering, compactness, and unsupervised disentanglement of background, foreground, and motion in video world models.

Render, Don't Decode: Weight-Space World Models with Latent Structural Disentanglement
cs.CV 2026-05 unverdicted novelty 6.0

NOVA represents scene states as INR weights for analytical rendering without decoders and achieves structural disentanglement of content and dynamics in video world models.

Hierarchical Planning with Latent World Models
cs.LG 2026-04 unverdicted novelty 6.0

Hierarchical planning over multi-scale latent world models enables 70% success on real robotic pick-and-place with goal-only input where flat models achieve 0%, while cutting planning compute up to 4x in simulations.

stable-worldmodel: A Platform for Reproducible World Modeling Research and Evaluation
cs.LG 2026-05 unverdicted novelty 5.0

The paper presents stable-worldmodel (swm), a platform with high-performance data layer, modern world model baselines, planning solvers, and extended environments for reproducible research and generalization evaluation.