pith. machine review for the scientific record.

arxiv: 2604.08065 · v1 · submitted 2026-04-09 · 💻 cs.LG

Recognition: unknown

Multimodal Latent Reasoning via Predictive Embeddings

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 17:21 UTC · model grok-4.3

classification 💻 cs.LG
keywords multimodal reasoning · latent space · predictive embeddings · tool-augmented models · vision-language models · JEPA · Pearl

The pith

Pearl learns predictive embeddings from expert trajectories to let multimodal models reason in latent space without explicit tool calls at inference.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes Pearl, a framework that trains vision-language models to predict future embeddings directly from sequences of multimodal states in expert tool-use trajectories. This sidesteps the need to invoke external tools such as cropping or depth estimation during inference, while retaining the model's standard token-generation pipeline. The method is presented as model-agnostic and able to handle chains of multiple tool uses without extra supervision. A reader would care because current tool-augmented systems add latency and error risk; Pearl aims to deliver comparable perception performance more efficiently. Experiments on several benchmarks show results that match or exceed both standard supervised fine-tuning and reconstruction-based latent methods, and the authors note that reconstruction approaches appear to learn embeddings rather than actual image edits in latent space.

Core claim

Pearl is a JEPA-inspired method that learns predictive embeddings from multimodal trajectories in latent space. Unlike reconstruction-based latent reasoning, which generates latent tokens autoregressively and suffers from training-inference mismatch, Pearl directly aligns embeddings from expert trajectories while keeping the ordinary vision-language generation process unchanged. The framework is model-agnostic, straightforward to train, and naturally accommodates multi-step tool trajectories. On multiple perception benchmarks it matches or surpasses supervised fine-tuning and reconstruction baselines. The work also supplies evidence that reconstruction methods primarily acquire embeddings rather than image edits in latent space.

What carries the argument

Pearl's predictive embedding alignment, which trains the model to forecast the next embedding in an expert tool-use trajectory from the current multimodal state.
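
As a rough illustration of what this alignment could look like, here is a minimal PyTorch-style sketch. The function and argument names are placeholders, not the paper's implementation; per the abstract and Figure 1, a tied-weights predictor maps the encoding of the current state ⟨I0, Q⟩ toward a stop-gradient embedding of the expert trajectory R, and this term is added to the usual VLM loss with a weight λ (the appendix fragment quoted with Figure 4 reports λ = 0.2).

```python
import torch.nn.functional as F

def predictive_alignment_loss(state_emb, trajectory_emb, predictor):
    """Sketch of a JEPA-style predictive alignment term (placeholder names).

    state_emb:      embedding of the current multimodal state <I0, Q>  [B, D]
    trajectory_emb: embedding of the expert tool-use trajectory R      [B, D]
    predictor:      module mapping the state embedding into the
                    trajectory latent space (produces h_hat_R)
    """
    predicted = predictor(state_emb)      # predicted trajectory embedding, h_hat_R
    target = trajectory_emb.detach()      # stop-gradient target, h_R
    # Negative cosine similarity is one common JEPA-style distance; the exact
    # distance Pearl uses is not specified in the text quoted on this page.
    return 1.0 - F.cosine_similarity(predicted, target, dim=-1).mean()

def total_loss(vlm_loss, alignment_loss, lam=0.2):
    """Combine the ordinary next-token VLM loss with the alignment term,
    using the lambda value reported in the extracted appendix text."""
    return vlm_loss + lam * alignment_loss
```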

If this is right

  • Explicit tool invocation and its associated latency are removed from the inference path.
  • Multi-step tool sequences become feasible without additional supervision or pipeline changes.
  • The approach integrates with any existing vision-language model without architectural modification.
  • Performance on perception tasks equals or exceeds both supervised fine-tuning and reconstruction-based latent reasoning.
  • Reconstruction objectives in latent space are less effective than direct predictive alignment for this task.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Inference speed for complex visual tasks could improve substantially once tool calls are internalized as embedding predictions.
  • The same trajectory-based training could be tested on language-only or robotic tool-use domains to check whether predictive embeddings generalize beyond vision.
  • If predictive embeddings capture tool effects more reliably than reconstruction, future systems might shift from explicit tool APIs to learned latent simulators.

Load-bearing premise

Expert tool-use trajectories supply enough signal for the model to learn embeddings that can stand in for explicit tool executions while preserving multi-step reasoning at inference time.

What would settle it

Apply Pearl to a held-out collection of tool-use problems that require novel tool combinations never seen in the training trajectories; if accuracy drops sharply without explicit tool calls while standard tool-augmented baselines remain accurate, the central claim is falsified.
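
For concreteness, that comparison could be scored with something as simple as the following sketch; the margin and variable names are illustrative, not values from the paper.

```python
def falsified(pearl_correct, tool_baseline_correct, drop=0.15):
    """Hypothetical falsification check on held-out problems requiring tool
    combinations unseen in the training trajectories.

    pearl_correct / tool_baseline_correct: per-example correctness flags for
    Pearl (no explicit tool calls) and a standard tool-augmented baseline.
    `drop` is an arbitrary illustrative margin, not a value from the paper.
    """
    pearl_acc = sum(pearl_correct) / len(pearl_correct)
    baseline_acc = sum(tool_baseline_correct) / len(tool_baseline_correct)
    return (baseline_acc - pearl_acc) > drop
```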

Figures

Figures reproduced from arXiv: 2604.08065 by Ashutosh Adhikari, Mirella Lapata.

Figure 1: PEARL architecture. Left: solid arrows denote forward-pass dataflow; dashed arrows denote which components contribute to each loss (no forward pass). During training, two independent forward passes encode ⟨I0, Q⟩ and the expert trajectory R. A tied-weights predictor maps the input encoding to the trajectory latent space. LJEPA aligns the predicted embedding ĥR with the stop-gradient target hR; LVLM preser…

Figure 2: Cumulative distribution function (CDF) of the number of latent tokens per training example (x-axis) over a sample of ∼19k examples used to train LVR.

Figure 3: Correlation between the number of reasoning steps and accuracy across BLINK.

Figure 4: t-SNE visualization of the views I0, Q and R across tasks for Qwen2.5-VL-7B-Instruct trained with PEARL on the left, compared with simple fine-tuning with next-token prediction on the right. (Appendix A.1 text captured alongside this figure: λ in Equation (4) is set to 0.2 and the number of [PRED] tokens to 4 across all training settings; all experiments are conducted on either 4 …)

Figure 5: Training architecture for latent-augmented multimodal reasoning. The input image ximg is encoded into image embeddings and concatenated with text query tokens. The model autoregressively generates a sequence of continuous latent reasoning tokens (delimited by <lat> … </lat>), followed by discrete text answer tokens. Latent tokens are supervised with a continuous regression loss Llatent against a visual…
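
Figure 5 describes the reconstruction-style baseline Pearl is contrasted against: latent reasoning tokens are generated autoregressively and supervised with a continuous regression loss against visual embeddings. A hedged sketch of that kind of latent regression term, with placeholder names, makes the contrast with the predictive alignment above concrete.

```python
import torch.nn.functional as F

def latent_regression_loss(generated_latents, target_visual_embs):
    """Sketch of the L_latent term described in Figure 5 for the
    reconstruction baseline: each autoregressively generated latent token is
    regressed onto a target visual embedding. Argument names are illustrative.

    generated_latents:   latent tokens emitted between <lat> ... </lat>  [B, T, D]
    target_visual_embs:  visual embeddings used as regression targets    [B, T, D]
    """
    # MSE is one natural choice of continuous regression loss; the quoted
    # caption does not specify the exact distance used.
    return F.mse_loss(generated_latents, target_visual_embs)
```

The training-inference mismatch the review flags would arise, presumably, because at inference each latent token is conditioned on the model's own previous latents rather than on the supervised targets seen in training; Pearl's predictive alignment over the trajectory embedding avoids generating latents autoregressively at all.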
read the original abstract

Tool-augmented multimodal reasoning enables visual language models (VLMs) to improve perception by interacting with external tools (e.g., cropping, depth estimation). However, such approaches incur substantial inference overhead, require specialized supervision, and are prone to erroneous tool calls. We propose Pearl (Predictive Embedding Alignment for Reasoning in Latent space), a JEPA-inspired framework that learns from expert tool-use trajectories entirely in the latent space, eliminating the need for explicit tool invocation at inference time. Unlike reconstruction-based latent reasoning methods, which autoregressively generate latent tokens and suffer from training-inference mismatch and limited support for multi-step tool use, Pearl directly learns predictive embeddings from multimodal trajectories while preserving the standard vision-language generation pipeline: it is model-agnostic, simple to train, and naturally supports trajectories with multiple tool calls. Experiments across multiple perception benchmarks show that Pearl matches or outperforms standard supervised fine-tuning and reconstruction-based latent reasoning approaches. Furthermore, we provide empirical evidence that reconstruction-based methods primarily learn embeddings rather than image edits in latent space, motivating predictive embedding learning as a more principled alternative.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper proposes Pearl (Predictive Embedding Alignment for Reasoning in Latent space), a JEPA-inspired framework that learns predictive embeddings directly from expert multimodal tool-use trajectories entirely in latent space. This allows VLMs to substitute the learned embeddings for explicit tool invocations (e.g., cropping, depth estimation) at inference time while preserving the standard vision-language generation pipeline. The approach is claimed to be model-agnostic, simple to train, and naturally supportive of multi-step trajectories, in contrast to reconstruction-based latent reasoning methods that suffer from training-inference mismatch. Experiments on multiple perception benchmarks are said to show that Pearl matches or outperforms supervised fine-tuning and reconstruction baselines; additional evidence is presented that reconstruction methods primarily learn embeddings rather than image edits.

Significance. If the central empirical claims hold under rigorous controls, Pearl would offer a practical route to lower inference overhead for tool-augmented VLMs by shifting tool interactions to training-time latent prediction, while retaining multi-step reasoning capability. The contrast with reconstruction methods and the suggestion that predictive alignment is more principled could influence future work on latent-space reasoning in multimodal models.

major comments (3)
  1. [Abstract] Abstract: the central empirical claim that 'experiments across multiple perception benchmarks show that Pearl matches or outperforms standard supervised fine-tuning and reconstruction-based latent reasoning approaches' is presented without any benchmark names, metrics, controls, statistical details, or ablation results. Because this claim is load-bearing for the contribution, the absence of these specifics prevents assessment of whether the predictive-embedding substitution actually preserves multi-step reasoning fidelity.
  2. [Experiments] The skeptic concern is not addressed: the assumption that JEPA-style predicted embeddings can substitute for structured tool outputs (pixel crops, metric depth maps) without degrading chained reasoning is untested. Single-step perception benchmarks may pass while downstream multi-step error rates rise due to distribution mismatch; the manuscript should include targeted ablations comparing explicit-tool vs. predicted-embedding trajectories on multi-step tasks.
  3. [Method] No equations or loss formulations are shown for the predictive embedding alignment objective, the JEPA-style prediction head, or how trajectories with multiple tool calls are encoded. Without these, it is impossible to verify that the method is parameter-free, differs meaningfully from standard contrastive or reconstruction losses, or avoids the training-inference mismatch it criticizes in reconstruction baselines.
minor comments (1)
  1. [Abstract] The abstract states that reconstruction methods 'primarily learn embeddings rather than image edits,' but this observation is not quantified or tied to a specific figure or table in the summary.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the presentation of our empirical claims, multi-step reasoning support, and methodological details. We address each major comment below and will revise the manuscript accordingly to improve clarity and rigor.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central empirical claim that 'experiments across multiple perception benchmarks show that Pearl matches or outperforms standard supervised fine-tuning and reconstruction-based latent reasoning approaches' is presented without any benchmark names, metrics, controls, statistical details, or ablation results. Because this claim is load-bearing for the contribution, the absence of these specifics prevents assessment of whether the predictive-embedding substitution actually preserves multi-step reasoning fidelity.

    Authors: We agree that the abstract should provide more concrete details to allow immediate assessment of the claims. In the revised version, we will expand the abstract to name the specific perception benchmarks, report key metrics with comparisons to baselines, and briefly note the controls, ablations, and evidence regarding multi-step fidelity. This will make the load-bearing empirical contribution more transparent without altering the high-level narrative. revision: yes

  2. Referee: [Experiments] The skeptic concern is not addressed: the assumption that JEPA-style predicted embeddings can substitute for structured tool outputs (pixel crops, metric depth maps) without degrading chained reasoning is untested. Single-step perception benchmarks may pass while downstream multi-step error rates rise due to distribution mismatch; the manuscript should include targeted ablations comparing explicit-tool vs. predicted-embedding trajectories on multi-step tasks.

    Authors: This is a valid concern regarding potential error propagation in chained reasoning. While our current experiments use expert trajectories that inherently involve multi-step tool sequences and show competitive performance on perception benchmarks, we acknowledge that explicit ablations isolating the substitution effect on multi-step error rates would strengthen the evidence. We will add such targeted comparisons in the revised manuscript, evaluating both explicit-tool and predicted-embedding trajectories on tasks requiring chained reasoning. revision: yes

  3. Referee: [Method] No equations or loss formulations are shown for the predictive embedding alignment objective, the JEPA-style prediction head, or how trajectories with multiple tool calls are encoded. Without these, it is impossible to verify that the method is parameter-free, differs meaningfully from standard contrastive or reconstruction losses, or avoids the training-inference mismatch it criticizes in reconstruction baselines.

    Authors: We apologize for this omission in the initial submission, which hinders verification of the technical distinctions. The revised manuscript will include the full equations for the predictive embedding alignment objective, the JEPA-style prediction head architecture, and the encoding of multi-tool trajectories. These additions will explicitly demonstrate how the approach differs from contrastive and reconstruction losses while eliminating the training-inference mismatch. revision: yes

Circularity Check

0 steps flagged

No circularity: purely conceptual framework with no equations or derivations

full rationale

The provided manuscript text and abstract describe Pearl at a high conceptual level as a JEPA-inspired approach that learns predictive embeddings from expert trajectories, without any equations, mathematical derivations, or explicit reduction of claims to fitted parameters. No self-definitional loops, fitted-input predictions, or load-bearing self-citations are present. The central claims rest on experimental comparisons to supervised fine-tuning and reconstruction baselines rather than on any internal construction that equates outputs to inputs. This is the expected non-finding for a methods paper that does not attempt a formal derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no free parameters, axioms, or invented entities are described in the provided text.

pith-pipeline@v0.9.0 · 5483 in / 1007 out tokens · 82514 ms · 2026-05-10T17:21:34.133524+00:00 · methodology

discussion (0)

