Learned Relay Representations for Forward-Thinking Discrete Diffusion Models

Andrew McCallum; Avishek Joey Bose; Benjamin Rozonoyer; Dhruvesh Patel; Jacopo Minniti; Neil Band; Tim G. J. Rudner

arxiv: 2605.22967 · v2 · pith:AJFMFUXMnew · submitted 2026-05-21 · 💻 cs.LG

Benjamin Rozonoyer, Jacopo Minniti, Dhruvesh Patel, Neil Band, Avishek Joey Bose, Tim G. J. Rudner, Andrew McCallum

Pith reviewed 2026-05-25 05:55 UTC · model grok-4.3

keywords: diffusion language models, masked diffusion, relay representations, truncated BPTT, inference optimization, coding tasks, discrete diffusion

The pith

Masked diffusion models can propagate latent information across denoising steps using a learned per-token relay channel.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Learned Relay Representations to keep masked diffusion models from discarding their internal computation between refinement steps. Rather than resetting at every step, the model learns a differentiable channel that passes information forward, trained with truncated backpropagation through time. The design is first justified on a Sudoku planning task and then scaled to Fast-dLLM v2, where it outperforms supervised fine-tuning on coding tasks and reduces inference latency by up to 32 percent. The approach is compatible with existing techniques such as block diffusion and KV caching.

Core claim

By introducing a differentiable per-token channel trained via truncated BPTT, diffusion language models can explicitly learn to relay latent information forward across decoding steps, advancing the performance-latency Pareto frontier when applied to state-of-the-art models like Fast-dLLM v2.

What carries the argument

Learned Relay Representations: a differentiable per-token channel that passes information between forward passes, trained via truncated backpropagation through time.
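
The channel described above can be written as a recurrence, using the notation of the paper's Figure 1 caption. This is only a sketch of the data flow: the symbol Unembed_θ, the clean target x_0, and the per-step loss ℓ_k are names assumed here for illustration and are not taken from the source.

```latex
\begin{aligned}
h_{k+1} &= f_\theta\big(\mathrm{Emb}_\theta(x_{t_k}) + R_\theta(h_k)\big), \\
\ell_k  &= \mathrm{CE}\big(\mathrm{Unembed}_\theta(h_{k+1}),\; x_0\big).
\end{aligned}
```

Read this way, the hidden state h_{k+1} serves two purposes at once: it produces the logits for step k, and it is projected by R_θ into the input of step k+1.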

If this is right

The framework scales to state-of-the-art Diffusion Language Models.
Relay is compatible with block diffusion and KV caching.
It outperforms standard supervised finetuning on coding tasks.
Inference latency is reduced by up to 32 percent.
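
As a worked conversion of the last figure, a latency reduction of 32 percent can be restated as a speedup factor. The symbols t_base and t_relay are hypothetical names for the baseline and Relay latencies, and because the source says "up to", the result is an upper bound rather than a typical value.

```latex
\frac{t_{\text{base}}}{t_{\text{relay}}}
  = \frac{1}{1 - 0.32}
  = \frac{1}{0.68}
  \approx 1.47
```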

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The relay channel could potentially extend to other iterative generation methods that recompute states at each step.
Sudoku-based training of the relay might act as a proxy task for improving structured reasoning in language models.
Optimizing the channel length or structure could yield further latency gains on longer sequences.

Load-bearing premise

That the Relay design choices justified on the Sudoku task, including training the channel via truncated BPTT, will carry over to language modeling without introducing instability or requiring extensive additional hyperparameter search.

What would settle it

If applying Relay to Fast-dLLM v2 yields no performance gain over standard supervised finetuning or fails to reduce inference latency, the central scalability claim would be falsified.

Figures

Figures reproduced from arXiv: 2605.22967 by Andrew McCallum, Avishek Joey Bose, Benjamin Rozonoyer, Dhruvesh Patel, Jacopo Minniti, Neil Band, Tim G. J. Rudner.

**Figure 1.** Schematic of Relay over two consecutive inference steps. At each step k, the backbone f_θ consumes the sum of embedded tokens Emb_θ(x_{t_k}) and the projected relay state R_θ(h_k), producing a hidden state h_{k+1} that is both unembedded into logits for the cross-entropy loss and forwarded through the relay module R_θ (orange path) into the next step. Tokens are progressively unmasked between steps (e.g. [M]→f at s…

**Figure 2.** Accuracy-NFE frontier on Sudoku-Extreme validation. Each curve traces a single training method as we sweep the inference confidence threshold τ ∈ {0.05, 0.10, 0.15, 0.20, 0.25}. A lower τ commits fewer cells per forward pass and so spends more NFEs (rightward), and vice-versa. Shaded ribbons denote ±1 sample standard deviation across three training seeds. that augments each puzzle with a step-by-step solve…
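
The trade-off in the Figure 2 caption can be illustrated with a toy decoding loop. This is a minimal sketch under stated assumptions, not the paper's decoder: it assumes a cell is committed once its uncertainty (one minus its top probability) falls to τ or below, it replaces the model with a hypothetical `decay` factor that makes every remaining cell more certain after each pass, and it commits the single most certain cell when none qualifies so that the loop always terminates. The function name `count_nfes` and the input values are made up for illustration.

```python
def count_nfes(initial_uncertainty, tau, decay=0.7):
    """Count forward passes (NFEs) until every masked cell is committed.

    Assumptions (not from the paper): a cell is committed when its
    uncertainty is <= tau, and each pass shrinks uncertainty by `decay`.
    """
    uncertainty = dict(enumerate(initial_uncertainty))  # masked cells only
    nfes = 0
    while uncertainty:
        nfes += 1
        # Toy stand-in for one model forward pass.
        uncertainty = {i: u * decay for i, u in uncertainty.items()}
        ready = [i for i, u in uncertainty.items() if u <= tau]
        if not ready:
            # Fallback: commit the single most certain cell to guarantee progress.
            ready = [min(uncertainty, key=uncertainty.get)]
        for i in ready:
            del uncertainty[i]
    return nfes


cells = [0.9, 0.6, 0.4, 0.3, 0.2]  # hypothetical starting uncertainties
for tau in (0.05, 0.10, 0.15, 0.20, 0.25):  # sweep values from the caption
    print(tau, count_nfes(cells, tau))
```

Under these assumptions a stricter (lower) τ lets fewer cells through per pass, so the loop needs more passes, which matches the direction described in the caption.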

**Figure 3.** GPU memory during one training micro-step of Fast-dLLM v2 on an A100 80GB. Solid lines show the live GPU memory at every decoder-layer forward/backward hook. Dashed lines show the running maximum of live memory within the same micro-step (high-water mark). Phase labels (fwd, fwd2, bwd) mark each phase’s plateau. Relay carries higher live memory through fwd2, but its peak (≈ 20.1 GiB) lands within ≈ 1 GiB of …
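
The dashed "high-water mark" curve in the Figure 3 caption is a running maximum of the solid live-memory curve. The sketch below shows that relationship on a made-up trace; the function name `high_water_mark` and the sample values are hypothetical, and how the paper actually collects its per-hook measurements is not stated in the source.

```python
from itertools import accumulate


def high_water_mark(live_gib):
    """Return the running maximum of a live-memory trace (one value per hook)."""
    return list(accumulate(live_gib, max))


# Hypothetical live-memory readings in GiB, one per hook call.
trace = [5.0, 12.0, 9.0, 20.1, 14.0]
print(high_water_mark(trace))
```

The running maximum never decreases, so its final value is the peak memory of the micro-step, which is the quantity the caption compares across methods.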

Original abstract

When Masked Diffusion Models (MDMs) generate sequences through iterative refinement, the rich internal computation over masked positions is discarded, forcing every subsequent refinement step to recompute the valuable internal information stored as model representations. To avoid a hard reset between denoising rounds, we propose Learned Relay Representations (Relay), a method that allows MDMs to be forward-thinking when denoising by explicitly learning how to propagate latent information for the benefit of future denoising steps. Relay introduces a differentiable per-token channel that passes information between forward passes and is trained via truncated backpropagation through time (BPTT). We show that this framework can be scaled to state-of-the-art Diffusion Language Models (DLMs), and is seamlessly compatible with techniques like block diffusion and KV caching. We first provide a thorough justification of the design choices in Relay on a challenging Sudoku-based planning task. We then scale Relay to Fast-dLLM v2, a state-of-the-art DLM, outperforming standard supervised finetuning on coding tasks while reducing inference latency by up to 32%. Our empirical results demonstrate that state-of-the-art DLMs can be explicitly trained to relay latent information forward across decoding steps, advancing the performance-latency Pareto frontier. We provide code for all our experiments.
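
To make the training idea concrete, here is a toy scalar analogue of truncated BPTT over a carried state. It is a sketch under stated assumptions and not the paper's implementation: the recurrence `h_next = tanh(a * x + r * h)`, the squared-error loss, the parameters `a` and `r`, and the function name `tbptt_grads` are all hypothetical, and gradients are derived by hand so the example needs no autograd library. The only point it illustrates is the truncation: the state is carried forward across every step, but gradients stop flowing at each window boundary.

```python
import math


def tbptt_grads(a, r, xs, targets, window):
    """Loss and gradients for h_next = tanh(a*x + r*h), truncated every `window` steps."""
    h = 0.0
    loss = grad_a = grad_r = 0.0
    for start in range(0, len(xs), window):
        steps = range(start, min(start + window, len(xs)))
        trace = []
        # Forward through the window; h entering the window is treated as a constant.
        for k in steps:
            h_next = math.tanh(a * xs[k] + r * h)
            trace.append((k, h, h_next))
            loss += 0.5 * (h_next - targets[k]) ** 2
            h = h_next
        # Backward through the same window only.
        dh = 0.0  # gradient arriving from later steps inside this window
        for k, h_prev, h_next in reversed(trace):
            dh += h_next - targets[k]
            dpre = dh * (1.0 - h_next ** 2)  # derivative of tanh
            grad_a += dpre * xs[k]
            grad_r += dpre * h_prev
            dh = dpre * r  # passed to the previous step, dropped at the boundary
    return loss, grad_a, grad_r


xs = [0.5, -1.0, 0.3, 0.8]       # hypothetical inputs
targets = [0.1, -0.2, 0.4, 0.0]  # hypothetical targets
print(tbptt_grads(0.7, 0.9, xs, targets, window=2))
```

With `window` equal to the sequence length this reduces to full BPTT, while smaller windows trade gradient fidelity for lower memory and compute. The window size used in the paper is not given in the text reproduced here.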

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Relay adds a trainable per-token channel with truncated BPTT to carry state across denoising steps in MDMs, with claims of scaling to Fast-dLLM v2 and 32% latency cuts on coding, but the abstract leaves controls and transfer details thin.

The main point is that the authors introduce Learned Relay Representations: a differentiable per-token channel trained with truncated BPTT so that masked diffusion models can pass useful internal state forward instead of recomputing from scratch at every denoising round. They first validate the design on a Sudoku planning task, then plug it into Fast-dLLM v2 and report gains over plain supervised finetuning plus up to 32% lower inference latency, while claiming easy compatibility with block diffusion and KV caching. Code is released, which is useful on its own.

Referee Report

3 major / 1 minor

Summary. The paper introduces Learned Relay Representations (Relay) for Masked Diffusion Models (MDMs), which learns a differentiable per-token channel to propagate latent information between denoising steps, trained using truncated backpropagation through time (BPTT). Design choices are justified on a Sudoku-based planning task before scaling to Fast-dLLM v2, where Relay is claimed to outperform standard supervised finetuning on coding tasks, reduce inference latency by up to 32%, and remain compatible with block diffusion and KV caching. The manuscript provides code for all experiments.

Significance. Should the empirical findings prove robust, this work has the potential to advance diffusion language models by enabling explicit forward propagation of useful latent states, improving both performance and efficiency. The release of code is a positive aspect that supports reproducibility and further research in the area.

major comments (3)

[Scaling experiments to Fast-dLLM v2] The transfer of the relay channel learned via truncated BPTT on Sudoku to the larger Fast-dLLM v2 model is central to the claims of outperformance and latency reduction, yet the manuscript does not report any analysis of training stability, convergence issues, or the extent of hyperparameter retuning required, leaving open the possibility that gains are not solely attributable to the relay mechanism.
[Empirical results on coding tasks] The abstract and results claim outperforming SFT and 32% latency reduction, but provide no details on experimental controls, number of runs, statistical significance, or ablation studies, which undermines the ability to assess the reliability of these load-bearing empirical results.
[Compatibility with optimizations] The claim of seamless compatibility with block diffusion and KV caching lacks specific implementation details or ablations showing that the per-token relay channel integrates without performance degradation or additional overhead.

minor comments (1)

[Abstract] The abstract could benefit from a brief mention of the scale of the Sudoku task or key hyperparameters to provide context for the justification step.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for major revision. We address each major comment below with proposed changes to the manuscript.

Referee: [Scaling experiments to Fast-dLLM v2] The transfer of the relay channel learned via truncated BPTT on Sudoku to the larger Fast-dLLM v2 model is central to the claims of outperformance and latency reduction, yet the manuscript does not report any analysis of training stability, convergence issues, or the extent of hyperparameter retuning required, leaving open the possibility that gains are not solely attributable to the relay mechanism.

Authors: The Sudoku task served to rigorously justify the relay design under controlled planning conditions before scaling. We agree that explicit discussion of the transfer would strengthen the claims. In the revision we will add a subsection describing the hyperparameter transfer process, observed convergence on Fast-dLLM v2, and the limited retuning performed, while noting that the code release permits independent verification of stability. revision: yes

Referee: [Empirical results on coding tasks] The abstract and results claim outperforming SFT and 32% latency reduction, but provide no details on experimental controls, number of runs, statistical significance, or ablation studies, which undermines the ability to assess the reliability of these load-bearing empirical results.

Authors: We acknowledge that the manuscript does not currently report the requested experimental details. The released code contains the full evaluation pipeline. In the revision we will expand the experimental section to specify the number of runs, any observed variance, and additional ablations isolating the relay channel's contribution to both accuracy and latency gains. revision: yes

Referee: [Compatibility with optimizations] The claim of seamless compatibility with block diffusion and KV caching lacks specific implementation details or ablations showing that the per-token relay channel integrates without performance degradation or additional overhead.

Authors: The per-token relay channel is architecturally orthogonal to block processing and KV caching. We agree that concrete details are needed. The revision will include pseudocode for the integration, explicit statements on state maintenance across blocks, and any measured overhead from our existing experiments. revision: yes

Circularity Check

0 steps flagged

No significant circularity; method is empirically trained and evaluated

The paper introduces Learned Relay Representations as a new differentiable per-token channel trained via truncated BPTT. Design choices are justified empirically on a Sudoku planning task, then the module is scaled and evaluated on Fast-dLLM v2 for coding tasks, with reported gains over SFT and latency reductions. No derivation chain reduces a claimed result to its own fitted inputs by construction, no self-citation is load-bearing for a uniqueness claim, and no ansatz or renaming is smuggled in. The central claims rest on external empirical benchmarks rather than tautological reparameterization. This is the expected self-contained case for a trainable architectural addition.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the effectiveness of the learned relay channel and its training procedure. The abstract details no explicit free parameters or invented entities beyond the relay channel itself; the single axiom listed here is a domain assumption implied by the training method.

axioms (1)

domain assumption: Truncated backpropagation through time suffices to train the relay channel without vanishing or exploding gradients across denoising steps.
Abstract states the training method relies on truncated BPTT.
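
The gradient concern behind this assumption follows from the chain rule applied to the relay recurrence in the paper's Figure 1 caption. The derivation below is a generic sketch, not an analysis from the paper: K (number of denoising steps), T (truncation window), and z_k (the backbone input at step k) are symbols assumed here for illustration.

```latex
\frac{\partial h_K}{\partial h_0}
  = \prod_{k=0}^{K-1} \frac{\partial h_{k+1}}{\partial h_k},
\qquad
\frac{\partial h_{k+1}}{\partial h_k}
  = \left.\frac{\partial f_\theta}{\partial z}\right|_{z_k}
    \left.\frac{\partial R_\theta}{\partial h}\right|_{h_k},
\qquad
z_k = \mathrm{Emb}_\theta(x_{t_k}) + R_\theta(h_k)

\left\| \frac{\partial h_K}{\partial h_0} \right\|
  \le \prod_{k=0}^{K-1} \left\| \frac{\partial h_{k+1}}{\partial h_k} \right\|
```

The product is ordered with later steps on the left. If the per-step Jacobian norms stay below one the bound shrinks geometrically in K, and if they exceed one the product can grow geometrically. Truncating backpropagation to a window of T steps caps the number of factors at T, which limits both effects at the cost of ignoring dependencies longer than the window.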

pith-pipeline@v0.9.0 · 5772 in / 1286 out tokens · 28303 ms · 2026-05-25T05:55:50.454217+00:00 · methodology
