pith. machine review for the scientific record.

arxiv: 2604.18739 · v1 · submitted 2026-04-20 · 💻 cs.LG · stat.ML

Recognition: unknown

Discrete Tilt Matching

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 04:27 UTC · model grok-4.3

classification 💻 cs.LG stat.ML
keywords discrete tilt matching · masked diffusion LLMs · likelihood-free fine-tuning · reward tilting · control variates · unmasking posteriors · dLLM

The pith

Discrete Tilt Matching recasts dLLM fine-tuning as state-level matching of local unmasking posteriors under reward tilting to sidestep intractable marginal likelihoods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Masked diffusion large language models resist standard reinforcement learning fine-tuning because their sequence-level marginal likelihoods cannot be computed tractably. The paper derives Discrete Tilt Matching as a likelihood-free alternative that instead aligns the local unmasking posteriors at each state after tilting them by the reward. This produces a weighted cross-entropy objective whose minimizer is known in closed form and that can incorporate control variates for stable optimization. Experiments confirm that an annealing schedule together with those control variates prevents mode collapse on a planning task, while scaling the method to an 8B-parameter model improves results on Sudoku and Countdown without hurting performance on MATH500 or GSM8K.

Core claim

We derive Discrete Tilt Matching (DTM), a likelihood-free method that recasts dLLM fine-tuning as state-level matching of local unmasking posteriors under reward tilting. DTM takes the form of a weighted cross-entropy objective with explicit minimizer, and admits control variates that improve training stability. On a synthetic maze-planning task, we analyze how DTM's annealing schedule and control variates affect training stability and prevent mode collapse. At scale, fine-tuning LLaDA-8B-Instruct with DTM yields strong gains on Sudoku and Countdown while remaining competitive on MATH500 and GSM8K.

What carries the argument

Discrete Tilt Matching objective, a weighted cross-entropy loss that aligns reward-tilted local unmasking posteriors at each state.
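
To make the shape of that objective concrete, here is a minimal sketch of a reward-weighted cross-entropy of the kind the abstract describes. It is an illustration under assumptions, not the paper's implementation: the tensor layout, the batch-softmax weighting, and the tilt parameter are all ours.

```python
import torch
import torch.nn.functional as F

def dtm_style_loss(logits, targets, mask, seq_rewards, tilt=1.0):
    """Hypothetical reward-weighted cross-entropy over masked positions.

    logits:      (B, L, V) student predictions at every position
    targets:     (B, L)    reference unmasking targets (e.g. sampled fills)
    mask:        (B, L)    bool, True where a position is still masked
    seq_rewards: (B,)      scalar reward for each sampled completion
    """
    # Reward tilting via self-normalized exponential weights; subtracting
    # any constant from the rewards leaves these unchanged, which is one
    # simple way a control-variate-style baseline can enter the loss.
    w = torch.softmax(tilt * seq_rewards, dim=0)              # (B,)

    per_tok = F.cross_entropy(
        logits.transpose(1, 2), targets, reduction="none")    # (B, L)
    per_seq = (per_tok * mask).sum(1) / mask.sum(1).clamp(min=1)
    return (w * per_seq).sum()
```

In population terms, a weighted cross-entropy of this shape is minimized by the weighted mixture of its target distributions, which is the sense in which the abstract's "explicit minimizer" claim is plausible.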

If this is right

  • DTM can be applied to any masked diffusion model without requiring approximations to full-sequence marginals.
  • Control variates and annealing schedules keep training stable and avoid mode collapse during reward-based fine-tuning (a schematic of the annealed phase structure follows this list).
  • An 8B model fine-tuned with DTM improves on structured tasks such as Sudoku and Countdown.
  • Performance on standard math benchmarks stays competitive after the same fine-tuning procedure.
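
The phase structure referenced above is described concretely under Figure 4 below: a frozen teacher π_a per a ↦ a + h phase, with the student trained until it approximates the slightly more tilted π_{a+h}, at which point the anneal advances. A schematic of that loop, where every name and the inner update are illustrative assumptions rather than the paper's Algorithm 1:

```python
import copy

def train_dtm_annealed(model, opt, sample_fn, reward_fn, loss_fn,
                       h=6.0, a_max=60.0, steps_per_phase=500):
    """Schematic annealed fine-tuning loop (illustrative, not Algorithm 1).

    The tilt level a grows by h per phase. Within a phase the teacher pi_a
    is frozen; the student is trained so that at the phase boundary it
    approximates the slightly more tilted pi_{a+h}.
    """
    a = 0.0
    while a < a_max:
        teacher = copy.deepcopy(model).eval()       # freeze pi_a for this phase
        for _ in range(steps_per_phase):
            batch = sample_fn(teacher)              # completions drawn from pi_a
            rewards = reward_fn(batch)              # (B,) scalar rewards
            centered = rewards - rewards.mean()     # control-variate baseline
            loss = loss_fn(model, teacher, batch, centered, tilt=h)
            opt.zero_grad()
            loss.backward()
            opt.step()
        a += h                                      # advance: pi_a -> pi_{a+h}
    return model
```

This matches the reward trace described under Figure 4: roughly constant within a phase, since the sampler is frozen, and jumping at each phase boundary.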

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The state-level formulation could allow finer-grained reward shaping than sequence-level methods permit.
  • Similar tilting and matching steps may transfer to other discrete diffusion or non-autoregressive generators.
  • The explicit control variates might be reusable in other likelihood-free alignment settings on discrete data.

Load-bearing premise

Matching state-level local unmasking posteriors under reward tilting is sufficient to achieve effective sequence-level fine-tuning without access to marginal likelihoods.
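
Stated symbolically, in our notation rather than the paper's, the premise is a telescoping claim:

```latex
\[
  \tilde{\pi}\!\left(z^{0}\right) \;\propto\; \pi\!\left(z^{0}\right) e^{r\left(z^{0}\right)},
  \qquad
  \pi\!\left(z^{0}\right) \;=\; \sum_{\text{unmasking paths}} \;
  \prod_{t=T}^{1} p\!\left(z^{t-1}\mid z^{t}\right),
\]
where $z^{T}$ is fully masked and $z^{0}$ is the completed sequence.
If each local posterior is replaced by a tilted one,
\[
  \tilde{p}\!\left(z^{t-1}\mid z^{t}\right) \;\propto\;
  p\!\left(z^{t-1}\mid z^{t}\right) w_{t}\!\left(z^{t-1}\right),
\]
the premise is that the weights $w_{t}$ can be chosen consistently so that the
per-step factors telescope to $e^{r\left(z^{0}\right)}$ at the sequence level.
```

Whether that telescoping holds exactly rather than approximately is what the referee's first major comment asks the authors to prove.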

What would settle it

A head-to-head comparison of DTM against an exact sequence-level RL method on a small dLLM whose marginal likelihoods can be computed exactly, checking whether the two arrive at the same reward-tilted sequence distribution rather than merely similar benchmark scores.
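
Such a comparison is feasible because, for a toy any-order masked model over a tiny vocabulary and short length, the sequence marginal can be brute-forced: with one token unmasked per step under a uniformly random order, p(z) is the average over all orders of the product of per-step probabilities. A hypothetical sketch, where the step_prob interface is an assumption rather than any real library's API:

```python
import itertools
import math

def exact_sequence_marginal(step_prob, z):
    """Brute-force p(z) for a toy any-order masked model.

    step_prob(z, revealed, pos) is an assumed interface: the model's
    probability of token z[pos] at position pos, given that the positions
    in `revealed` are already unmasked. One token is unmasked per step,
    with the order drawn uniformly at random.
    """
    L = len(z)
    total = 0.0
    for order in itertools.permutations(range(L)):
        p, revealed = 1.0, set()
        for pos in order:
            p *= step_prob(z, frozenset(revealed), pos)
            revealed.add(pos)
        total += p
    return total / math.factorial(L)

def tilted_target(step_prob, reward, all_seqs):
    """Exact reward-tilted distribution over an enumerable sequence space."""
    unnorm = {z: exact_sequence_marginal(step_prob, z) * math.exp(reward(z))
              for z in all_seqs}
    Z = sum(unnorm.values())
    return {z: v / Z for z, v in unnorm.items()}
```

Comparing a DTM-trained model's sample distribution against this exact tilted target, in total variation say, would test the load-bearing premise directly rather than through benchmark scores.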

Figures

Figures reproduced from arXiv: 2604.18739 by Jaeyeon Kim, Michael S. Albergo, Peter Potaptchik, Shiyi Wang, Yuyuan Chen.

Figure 1
Figure 1. Evaluation accuracy of DTM and baseline methods on benchmarks. All methods generate at length 256 in 128 denoising steps.
Figure 2
Figure 2. Comparison of performance on the maze planning task for DTM with and without the control variate.
Figure 3
Figure 3. Ablation on annealing step size h on Countdown. Left: correct fraction on the evaluation set across model checkpoints. Right: training reward trajectory. A moderate step size h = 6 achieves the best result.
Figure 4
Figure 4. Wallclock comparison of DTM and SPG on Sudoku, both trained on 8 H100 GPUs. DTM attains a higher reward and is more efficient. Because the DTM reward is evaluated under the frozen model π_a within each a ↦ a + h phase, it is roughly constant per phase and jumps at each phase boundary when π_a is updated to π_θ ≈ π_{a+h}, as in Algorithm 1. The SPG reward is evaluated on the training batch.
Figure 5
Figure 5. Proportion of valid paths (a), mean rewards (b), and diversity of paths (c) against degree of tilt a, for three settings of the control variate c and annealing step h. A small step size with control variate 1 yields higher path diversity, validity, and reward.
Figure 6
Figure 6. Effective generation length of DTM versus RL baselines. With direct training on state-level posteriors, DTM achieves stable reasoning.
Figure 7
Figure 7. The fixed 41-by-41 maze with door fraction 0.4.
Figure 8
Figure 8. Prompt used for MATH500 and GSM8K. The problem statement is appended directly after the template.
Figure 9
Figure 9. Prompt used for Countdown. In each instance, the list [38, 92, 52] is replaced by the numbers in the actual question and 78 by the actual target value.
Figure 10
Figure 10. Sudoku prompt template, using 3-shot prompting: three solved puzzle exemplars are inserted, and the evaluation set uses underlying solutions disjoint from the exemplars.
Figure 11
Figure 11. Comparison of performance on Countdown for DTM with a random interpolant versus a SAR-aligned interpolant.
read the original abstract

Masked diffusion large language models (dLLMs) are a promising alternative to autoregressive generation. While reinforcement learning (RL) methods have recently been adapted to dLLM fine-tuning, their objectives typically depend on sequence-level marginal likelihoods, which are intractable for masked diffusion models. To address this, we derive Discrete Tilt Matching (DTM), a likelihood-free method that recasts dLLM fine-tuning as state-level matching of local unmasking posteriors under reward tilting. DTM takes the form of a weighted cross-entropy objective with explicit minimizer, and admits control variates that improve training stability. On a synthetic maze-planning task, we analyze how DTM's annealing schedule and control variates affect training stability and prevent mode collapse. At scale, fine-tuning LLaDA-8B-Instruct with DTM yields strong gains on Sudoku and Countdown while remaining competitive on MATH500 and GSM8K.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper derives Discrete Tilt Matching (DTM), a likelihood-free objective for fine-tuning masked diffusion LLMs (dLLMs) by recasting the problem as state-level matching of local unmasking posteriors under reward tilting. DTM is expressed as a weighted cross-entropy loss with an explicit minimizer and admits control variates for improved stability. Experiments on a synthetic maze task examine the effects of annealing and control variates on stability and mode collapse, while scaling to LLaDA-8B-Instruct shows gains on Sudoku and Countdown with competitive performance on MATH500 and GSM8K.

Significance. If the local matching is proven to induce the correct global sequence-level distribution, DTM would provide a valuable practical tool for dLLM fine-tuning by avoiding intractable marginal likelihoods. The explicit minimizer and control variates represent clear strengths for training. The reported empirical improvements on reasoning tasks are promising, though the synthetic results focus mainly on stability rather than confirming global optimality.

major comments (2)
  1. Abstract: The central derivation recasts dLLM fine-tuning as state-level matching of local unmasking posteriors under reward tilting to yield a weighted cross-entropy with explicit minimizer, but no steps are shown establishing that this local objective's fixed point coincides with the reward-tilted sequence measure (i.e., that the Markov chain of unmasking steps preserves the global path measure under tilting).
  2. Synthetic maze-planning task: The analysis probes how the annealing schedule and control variates affect stability and prevent mode collapse, but does not test whether the DTM fixed point equals the desired global optimum under the sequence-level reward-tilted distribution, leaving the sufficiency of local matching unverified.
minor comments (2)
  1. Experimental details on the precise form of the control variates, the functional form of the annealing schedule, and the exact baselines used for comparison are insufficient for full reproducibility.
  2. Quantitative results on the large-scale tasks (e.g., exact accuracy deltas, standard deviations, or number of runs) are not reported in the abstract or summary, weakening the strength of the 'strong gains' claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive review. The comments highlight an important aspect of the derivation that would benefit from greater explicitness. We address each major comment below and will revise the manuscript to incorporate the requested clarifications and proof steps.

read point-by-point responses
  1. Referee: Abstract: The central derivation recasts dLLM fine-tuning as state-level matching of local unmasking posteriors under reward tilting to yield a weighted cross-entropy with explicit minimizer, but no steps are shown establishing that this local objective's fixed point coincides with the reward-tilted sequence measure (i.e., that the Markov chain of unmasking steps preserves the global path measure under tilting).

    Authors: We agree that the manuscript would be strengthened by including explicit intermediate steps in the derivation. In the revised version we will expand Section 3 to contain a dedicated subsection that proves the fixed point of the local DTM objective coincides with the reward-tilted sequence measure. The argument proceeds by showing that the unmasking process is a Markov chain on states and that consistent tilting of the local posteriors at each step preserves the global path measure; we will write out the telescoping product of the tilted transition probabilities and demonstrate that the minimizer of the weighted cross-entropy is exactly the desired tilted distribution. This addition directly addresses the concern. revision: yes

  2. Referee: Synthetic maze-planning task: The analysis probes how the annealing schedule and control variates affect stability and prevent mode collapse, but does not test whether the DTM fixed point equals the desired global optimum under the sequence-level reward-tilted distribution, leaving the sufficiency of local matching unverified.

    Authors: The synthetic maze experiments were designed to isolate and quantify the effects of annealing schedules and control variates on training stability and mode collapse, which are practically critical for scaling DTM. We acknowledge that these runs do not empirically confirm global optimality. With the expanded theoretical proof we will add (per the first comment), the equivalence between local and global objectives will be established analytically. In the revision we will clarify the purpose of the synthetic section and explicitly reference the theoretical guarantee, so that readers understand the experiments address implementation issues rather than the sufficiency proof itself. revision: partial

Circularity Check

0 steps flagged

No circularity: DTM derived as independent objective with explicit minimizer

full rationale

The paper presents DTM as a derived likelihood-free objective that recasts dLLM fine-tuning via state-level matching of tilted local unmasking posteriors, expressed as a weighted cross-entropy with an explicit closed-form minimizer. This construction does not reduce to fitted inputs renamed as predictions, self-definitional loops, or load-bearing self-citations; the abstract and description show a forward derivation from the masked diffusion Markov structure to the new loss without tautological equivalence to prior parameters. The sufficiency of local-to-global induction is a separate modeling claim, not a circularity in the derivation chain itself.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The derivation relies on standard assumptions in masked diffusion models and introduces control variates as part of the method.

free parameters (1)
  • annealing schedule
    Mentioned as affecting stability on the synthetic task; it likely requires tuning.
axioms (1)
  • domain assumption: the unmasking process in dLLMs allows for local posterior computation
    Assumed to enable state-level matching of posteriors.
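
This axiom is concrete in practice: for a masked dLLM, the local unmasking posterior at a state is read off from one forward pass on the partially masked sequence, as a per-position softmax. A hedged sketch, with the model interface assumed rather than taken from LLaDA or any specific codebase:

```python
import torch

@torch.no_grad()
def local_unmasking_posterior(model, tokens, mask_id):
    """Per-position posteriors over fills at the still-masked positions.

    Assumes `model(tokens)` returns logits of shape (B, L, V) for a
    partially masked input; real dLLMs differ in interface details.
    """
    logits = model(tokens)                   # (B, L, V)
    probs = torch.softmax(logits, dim=-1)    # posterior at every position
    still_masked = tokens.eq(mask_id)        # (B, L) bool
    return probs, still_masked               # caller restricts to masked slots
```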

pith-pipeline@v0.9.0 · 5458 in / 1123 out tokens · 45614 ms · 2026-05-10T04:27:32.531006+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

37 extracted references · 13 canonical work pages · 4 internal anchors

  1. [1] Arel. 2025.
  2. [2] Decoupled Weight Decay Regularization. ICLR 2019.
  3. [3] Training Verifiers to Solve Math Word Problems. arXiv:2110.14168, 2021.
  4. [4] Generative Flows on Discrete State-Spaces: Enabling Multimodal Flows with Applications to Protein Co-Design. ICML 2024.
  5. [5] DiffuCoder: Understanding and Improving Masked Diffusion Models for Code Generation. arXiv:2506.20639, 2025.
  6. [6] Dream-Coder 7B: An Open Diffusion Language Model for Code. arXiv:2509.01142, 2025.
  7. [7] LLaDA 2.0: Scaling Up Diffusion Language Models to 100B. 2025.
  8. [8] Stable-DiffCoder: Pushing the Frontier of Code Diffusion Large Language Model. 2026.
  9. [9] Potaptchik, Peter; Lee, Cheuk-Kit; Albergo, Michael S. Tilt Matching for Scalable Sampling and Fine-Tuning.
  10. [10] LoRA: Low-Rank Adaptation of Large Language Models. ICLR 2022.
  11. [11] Dettmers, Tim; Pagnoni, Artidoro; Holtzman, Ari; Zettlemoyer, Luke. QLoRA: Efficient Finetuning of Quantized LLMs. NeurIPS 2023.
  12. [12] Shao, Zhihong; Wang, Peiyi; Zhu, Qihao; Xu, Runxin; Song, Junxiao; Zhang, Mingchuan; Li, Y. K.; Wu, Y.; Guo, Daya. DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. arXiv:2402.03300, 2024.
  13. [13] Proximal Diffusion Neural Sampler. 2025.
  14. [14] SSD-LM: Semi-autoregressive Simplex-based Diffusion Language Model for Text Generation and Modular Control. ACL 2023, doi:10.18653/v1/2023.acl-long.647.
  15. [15] Train for the Worst, Plan for the Best: Understanding Token Ordering in Masked Diffusions. ICML 2025.
  16. [16] Any-Order Flexible Length Masked Diffusion. arXiv:2509.01025, 2025.
  17. [17] Let's Verify Step by Step. ICLR 2024.
  18. [18] Reinforcing Diffusion Models by Direct Group Preference Optimization. arXiv:2510.08425, 2025.
  19. [19] Large Language Diffusion Models. NeurIPS 2025.
  20. [20] Pan, Jiayi; Zhang, Junjie; Wang, Xingyao; Yuan, Lifan. 2025.
  21. [21] Improving Reasoning for Diffusion Language Models via Group Diffusion Policy Optimization. arXiv:2510.08554, 2025.
  22. [22] Flow Matching with General Discrete Paths: A Kinetic-Optimal Perspective. ICLR 2025.
  23. [23] wd1: Weighted Policy Optimization for Reasoning in Diffusion Language Models. arXiv:2507.08838, 2025.
  24. [24] SPG: Sandwiched Policy Gradient for Masked Diffusion Language Models. arXiv:2510.09541, 2025.
  25. [25] d2: Improved Techniques for Training Reasoning Diffusion Language Models. arXiv:2509.21474, 2025.
  26. [26] MMaDA: Multimodal Large Diffusion Language Models. NeurIPS 2025.
  27. [27] d1: Scaling Reasoning in Diffusion Large Language Models via Reinforcement Learning. NeurIPS 2025.
  28. [28] Masked Diffusion Models Are Secretly Time-Agnostic Masked Models and Exploit Inaccurate Categorical Sampling. arXiv:2409.02908, 2024.
  29. [29] LLaDA 1.5: Variance-Reduced Preference Optimization for Large Language Diffusion Models. arXiv:2505.19223, 2025.
  30. [30] Simple and Effective Masked Diffusion Language Models. 2024.
  31. [31] Simplified and Generalized Masked Diffusion for Discrete Data. 2024.
  32. [32] MaskGIT: Masked Generative Image Transformer. 2022.
  33. [33] Discrete Flow Matching. 2024.
  34. [34] Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models. 2025.
  35. [35] Proximal Policy Optimization Algorithms. 2017.
  36. [36] Training Language Models to Follow Instructions with Human Feedback. 2022.
  37. [37] Direct Preference Optimization: Your Language Model Is Secretly a Reward Model. 2024.