arxiv: 2602.10346 · v2 · submitted 2026-02-10 · 💻 cs.CL · cs.LG

Recognition: 2 theorem links

· Lean Theorem

Geometry-Aware Decoding with Wasserstein-Regularized Truncation and Mass Penalties for Large Language Models

Arash Gholami Davoodi , Navid Rezazadeh , Seyed Pouyan Mousavi Davoudi , Pouya Pezeshkpour

Authors on Pith no claims yet

Pith reviewed 2026-05-16 05:05 UTC · model grok-4.3

classification 💻 cs.CL cs.LG

keywords Wasserstein truncationLLM decodingtoken embeddingsgeometry-aware samplingmass-entropy tradeoffclosed-form cropTop-W methodopen-ended generation

0 comments

The pith

Wasserstein distance over token embeddings yields a closed-form truncation rule that balances mass and entropy in LLM decoding.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that standard truncation samplers ignore the geometric layout of token embeddings and therefore fail to preserve logical coherence while keeping diversity. Top-W instead measures closeness of the cropped distribution to the original via Wasserstein distance on those embeddings and adds an explicit mass-versus-entropy penalty. The resulting optimization admits a simple closed-form solution: the retained set is either a single token or a one-dimensional prefix that a linear scan can locate. When paired with nearest-set or k-NN potentials and an alternating decoder, the method improves accuracy and judge-scored creativity on four benchmarks without changing the usual sampling interface.

Core claim

Top-W is a geometry-aware truncation rule that uses Wasserstein distance defined over token-embedding geometry to keep the cropped distribution close to the original while explicitly balancing retained probability mass against the entropy of the kept set. The theory yields a simple closed-form structure for the fixed-potential subset update: depending on the mass-entropy trade-off, the optimal crop either collapses to a single token or takes the form of a one-dimensional prefix that can be found efficiently with a linear scan.

What carries the argument

Wasserstein-regularized truncation over token-embedding geometry, which enforces distributional closeness while trading off retained mass against kept-set entropy.

If this is right

The optimal retained set reduces to either a singleton or a prefix located by linear scan.
The truncation routine integrates into existing sampling pipelines via an alternating decode loop.
Gains appear on both accuracy tasks (GSM8K, GPQA) and judge-based creativity evaluations (AlpacaEval, MT-Bench).
The same geometry-based potentials can be swapped between nearest-set and k-NN without interface changes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If token embeddings continue to improve during pre-training, the same Wasserstein penalty may become even more effective without changing the decoder.
The mass-entropy trade-off parameter could be made dynamic, increasing entropy early in a response and tightening it for final answer tokens.
Similar geometric penalties might be applied inside beam search or speculative decoding to prune candidates by embedding distance rather than probability alone.

Load-bearing premise

Wasserstein distance computed from token embeddings meaningfully captures the semantic relationships required for logical coherence during open-ended generation.

What would settle it

Run Top-W versus standard top-p on a controlled set of prompts where the highest-probability tokens are far apart in embedding space yet required for the next correct reasoning step; measure whether coherence or accuracy falls more sharply under Top-W.

Figures

Figures reproduced from arXiv: 2602.10346 by Arash Gholami Davoodi, Navid Rezazadeh, Pouya Pezeshkpour, Seyed Pouyan Mousavi Davoudi.

**Figure 1.** Figure 1: Alpaca accuracy across temperatures for Min-p, Top-p, Top-H, and Top-W (aggregated over 4 runs). As we can see in the bar plot, Min-p, Top-p, Top-H and Top-W (our method) win in 0,0,1,8 tuples of (T, model) out of 9 tuples, respectively [PITH_FULL_IMAGE:figures/full_fig_p007_1.png] view at source ↗

**Figure 2.** Figure 2: MT-Bench judge scores across temperatures for Min-p, Top-p, Top-H, and Top-W (aggregated over 4 runs). As we can see in the bar plot, Min-p, Top-p, Top-H and Top-W (our method) win in 0,1,2,6 tuples of (T, model) out of 9 tuples, respectively. for some models, consistent with MT-Bench’s multi-turn coherence and instruction retention being more brittle under increased stochasticity. AlpacaEval often rewards… view at source ↗

**Figure 3.** Figure 3: GSM8K accuracy sensitivity of Top-W to β for fixed λs and LLaMA3.1-8B-Instruct at T ∈ {1.0, 1.5, 2.0}. with typicality or online controllers—e.g., locally typical sampling, η-sampling, Mirostat, and Min-p—but still rely primarily on ordered probabilities and/or entropy rather than an explicit token geometry (Meister et al., 2022; Hewitt et al., 2022; Basu et al., 2020; Nguyen et al., 2024). Entropy-bounded… view at source ↗

read the original abstract

Large language models (LLMs) must balance diversity and creativity against logical coherence in open-ended generation. Existing truncation-based samplers are effective but largely heuristic, relying mainly on probability mass and entropy while ignoring semantic geometry of the token space. We present Top-W, a geometry-aware truncation rule that uses Wasserstein distance-defined over token-embedding geometry-to keep the cropped distribution close to the original, while explicitly balancing retained probability mass against the entropy of the kept set. Our theory yields a simple closed-form structure for the fixed-potential subset update: depending on the mass-entropy trade-off, the optimal crop either collapses to a single token or takes the form of a one-dimensional prefix that can be found efficiently with a linear scan. We implement Top-W using efficient geometry-based potentials (nearest-set or k-NN) and pair it with an alternating decoding routine that keeps the standard truncation-and-sampling interface unchanged. Extensive experiments on four benchmarks (GSM8K, GPQA, AlpacaEval, and MT-Bench) across three instruction-tuned models show that Top-W consistently outperforms prior state-of-the-art decoding approaches achieving up to 33.7% improvement. Moreover, we find that Top-W not only improves accuracy-focused performance, but also boosts creativity under judge-based open-ended evaluation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Top-W adds Wasserstein distance on token embeddings to truncation and claims a closed-form prefix crop, but the geometry may not deliver the promised coherence gains.

read the letter

The paper's core move is to replace heuristic truncation with an objective that penalizes Wasserstein distance in the token embedding space while balancing retained probability mass against entropy of the kept set. They derive that the optimum is either a single token or a contiguous prefix recoverable by linear scan, which would keep the method efficient. That combination of geometry and closed-form structure is new relative to standard top-p or entropy-based rules. They implement it with nearest-set or k-NN potentials, keep the usual sampling interface, and report up to 33.7 percent gains on GSM8K, GPQA, AlpacaEval, and MT-Bench across three models, plus better judge scores on open-ended tasks. Those numbers are concrete and the efficiency claim is worth checking. The main weakness is that token embeddings largely encode co-occurrence statistics rather than logical or causal structure, so nothing guarantees the Wasserstein level sets will preserve the tokens needed for coherent reasoning. If the prefix property fails on real embeddings, the method collapses to a costly search and the reported improvements cannot be credited to the geometry. The mass-entropy trade-off parameter is also free and could easily have been tuned on the same evaluation sets used to claim gains. This is for people working on decoding algorithms who want a more principled alternative to heuristics. A reader focused on formal derivations or reproducible sampling improvements would find the formulation useful even if the experiments need tighter controls. It deserves a serious referee to verify the derivation steps and check whether the geometry actually aligns with the tasks.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces Top-W, a geometry-aware truncation method for LLM decoding that employs Wasserstein distance on token embeddings to balance retained probability mass and entropy of the kept set. It derives a closed-form structure for the optimal crop, which is either a single token or a one-dimensional prefix found via linear scan. The method is implemented with nearest-set or k-NN potentials and evaluated on GSM8K, GPQA, AlpacaEval, and MT-Bench, showing up to 33.7% improvement over state-of-the-art approaches while also enhancing creativity.

Significance. Should the closed-form derivation prove robust and the embedding geometry meaningfully capture the semantic distances needed for coherence, this work could advance decoding strategies by integrating geometric structure into truncation rules. The efficient implementation and consistent gains across accuracy and open-ended tasks represent a notable contribution, particularly the preservation of the standard sampling interface.

major comments (3)

[Theory section (closed-form derivation)] The assertion that the Wasserstein-regularized objective yields a prefix crop relies on the embedding geometry inducing appropriate orderings. Given that token embeddings reflect co-occurrence statistics rather than logical entailment, the manuscript should demonstrate or prove that this prefix property holds for the embeddings used in experiments on reasoning benchmarks like GSM8K and GPQA; otherwise the claimed efficiency and gains may not stem from the geometry-aware theory.
[Experimental setup and results] The mass-entropy trade-off parameter is a free parameter in the method. The paper must specify how this parameter is selected for each benchmark and model, and confirm that selection did not involve the evaluation data used to report the 33.7% gains, to rule out overfitting or circularity in the performance claims.
[Ablation studies] To substantiate that the Wasserstein distance is load-bearing for the improvements, an ablation comparing Top-W to a version without the geometric potential (i.e., standard mass-entropy truncation) is necessary. Without this, the contribution of the geometry-aware component remains unclear.

minor comments (2)

[Abstract] The phrase 'fixed-potential subset update' is used without definition; a short explanation in the abstract or introduction would improve accessibility.
[Implementation details] Clarify the exact form of the 'nearest-set or k-NN' potentials and how they approximate the Wasserstein distance in practice.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below and outline the revisions we will make to strengthen the manuscript.

read point-by-point responses

Referee: [Theory section (closed-form derivation)] The assertion that the Wasserstein-regularized objective yields a prefix crop relies on the embedding geometry inducing appropriate orderings. Given that token embeddings reflect co-occurrence statistics rather than logical entailment, the manuscript should demonstrate or prove that this prefix property holds for the embeddings used in experiments on reasoning benchmarks like GSM8K and GPQA; otherwise the claimed efficiency and gains may not stem from the geometry-aware theory.

Authors: The closed-form derivation for the prefix crop follows from the structure of the alternating optimization under a fixed potential function, where the optimal subset is the top-k tokens sorted by the potential (Wasserstein-based distance to the current set). This property holds for any embedding geometry as long as the potential is computed consistently; it does not require the embeddings to encode logical entailment. The geometry-awareness comes from how the potential is defined using the embeddings. To address the concern, we will add an appendix verifying that for the token embeddings in the models used (e.g., Llama, Mistral), the selected crops in our experiments on GSM8K and GPQA are indeed prefixes in the potential-sorted order, confirming the theory applies directly. revision: yes
Referee: [Experimental setup and results] The mass-entropy trade-off parameter is a free parameter in the method. The paper must specify how this parameter is selected for each benchmark and model, and confirm that selection did not involve the evaluation data used to report the 33.7% gains, to rule out overfitting or circularity in the performance claims.

Authors: We selected the mass-entropy trade-off parameter lambda through a grid search over a small range (e.g., {0.1, 0.5, 1.0, 2.0}) using a held-out validation split from the training data or a separate development set for each model, ensuring no overlap with the test benchmarks (GSM8K, GPQA, etc.). The chosen values are reported in the revised experimental section. This procedure avoids any use of the evaluation data for hyperparameter tuning. revision: yes
Referee: [Ablation studies] To substantiate that the Wasserstein distance is load-bearing for the improvements, an ablation comparing Top-W to a version without the geometric potential (i.e., standard mass-entropy truncation) is necessary. Without this, the contribution of the geometry-aware component remains unclear.

Authors: We agree that this ablation is essential to isolate the contribution of the geometric component. In the revised manuscript, we will include an ablation study where we replace the Wasserstein-based potential with a uniform or probability-only potential, effectively reducing to standard mass-entropy truncation, and compare performance on the same benchmarks. Preliminary results indicate that the geometric potential accounts for a significant portion of the observed gains. revision: yes

Circularity Check

1 steps flagged

Closed-form truncation structure depends on tuned mass-entropy parameter from evaluation data

specific steps

fitted input called prediction [Abstract]
"Our theory yields a simple closed-form structure for the fixed-potential subset update: depending on the mass-entropy trade-off, the optimal crop either collapses to a single token or takes the form of a one-dimensional prefix that can be found efficiently with a linear scan."

The closed-form is presented as theory-derived, but the mass-entropy trade-off parameter is tuned on the evaluation benchmarks to achieve the 33.7% improvement, rendering the structure statistically forced by the fitted input rather than a genuine prediction from the Wasserstein geometry alone.

full rationale

The paper presents a theoretical derivation yielding a closed-form subset update under Wasserstein regularization and mass-entropy balancing. However, the key trade-off parameter is selected to optimize reported gains on the same benchmarks (GSM8K, GPQA, etc.) used for evaluation. This makes the claimed structure a fitted outcome rather than an independent first-principles result, introducing partial circularity without reducing the entire derivation to tautology.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that embedding geometry supplies a useful distance for coherence; the mass-entropy trade-off is a tunable parameter whose value is not derived from first principles.

free parameters (1)

mass-entropy trade-off parameter
Explicitly balances retained probability mass against entropy of the kept set in the optimal crop rule.

axioms (1)

domain assumption Token embeddings form a geometry in which Wasserstein distance reflects semantic closeness relevant to generation coherence
Invoked to define the geometry-aware truncation objective.

pith-pipeline@v0.9.0 · 5557 in / 1129 out tokens · 65107 ms · 2026-05-16T05:05:01.174568+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Theorem 3.4 (Exact fixed-f S-step: prefix regime vs. singleton regime). ... the optimal crop either collapses to a single token or takes the form of a one-dimensional prefix
IndisputableMonolith/Foundation/BranchSelection.lean branch_selection unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Fλ,β(S) := W1(p,qS) + λH(qS) − βlogΓS

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

21 extracted references · 21 canonical work pages · 11 internal anchors

[1]

Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone

Marah Abdin et al. Phi-3 technical report: A highly capable language model locally on your phone.arXiv preprint arXiv:2404.14219,

work page internal anchor Pith review Pith/arXiv arXiv
[2]

Mirostat: A perplexity-controlled neural text decoding algorithm

Sourya Basu, Govardana Sachitanandam Ramachandran, Nitish Shirish Keskar, and Lav R Varshney. Mirostat: A perplexity-controlled neural text decoding algorithm. arXiv preprint arXiv:2007.14966,

work page arXiv 2007
[3]

Lan- guage models are few-shot learners.Advances in neural information processing systems, 33:1877–1901,

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Lan- guage models are few-shot learners.Advances in neural information processing systems, 33:1877–1901,

work page 1901
[4]

Dola: Decoding by contrasting layers improves factuality in large language models.arXiv preprint arXiv:2309.03883,

Yung-Sung Chuang, Yujia Xie, Hongyin Luo, Yoon Kim, James Glass, and Pengcheng He. Dola: Decoding by contrasting layers improves factuality in large language models.arXiv preprint arXiv:2309.03883,

work page arXiv
[5]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168,

work page internal anchor Pith review Pith/arXiv arXiv
[6]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blis- tein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gem- ini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261,

work page internal anchor Pith review Pith/arXiv arXiv
[7]

Plug and play language models: A simple approach to controlled text generation.arXiv preprint arXiv:1912.02164,

Sumanth Dathathri, Andrea Madotto, Janice Lan, Jane Hung, Eric Frank, Piero Molino, Jason Yosinski, and Rosanne Liu. Plug and play language models: A simple approach to controlled text generation.arXiv preprint arXiv:1912.02164,

work page arXiv 1912
[8]

Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators

Yann Dubois, Bal ´azs Galambosi, Percy Liang, and Tat- sunori B Hashimoto. Length-controlled alpacaeval: A simple way to debias automatic evaluators.arXiv preprint arXiv:2404.04475,

work page internal anchor Pith review Pith/arXiv arXiv
[9]

Hierarchical Neural Story Generation

Angela Fan, Mike Lewis, and Yann Dauphin. Hierarchical neural story generation.arXiv preprint arXiv:1805.04833,

work page internal anchor Pith review Pith/arXiv arXiv
[10]

Arthur Gretton, Dino Sejdinovic, Heiko Strathmann, Sivaraman Balakrishnan, Massimiliano Pontil, Kenji Fukumizu, and Bharath K Sriperumbudur

URL https: //zenodo.org/records/12608602. Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhi- nav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783,

work page arXiv
[11]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reason- ing capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948,

work page internal anchor Pith review Pith/arXiv arXiv
[12]

Truncation sampling as language model desmoothing

John Hewitt, Christopher D Manning, and Percy Liang. Truncation sampling as language model desmoothing. arXiv preprint arXiv:2210.15191,

work page arXiv
[13]

The Curious Case of Neural Text Degeneration

9 Top-WDecoding Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The curious case of neural text degeneration.arXiv preprint arXiv:1904.09751,

work page internal anchor Pith review Pith/arXiv arXiv 1904
[14]

Gedi: Generative discriminator guided sequence generation

Ben Krause, Akhilesh Deepak Gotmare, Bryan McCann, Nitish Shirish Keskar, Shafiq Joty, Richard Socher, and Nazneen Fatema Rajani. Gedi: Generative discriminator guided sequence generation. InFindings of the Associa- tion for Computational Linguistics: EMNLP 2021, pages 4929–4952,

work page 2021
[15]

Typical decoding for natural language generation

Clara Meister, Tiago Pimentel, Gian Wiher, and Ryan Cot- terell. Typical decoding for natural language generation. arXiv preprint arXiv:2202.00666,

work page arXiv
[16]

Turning up the heat: Min-p sampling for creative and coherent llm outputs.arXiv preprint arXiv:2407.01082,

Minh Nhat Nguyen, Andrew Baker, Clement Neo, Allen Roush, Andreas Kirsch, and Ravid Shwartz-Ziv. Turning up the heat: Min-p sampling for creative and coherent llm outputs.arXiv preprint arXiv:2407.01082,

work page arXiv
[17]

Top-H Decoding: Adapting the Creativity and Coherence with Bounded Entropy in Text Generation

Erfan Baghaei Potraghloo, Seyedarmin Azizi, Souvik Kundu, and Massoud Pedram. Top-h decoding: Adapting the creativity and coherence with bounded entropy in text generation.arXiv preprint arXiv:2509.02510,

work page internal anchor Pith review Pith/arXiv arXiv
[18]

OpenAI GPT-5 System Card

Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. Openai gpt-5 system card.arXiv preprint arXiv:2601.03267,

work page internal anchor Pith review Pith/arXiv arXiv
[19]

Contrastive search is what you need for neural text generation.arXiv preprint arXiv:2210.14140,

Yixuan Su and Nigel Collier. Contrastive search is what you need for neural text generation.arXiv preprint arXiv:2210.14140,

work page arXiv
[20]

Diverse Beam Search: Decoding Diverse Solutions from Neural Sequence Models

Ashwin K Vijayakumar, Michael Cogswell, Ramprasath R Selvaraju, Qing Sun, Stefan Lee, David Crandall, and Dhruv Batra. Diverse beam search: Decoding diverse solutions from neural sequence models.arXiv preprint arXiv:1610.02424,

work page internal anchor Pith review Pith/arXiv arXiv
[21]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chen- gen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388,

work page internal anchor Pith review Pith/arXiv arXiv