pith. machine review for the scientific record. sign in

arxiv: 2605.07820 · v2 · submitted 2026-05-08 · 💻 cs.LG

Recognition: 2 theorem links

· Lean Theorem

Scaling Categorical Flow Maps

Authors on Pith no claims yet

Pith reviewed 2026-05-12 04:01 UTC · model grok-4.3

classification 💻 cs.LG
keywords categorical flow mapsflow matchinglanguage modelingself-distillationfew-step samplingdiscrete diffusionscalinglikelihood bounds
0
0 comments X

The pith

Categorical flow maps scale to 1.7 billion parameters, enabling high-quality text generation in four inference steps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that flow matching can be scaled for language modeling by first training a large continuous flow model and then distilling it into a faster categorical version. A 1.7 billion parameter base model trained on 2.1 trillion tokens is self-distilled to produce diverse text in as few as four steps while preserving token entropy close to the training data level. The work also derives a likelihood bound for these models in the semi-discrete case, allowing them to be scored on standard language modeling benchmarks at levels comparable to discrete diffusion methods. Challenges specific to large-scale training are identified along with guidance on loss weighting and time scheduling to address them.

Core claim

By training a 1.7B-parameter base flow model on 2.1T tokens and self-distilling it into a CFM, the authors achieve generation of diverse, high-quality text in as few as 4 inference steps while maintaining near-data-level token entropy. They further provide a likelihood bound for CFMs in the semi-discrete setting and demonstrate that these models can score competitively on standard LM benchmarks, on par with discrete diffusion methods.

What carries the argument

Self-distillation of a large flow model into a Categorical Flow Map (CFM) that matches flow from Gaussian noise to one-hot encoded data for fast discrete sampling, together with a derived likelihood bound for evaluation.

If this is right

  • CFMs support few-step sampling for language models while retaining sample diversity close to the data.
  • The likelihood bound allows direct scoring of CFMs on perplexity and other LM benchmarks.
  • Insights on loss weighting and scheduling stabilize training of these models at billion-parameter scale.
  • Performance stays in the same range as discrete diffusion methods for both generation and evaluation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the scaling holds, flow-based methods could enable faster inference than autoregressive decoding in some language applications.
  • The semi-discrete likelihood bound might extend to evaluation of other continuous-discrete hybrid models.
  • Self-distillation techniques shown here could be tested on scaling flow models for other discrete sequences such as code or biological data.

Load-bearing premise

The self-distillation process combined with the selected loss weighting and time schedule preserves the base model's performance and diversity without introducing new biases at this scale.

What would settle it

Measuring the 4-step CFM on held-out data and finding either token entropy well below training data levels or benchmark scores outside the range achieved by comparable discrete diffusion models would falsify the central claim.

read the original abstract

Continuous diffusion and flow matching models could represent a powerful alternative to autoregressive approaches for language modelling (LM), as they unlock a host of advantages currently reserved for continuous modalities, including accelerated sampling and tilting. Recently, several works have demonstrated the possibility of generating discrete data continuously by a simple flow matching process between a Gaussian and the one-hot encoded data distribution. They have further shown the feasibility of accelerated sampling via Categorical Flow Maps (CFMs), resulting in competitive sample quality in the few-step regime. However, this method had only been evaluated at relatively modest scales ($<1$B), leaving the question of its scalability completely open. In this article, we train a $1.7$B-parameter base flow model on $2.1$T tokens and self-distill it into a CFM that generates diverse, high-quality text in as few as $4$ inference steps while maintaining near-data-level token entropy. Furthermore, we introduce a likelihood bound for CFMs in the semi-discrete setting, and show that they can be used to score the model on standard LM benchmarks, achieving results in the same range as discrete diffusion methods. Finally, we uncover some of the challenges that arise from training these models at scale, and we provide prescriptive insights on loss weighting and time scheduling.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims to scale Categorical Flow Maps (CFMs) for language modeling by training a 1.7B-parameter base flow model on 2.1T tokens, then self-distilling it into a CFM that produces diverse, high-quality text in as few as 4 inference steps while retaining near-data-level token entropy. It introduces a likelihood bound for CFMs in the semi-discrete setting that enables scoring on standard LM benchmarks at levels comparable to discrete diffusion methods, and derives prescriptive guidance on loss weighting and time scheduling after identifying scale-related training challenges.

Significance. If the central scaling and distillation results hold with supporting evidence, the work would establish that CFMs can reach large LM scales and deliver competitive few-step sampling, providing a non-autoregressive alternative with potential advantages in speed and flexibility. The semi-discrete likelihood bound would be a useful addition for model evaluation, and the scaling insights could inform training of other continuous generative models on discrete data.

major comments (3)
  1. [Abstract and §5 (Self-Distillation Experiments)] The abstract and experimental claims assert that the distilled CFM maintains 'near-data-level token entropy' at 4 steps, yet no quantitative comparison (e.g., entropy values or histograms) of the base 1.7B flow model versus the distilled CFM versus the data distribution is supplied; without this, the retention of diversity after self-distillation cannot be verified.
  2. [§4 (Training at Scale) and §5.1 (Loss Weighting)] The paper states that challenges at scale were uncovered and that specific loss weighting and time scheduling were derived to address them, but provides no ablation tables comparing the chosen scheme against alternatives (or against uniform weighting) on metrics such as entropy collapse, mode coverage, or benchmark scores; this leaves the prescriptive guidance unsupported.
  3. [§3.3 (Likelihood Bound Derivation) and Table 2 (Benchmark Results)] The introduced likelihood bound for the semi-discrete CFM setting is used to report LM benchmark results 'in the same range as discrete diffusion,' but no validation of bound tightness (e.g., comparison to Monte Carlo estimates or exact likelihoods on a held-out set) or sensitivity analysis to the number of steps is given, weakening the reliability of the reported scores.
minor comments (2)
  1. [§2 (Preliminaries)] Define the semi-discrete setting and the precise form of the flow map more explicitly in the introduction or preliminaries to make the transition from continuous flow matching to categorical data clearer for readers.
  2. [Figure 4 and Figure 5] Include the base flow model and data entropy as explicit reference lines in all entropy and diversity plots so that the 'near-data-level' claim can be visually assessed.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and describe the revisions we will make to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Abstract and §5 (Self-Distillation Experiments)] The abstract and experimental claims assert that the distilled CFM maintains 'near-data-level token entropy' at 4 steps, yet no quantitative comparison (e.g., entropy values or histograms) of the base 1.7B flow model versus the distilled CFM versus the data distribution is supplied; without this, the retention of diversity after self-distillation cannot be verified.

    Authors: We agree that explicit quantitative entropy comparisons would allow direct verification of diversity retention. Although the manuscript supports the claim through benchmark performance and qualitative observations, we will add to Section 5 a table and accompanying histograms reporting token entropy for the base 1.7B model, the distilled CFM at 4 steps (and other step counts), and the empirical data distribution. revision: yes

  2. Referee: [§4 (Training at Scale) and §5.1 (Loss Weighting)] The paper states that challenges at scale were uncovered and that specific loss weighting and time scheduling were derived to address them, but provides no ablation tables comparing the chosen scheme against alternatives (or against uniform weighting) on metrics such as entropy collapse, mode coverage, or benchmark scores; this leaves the prescriptive guidance unsupported.

    Authors: We acknowledge that the prescriptive guidance would be more robust with explicit ablations. The weighting and scheduling choices were informed by observed instabilities during our 1.7B-scale training runs. We will add an appendix containing ablation tables that compare our scheme against uniform weighting and selected alternatives, reporting effects on entropy, mode coverage, and benchmark scores. revision: yes

  3. Referee: [§3.3 (Likelihood Bound Derivation) and Table 2 (Benchmark Results)] The introduced likelihood bound for the semi-discrete CFM setting is used to report LM benchmark results 'in the same range as discrete diffusion,' but no validation of bound tightness (e.g., comparison to Monte Carlo estimates or exact likelihoods on a held-out set) or sensitivity analysis to the number of steps is given, weakening the reliability of the reported scores.

    Authors: We recognize the value of validating bound tightness. We will expand Section 3.3 and the appendix with Monte Carlo likelihood estimates on a held-out set together with a sensitivity analysis across discretization step counts. These additions will directly support the reliability of the scores in Table 2. revision: yes

Circularity Check

0 steps flagged

No significant circularity; claims rest on empirical training and a new bound

full rationale

The paper's core claims derive from direct large-scale training of a 1.7B base flow model on 2.1T tokens, followed by self-distillation into a CFM, plus introduction of a semi-discrete likelihood bound used for LM benchmark scoring. These steps are presented as experimental outcomes and a novel theoretical contribution rather than reductions to self-definitions, fitted inputs renamed as predictions, or load-bearing self-citations. The abstract and skeptic analysis reference external comparisons to discrete diffusion methods and note challenges uncovered at scale, with prescriptive guidance on weighting and scheduling emerging from the experiments themselves. No equations or derivations in the provided text collapse by construction to prior inputs or author-specific uniqueness theorems.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that flow matching can be effectively applied to discrete categorical data at large scales, with specific training adjustments for loss and time.

free parameters (2)
  • loss weighting scheme
    Prescriptive insights on loss weighting for training at scale
  • time scheduling
    Insights on time scheduling to address challenges at scale
axioms (2)
  • domain assumption Categorical flow maps can be obtained by flow matching between Gaussian and one-hot encoded data
    Based on prior works mentioned in abstract
  • domain assumption Self-distillation preserves the quality of the base flow model
    Used to create the CFM

pith-pipeline@v0.9.0 · 5540 in / 1637 out tokens · 65110 ms · 2026-05-12T04:01:52.545020+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

59 extracted references · 59 canonical work pages

  1. [1]

    2023 , eprint=

    Stochastic Interpolants: A Unifying Framework for Flows and Diffusions , author=. 2023 , eprint=

  2. [2]

    2026 , eprint=

    Categorical Flow Maps , author=. 2026 , eprint=

  3. [3]

    2026 , eprint=

    Discrete Flow Maps , author=. 2026 , eprint=

  4. [4]

    2026 , eprint=

    Flow Map Language Models: One-step Language Modeling via Continuous Denoising , author=. 2026 , eprint=

  5. [5]

    2023 , eprint=

    Flow Matching for Generative Modeling , author=. 2023 , eprint=

  6. [6]

    2025 , eprint=

    Variational Flow Matching for Graph Generation , author=. 2025 , eprint=

  7. [7]

    2025 , eprint=

    Flow map matching with stochastic interpolants: A mathematical framework for consistency models , author=. 2025 , eprint=

  8. [8]

    2025 , eprint=

    How to build a consistency model: Learning flow maps via self-distillation , author=. 2025 , eprint=

  9. [9]

    2025 , eprint=

    Mean Flows for One-step Generative Modeling , author=. 2025 , eprint=

  10. [10]

    2020 , eprint=

    Denoising Diffusion Probabilistic Models , author=. 2020 , eprint=

  11. [11]

    2015 , eprint=

    Deep Unsupervised Learning using Nonequilibrium Thermodynamics , author=. 2015 , eprint=

  12. [12]

    2021 , eprint=

    Structured Denoising Diffusion Models in Discrete State-Spaces , author=. 2021 , eprint=

  13. [13]

    2021 , eprint=

    Argmax Flows and Multinomial Diffusion: Learning Categorical Distributions , author=. 2021 , eprint=

  14. [14]

    2024 , eprint=

    Simple and Effective Masked Diffusion Language Models , author=. 2024 , eprint=

  15. [15]

    2024 , eprint=

    Simplified and Generalized Masked Diffusion for Discrete Data , author=. 2024 , eprint=

  16. [16]

    2024 , eprint=

    Discrete Flow Matching , author=. 2024 , eprint=

  17. [17]

    2022 , eprint=

    A Continuous Time Framework for Discrete Denoising Models , author=. 2022 , eprint=

  18. [18]

    2022 , eprint=

    Classifier-Free Diffusion Guidance , author=. 2022 , eprint=

  19. [19]

    2023 , eprint=

    Consistency Models , author=. 2023 , eprint=

  20. [20]

    2025 , eprint=

    Adjoint Matching: Fine-tuning Flow and Diffusion Generative Models with Memoryless Stochastic Optimal Control , author=. 2025 , eprint=

  21. [21]

    2024 , eprint=

    Fisher Flow Matching for Generative Modeling over Discrete Data , author=. 2024 , eprint=

  22. [22]

    2024 , eprint=

    Dirichlet Flow Matching with Applications to DNA Sequence Design , author=. 2024 , eprint=

  23. [23]

    2024 , eprint=

    Categorical Flow Matching on Statistical Manifolds , author=. 2024 , eprint=

  24. [24]

    2026 , eprint=

    Meta Flow Maps enable scalable reward alignment , author=. 2026 , eprint=

  25. [25]

    2026 , eprint=

    Diamond Maps: Efficient Reward Alignment via Stochastic Flow Maps , author=. 2026 , eprint=

  26. [26]

    2026 , eprint=

    Scaling Beyond Masked Diffusion Language Models , author=. 2026 , eprint=

  27. [27]

    2023 , eprint=

    Scalable Diffusion Models with Transformers , author=. 2023 , eprint=

  28. [28]

    2025 , eprint=

    NVIDIA Nemotron Nano 2: An Accurate and Efficient Hybrid Mamba-Transformer Reasoning Model , author=. 2025 , eprint=

  29. [29]

    2021 , eprint=

    MAUVE: Measuring the Gap Between Neural Text and Human Text using Divergence Frontiers , author=. 2021 , eprint=

  30. [30]

    2013 , eprint=

    Auto-Encoding Variational Bayes , author=. 2013 , eprint=

  31. [31]

    2019 , eprint=

    Neural Ordinary Differential Equations , author=. 2019 , eprint=

  32. [32]

    2025 , eprint=

    The Diffusion Duality , author=. 2025 , eprint=

  33. [33]

    2018 , eprint=

    FFJORD: Free-form Continuous Dynamics for Scalable Reversible Generative Models , author=. 2018 , eprint=

  34. [34]

    2022 , eprint=

    Continuous diffusion for categorical data , author=. 2022 , eprint=

  35. [35]

    2022 , eprint=

    Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer , author=. 2022 , eprint=

  36. [36]

    2026 , eprint=

    Don't be lazy: CompleteP enables compute-efficient deep transformers , author=. 2026 , eprint=

  37. [37]

    2025 , eprint=

    Completed Hyperparameter Transfer across Modules, Width, Depth, Batch and Duration , author=. 2025 , eprint=

  38. [38]

    2023 , eprint=

    Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution , author=. 2023 , eprint=

  39. [39]

    2025 , eprint=

    Large Language Diffusion Models , author=. 2025 , eprint=

  40. [40]

    2025 , eprint=

    ReDi: Rectified Discrete Flow , author=. 2025 , eprint=

  41. [41]

    2025 , eprint=

    Distillation of Discrete Diffusion through Dimensional Correlations , author=. 2025 , eprint=

  42. [42]

    2024 , eprint=

    One Step Diffusion via Shortcut Models , author=. 2024 , eprint=

  43. [43]

    2026 , eprint=

    Esoteric Language Models: Bridging Autoregressive and Masked Diffusion LLMs , author=. 2026 , eprint=

  44. [44]

    2022 , eprint=

    Diffusion-LM Improves Controllable Text Generation , author=. 2022 , eprint=

  45. [45]

    2025 , eprint=

    Terminal Velocity Matching , author=. 2025 , eprint=

  46. [46]

    2023 , eprint=

    RoFormer: Enhanced Transformer with Rotary Position Embedding , author=. 2023 , eprint=

  47. [47]

    2025 , eprint=

    Beyond Autoregression: Fast LLMs via Self-Distillation Through Time , author=. 2025 , eprint=

  48. [48]

    2024 , eprint=

    Consistency Models Made Easy , author=. 2024 , eprint=

  49. [49]

    Hutchinson

    M.F. Hutchinson , title =. Communications in Statistics - Simulation and Computation , volume =. 1990 , publisher =. doi:10.1080/03610919008812866 , URL =

  50. [50]

    2023 , eprint=

    Improved Techniques for Training Consistency Models , author=. 2023 , eprint=

  51. [51]

    2025 , eprint=

    CANDI: Hybrid Discrete-Continuous Diffusion Models , author=. 2025 , eprint=

  52. [52]

    2022 , eprint=

    Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow , author=. 2022 , eprint=

  53. [53]

    2025 , eprint=

    Qwen2.5 Technical Report , author=. 2025 , eprint=

  54. [54]

    Language Models are Unsupervised Multitask Learners , author=

  55. [55]

    Cut Your Losses in Large-Vocabulary Language Models , booktitle =

    Erik Wijmans and Brody Huval and Alexander Hertzberg and Vladlen Koltun and Philipp Kr. Cut Your Losses in Large-Vocabulary Language Models , booktitle =. 2025 , url =

  56. [56]

    2021 , eprint=

    Variational Diffusion Models , author=. 2021 , eprint=

  57. [57]

    arXiv preprint arXiv:2505.15270 , year=

    Scaling Diffusion Transformers Efficiently via mu P , author=. arXiv preprint arXiv:2505.15270 , year=

  58. [58]

    2024 , eprint=

    Simple Guidance Mechanisms for Discrete Diffusion Models , author=. 2024 , eprint=

  59. [59]

    2026 , eprint=

    LangFlow: Continuous Diffusion Rivals Discrete in Language Modeling , author=. 2026 , eprint=