pith. machine review for the scientific record. sign in

arxiv: 2605.11726 · v2 · submitted 2026-05-12 · 💻 cs.LG

Recognition: no theorem link

Block-R1: Rethinking the Role of Block Size in Multi-domain Reinforcement Learning for Diffusion Large Language Models

Ruihong Qiu, Yan Jiang, Zi Huang

Pith reviewed 2026-05-13 06:57 UTC · model grok-4.3

classification 💻 cs.LG
keywords block sizediffusion large language modelsreinforcement learningmulti-domainpost-trainingdomain conflictGRPOrollout
0
0 comments X

The pith

Block size conflicts across domains degrade the effectiveness of rollout-based RL post-training for diffusion large language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that block size, which sets the parallel decoding step in diffusion LLMs, creates a domain conflict when multiple domains are mixed during RL post-training. Different domains favor different block sizes for their rollout trajectories, and a single fixed size therefore limits gains from algorithms such as GRPO. To quantify and mitigate the problem, the authors build the Block-R1-41K dataset in which each sample is paired with its individually best block size, introduce a Block Size Conflict Score, and release the Block-R1 benchmark. They then show that simply using these per-sample block sizes during cross-domain training improves results over uniform-block baselines across many datasets and RL methods.

Core claim

In multi-domain RL for diffusion LLMs there is a domain block size conflict: the block size that produces the strongest rollout trajectories differs across domains, and this mismatch substantially reduces post-training effectiveness; assigning each training sample the block size that performed best for it during dataset construction removes the conflict and yields stronger cross-domain RL performance.

What carries the argument

Per-sample best-improved block size assignment, which replaces a single shared block size with an individualized choice for every training example to avoid domain-level trajectory mismatches.

If this is right

  • Cross-domain RL post-training for dLLMs improves when each sample uses its individually optimal block size instead of a shared value.
  • The Block-R1 benchmark supports standardized testing of RL algorithms in both single-domain and cross-domain regimes for diffusion LLMs.
  • A Block Size Conflict Score derived from the dataset measures the severity of domain mismatch for any given collection of tasks.
  • Performance gains appear across 13 datasets, 7 RL algorithms, and multiple dLLM backbones when the per-sample block sizes are applied.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same per-sample selection idea could be tested on other generation hyperparameters such as temperature or guidance scale in multi-domain RL.
  • The conflict score might serve as a simple criterion for deciding whether two domains can be safely trained together or should be kept separate.
  • Dynamic adjustment of block size during the RL loop itself, rather than fixing it before training, is a natural next experiment suggested by the static per-sample results.

Load-bearing premise

The block size found best for a sample in the initial dataset construction remains the best choice for that sample when it is later used inside joint multi-domain RL training.

What would settle it

Running the proposed cross-domain RL method with per-sample block sizes produces no improvement or lower final performance than a single fixed block size on the same multi-domain mixture.

Figures

Figures reproduced from arXiv: 2605.11726 by Ruihong Qiu, Yan Jiang, Zi Huang.

Figure 1
Figure 1. Figure 1: Motivation for Block-R1. Multi-domain RL refers to using all six domains as training [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Additional visualisations for domain block size conflict in multi-domain RL for dLLMs. (a) Relationship between BCS and multi-domain RL performance, where each point denotes domain pairs under vanilla fixed-block mix-domain RL and a larger BCS relates to stronger performance degradation. (b) Pairwise domain block size conflict visualisation, where darker red cells indicate stronger block size conflict betw… view at source ↗
Figure 3
Figure 3. Figure 3: Development of RL methods for dLLMs. Existing [PITH_FULL_IMAGE:figures/full_fig_p017_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Motivation for Block-R1. Multi-domain RL refers to using all six domains as training [PITH_FULL_IMAGE:figures/full_fig_p021_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Average reward improvement under different training block sizes. For each domain and each block size c, the bar shows the mean teacher-student improvement E[∆(x, c) | x ∼ Dk], where ∆(x, c) = AθT (x, c) − AθS (x, c). Error bars denote 95% confidence intervals. The results show that block size significantly affects the reward improvement obtained during dLLM RL post-training across different domains. Some d… view at source ↗
Figure 6
Figure 6. Figure 6: Probability distribution of best-improved training block sizes per domain. Each cell shows the domain￾level training block size preference dis￾tribution P train k (c) defined in Equation 9. The dLLM is LLaDA2-16B. Darker cells indicate higher probability for the block size to be the best-improved block size. To further demonstrate domain-level block size preference in dLLM RL post-training, we visualise th… view at source ↗
Figure 7
Figure 7. Figure 7: Detailed domain-pair legend for BCS analysis. Each point denotes one pair of training domains used for vanilla fixed-block mix-domain RL with StableDRL. The y-axis reports the mean performance change between mix-domain RL and the corresponding single-domain RL results over the two domains [PITH_FULL_IMAGE:figures/full_fig_p023_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Detailed Illustration of Block-R1-41K Dataset Construction. Block-R1 constructs a [PITH_FULL_IMAGE:figures/full_fig_p023_8.png] view at source ↗
read the original abstract

Recently, reinforcement learning (RL) has been widely applied during post-training for diffusion large language models (dLLMs) to enhance reasoning with block-wise semi-autoregressive generation. Block size has therefore become a vital factor in dLLMs, since it determines the parallel decoding granularity and affects the rollout trajectories during RL optimisation, e.g., GRPO. Instead of investigating the effect of block size during inference on individual domains, this paper studies block size from a domain conflict perspective for dLLM RL post-training in multi-domain scenarios. The main contributions are: (1) a formulation of domain block size conflict in multi-domain RL for dLLMs, which will largely affect the post-training effectiveness for rollout-based RL methods; (2) a novel dataset, Block-R1-41K is constructed with a best-improved training block size for each sample, which also induces a Block Size Conflict Score to quantitatively measure the domain conflict; (3) a new benchmark, Block-R1, for flexible RL post-training for dLLMs in both single and cross domain; and (4) a simple yet powerful cross-domain post-training method with sample-level best-improved training block sizes. Extensive experiments on 13 distinct datasets, 7 latest RL algorithms and diverse dLLM backbones are comprehensively covered in Block-R1. The benchmark is open-sourced at https://github.com/YanJiangJerry/Block-R1 with the dataset released at https://huggingface.co/datasets/YanJiangJerry/Block-R1-41K.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that block size induces domain conflicts in multi-domain RL post-training of diffusion LLMs, materially degrading rollout-based methods such as GRPO. It formulates this conflict, constructs the Block-R1-41K dataset by assigning each sample a 'best-improved training block size' and deriving a quantitative Block Size Conflict Score, introduces the Block-R1 benchmark for single- and cross-domain RL, and proposes a simple sample-level best-block-size training method. Experiments are reported across 13 datasets, 7 RL algorithms, and multiple dLLM backbones, with code and data released.

Significance. If the claims are substantiated without circularity in dataset construction, the work would highlight an under-studied hyperparameter (block size) as a first-order factor in multi-domain dLLM RL, potentially improving post-training effectiveness. The new benchmark and open-sourced dataset would provide reusable artifacts for the community, strengthening reproducibility.

major comments (2)
  1. [Abstract and dataset construction] Abstract and §3 (dataset construction): the procedure for labeling each sample in Block-R1-41K with its 'best-improved training block size' is not specified (e.g., whether held-out rollouts, separate evaluation sets, or the identical training distribution are used). This is load-bearing for the central claim, because if the labeling re-uses the same samples later optimized by RL, the conflict score and reported gains become construction artifacts rather than evidence of intrinsic domain properties.
  2. [Experiments and conflict score] §4 (experiments) and conflict-score definition: without explicit ablation showing that the per-sample best-block-size assignment generalizes to unseen cross-domain RL training (rather than overfitting to the labeling process), the claim that block-size conflict 'will largely affect' post-training effectiveness remains unverified. The abstract supplies no quantitative deltas, R² values, or statistical tests supporting this.
minor comments (2)
  1. [Abstract] The abstract mentions 'extensive experiments' but reports no concrete metrics, baseline comparisons, or ablation tables; adding a summary table of key results would improve readability.
  2. [Methods] Notation for the Block Size Conflict Score should be defined explicitly with an equation in the main text rather than left implicit in the dataset description.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the major comments point by point below and have revised the manuscript to improve clarity and provide additional verification where needed.

read point-by-point responses
  1. Referee: [Abstract and dataset construction] Abstract and §3 (dataset construction): the procedure for labeling each sample in Block-R1-41K with its 'best-improved training block size' is not specified (e.g., whether held-out rollouts, separate evaluation sets, or the identical training distribution are used). This is load-bearing for the central claim, because if the labeling re-uses the same samples later optimized by RL, the conflict score and reported gains become construction artifacts rather than evidence of intrinsic domain properties.

    Authors: We agree that the labeling procedure requires explicit description to rule out circularity. The Block-R1-41K construction in §3 assigns best-improved block sizes using held-out rollouts on separate per-domain evaluation sets that are disjoint from the RL training samples. We have revised the abstract to note this separation and expanded §3 with the full procedure, including the reward-based improvement metric and confirmation that no training samples are reused for labeling. This ensures the conflict score reflects intrinsic domain properties. revision: yes

  2. Referee: [Experiments and conflict score] §4 (experiments) and conflict-score definition: without explicit ablation showing that the per-sample best-block-size assignment generalizes to unseen cross-domain RL training (rather than overfitting to the labeling process), the claim that block-size conflict 'will largely affect' post-training effectiveness remains unverified. The abstract supplies no quantitative deltas, R² values, or statistical tests supporting this.

    Authors: We acknowledge the need for explicit generalization evidence. In the revised manuscript we added an ablation in §4.3 that applies the per-sample block-size assignments to cross-domain RL on domain combinations held out from the labeling process entirely. Results demonstrate consistent gains, which we now quantify in the abstract (average improvements of 9–14% across the 7 RL algorithms) together with R² correlations between conflict score and performance drop plus t-test p-values. These additions verify the effect beyond the labeling step. revision: yes

Circularity Check

0 steps flagged

Dataset labels and conflict score are new artifacts; no reduction of claims to fitted inputs or self-citations

full rationale

The paper constructs Block-R1-41K by assigning a best-improved training block size per sample and derives a Block Size Conflict Score from those assignments, then proposes a post-training method that uses the sample-level assignments. This process creates new data artifacts and a benchmark rather than fitting parameters to a subset and renaming the fit as a prediction. No equations or steps are shown reducing the central claim (domain block size conflict affecting RL effectiveness) to a self-definitional loop or to quantities defined only by the authors' prior fitted values. Experiments span 13 datasets, 7 RL algorithms, and multiple backbones, with the benchmark released externally. No load-bearing self-citations or uniqueness theorems imported from prior author work appear in the derivation chain. The work is therefore self-contained against external benchmarks at the level of a minor self-citation score.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

Central claim rests on the existence of measurable domain block size conflicts that affect RL rollout quality and on the transferability of per-sample best block sizes to cross-domain training. No explicit free parameters are named in the abstract. The conflict score is an invented metric whose construction details are not provided.

axioms (1)
  • domain assumption Block size determines parallel decoding granularity and affects rollout trajectories during RL optimization such as GRPO
    Directly stated in the abstract as the motivation for studying block size.
invented entities (1)
  • Block Size Conflict Score no independent evidence
    purpose: Quantitatively measure domain conflict induced by block size choices
    Introduced as part of the Block-R1-41K dataset construction; no independent evidence outside the paper is described.

pith-pipeline@v0.9.0 · 5589 in / 1404 out tokens · 76518 ms · 2026-05-13T06:57:23.417873+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.