arxiv: 2605.11726 · v2 · submitted 2026-05-12 · 💻 cs.LG

Recognition: no theorem link

Block-R1: Rethinking the Role of Block Size in Multi-domain Reinforcement Learning for Diffusion Large Language Models

Ruihong Qiu, Yan Jiang, Zi Huang

Pith reviewed 2026-05-13 06:57 UTC · model grok-4.3

classification 💻 cs.LG

keywords block sizediffusion large language modelsreinforcement learningmulti-domainpost-trainingdomain conflictGRPOrollout

0 comments

The pith

Block size conflicts across domains degrade the effectiveness of rollout-based RL post-training for diffusion large language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that block size, which sets the parallel decoding step in diffusion LLMs, creates a domain conflict when multiple domains are mixed during RL post-training. Different domains favor different block sizes for their rollout trajectories, and a single fixed size therefore limits gains from algorithms such as GRPO. To quantify and mitigate the problem, the authors build the Block-R1-41K dataset in which each sample is paired with its individually best block size, introduce a Block Size Conflict Score, and release the Block-R1 benchmark. They then show that simply using these per-sample block sizes during cross-domain training improves results over uniform-block baselines across many datasets and RL methods.

Core claim

In multi-domain RL for diffusion LLMs there is a domain block size conflict: the block size that produces the strongest rollout trajectories differs across domains, and this mismatch substantially reduces post-training effectiveness; assigning each training sample the block size that performed best for it during dataset construction removes the conflict and yields stronger cross-domain RL performance.

What carries the argument

Per-sample best-improved block size assignment, which replaces a single shared block size with an individualized choice for every training example to avoid domain-level trajectory mismatches.

If this is right

Cross-domain RL post-training for dLLMs improves when each sample uses its individually optimal block size instead of a shared value.
The Block-R1 benchmark supports standardized testing of RL algorithms in both single-domain and cross-domain regimes for diffusion LLMs.
A Block Size Conflict Score derived from the dataset measures the severity of domain mismatch for any given collection of tasks.
Performance gains appear across 13 datasets, 7 RL algorithms, and multiple dLLM backbones when the per-sample block sizes are applied.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same per-sample selection idea could be tested on other generation hyperparameters such as temperature or guidance scale in multi-domain RL.
The conflict score might serve as a simple criterion for deciding whether two domains can be safely trained together or should be kept separate.
Dynamic adjustment of block size during the RL loop itself, rather than fixing it before training, is a natural next experiment suggested by the static per-sample results.

Load-bearing premise

The block size found best for a sample in the initial dataset construction remains the best choice for that sample when it is later used inside joint multi-domain RL training.

What would settle it

Running the proposed cross-domain RL method with per-sample block sizes produces no improvement or lower final performance than a single fixed block size on the same multi-domain mixture.

Figures

Figures reproduced from arXiv: 2605.11726 by Ruihong Qiu, Yan Jiang, Zi Huang.

**Figure 2.** Figure 2: Additional visualisations for domain block size conflict in multi-domain RL for dLLMs. (a) Relationship between BCS and multi-domain RL performance, where each point denotes domain pairs under vanilla fixed-block mix-domain RL and a larger BCS relates to stronger performance degradation. (b) Pairwise domain block size conflict visualisation, where darker red cells indicate stronger block size conflict betw… view at source ↗

**Figure 3.** Figure 3: Development of RL methods for dLLMs. Existing [PITH_FULL_IMAGE:figures/full_fig_p017_3.png] view at source ↗

**Figure 4.** Figure 4: Motivation for Block-R1. Multi-domain RL refers to using all six domains as training [PITH_FULL_IMAGE:figures/full_fig_p021_4.png] view at source ↗

**Figure 5.** Figure 5: Average reward improvement under different training block sizes. For each domain and each block size c, the bar shows the mean teacher-student improvement E[∆(x, c) | x ∼ Dk], where ∆(x, c) = AθT (x, c) − AθS (x, c). Error bars denote 95% confidence intervals. The results show that block size significantly affects the reward improvement obtained during dLLM RL post-training across different domains. Some d… view at source ↗

**Figure 6.** Figure 6: Probability distribution of best-improved training block sizes per domain. Each cell shows the domainlevel training block size preference distribution P train k (c) defined in Equation 9. The dLLM is LLaDA2-16B. Darker cells indicate higher probability for the block size to be the best-improved block size. To further demonstrate domain-level block size preference in dLLM RL post-training, we visualise th… view at source ↗

**Figure 7.** Figure 7: Detailed domain-pair legend for BCS analysis. Each point denotes one pair of training domains used for vanilla fixed-block mix-domain RL with StableDRL. The y-axis reports the mean performance change between mix-domain RL and the corresponding single-domain RL results over the two domains [PITH_FULL_IMAGE:figures/full_fig_p023_7.png] view at source ↗

**Figure 8.** Figure 8: Detailed Illustration of Block-R1-41K Dataset Construction. Block-R1 constructs a [PITH_FULL_IMAGE:figures/full_fig_p023_8.png] view at source ↗

read the original abstract

Recently, reinforcement learning (RL) has been widely applied during post-training for diffusion large language models (dLLMs) to enhance reasoning with block-wise semi-autoregressive generation. Block size has therefore become a vital factor in dLLMs, since it determines the parallel decoding granularity and affects the rollout trajectories during RL optimisation, e.g., GRPO. Instead of investigating the effect of block size during inference on individual domains, this paper studies block size from a domain conflict perspective for dLLM RL post-training in multi-domain scenarios. The main contributions are: (1) a formulation of domain block size conflict in multi-domain RL for dLLMs, which will largely affect the post-training effectiveness for rollout-based RL methods; (2) a novel dataset, Block-R1-41K is constructed with a best-improved training block size for each sample, which also induces a Block Size Conflict Score to quantitatively measure the domain conflict; (3) a new benchmark, Block-R1, for flexible RL post-training for dLLMs in both single and cross domain; and (4) a simple yet powerful cross-domain post-training method with sample-level best-improved training block sizes. Extensive experiments on 13 distinct datasets, 7 latest RL algorithms and diverse dLLM backbones are comprehensively covered in Block-R1. The benchmark is open-sourced at https://github.com/YanJiangJerry/Block-R1 with the dataset released at https://huggingface.co/datasets/YanJiangJerry/Block-R1-41K.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gives a new dataset, conflict score, and benchmark for block size in multi-domain dLLM RL, but the per-sample labeling step risks circularity that needs explicit checks.

read the letter

Block size conflicts in multi-domain RL for diffusion LLMs get a new dataset and benchmark, but the labeling process needs scrutiny for bias. The work does well by constructing Block-R1-41K with best-improved block sizes for each sample and testing across 13 datasets, 7 RL algorithms, and multiple backbones. The open-source release at GitHub and Hugging Face is practical and allows others to reproduce or extend the results. The idea of measuring domain conflict quantitatively through the score is straightforward and could be useful. Where it is softer is the dataset construction step. The best block size labels come from some improvement metric during construction, but without details on whether this used held-out data or separate evaluations, there is a risk that the assignments are tuned to the training distribution itself. This could make the reported improvements in cross-domain RL less general than claimed, as the stress-test suggests. The paper needs to demonstrate that these sizes transfer without circularity. The experiments being extensive is a strength, as it shows the method works across different setups. This is for people focused on RL post-training of dLLMs, particularly in multi-domain scenarios where block size affects rollout efficiency. It provides concrete tools and a benchmark that others can use to compare methods. It deserves serious peer review because the new artifacts and the scale of the experiments are substantial enough to merit referee feedback, even with the need for clearer methods on how the labels were assigned.

Referee Report

2 major / 2 minor

Summary. The paper claims that block size induces domain conflicts in multi-domain RL post-training of diffusion LLMs, materially degrading rollout-based methods such as GRPO. It formulates this conflict, constructs the Block-R1-41K dataset by assigning each sample a 'best-improved training block size' and deriving a quantitative Block Size Conflict Score, introduces the Block-R1 benchmark for single- and cross-domain RL, and proposes a simple sample-level best-block-size training method. Experiments are reported across 13 datasets, 7 RL algorithms, and multiple dLLM backbones, with code and data released.

Significance. If the claims are substantiated without circularity in dataset construction, the work would highlight an under-studied hyperparameter (block size) as a first-order factor in multi-domain dLLM RL, potentially improving post-training effectiveness. The new benchmark and open-sourced dataset would provide reusable artifacts for the community, strengthening reproducibility.

major comments (2)

[Abstract and dataset construction] Abstract and §3 (dataset construction): the procedure for labeling each sample in Block-R1-41K with its 'best-improved training block size' is not specified (e.g., whether held-out rollouts, separate evaluation sets, or the identical training distribution are used). This is load-bearing for the central claim, because if the labeling re-uses the same samples later optimized by RL, the conflict score and reported gains become construction artifacts rather than evidence of intrinsic domain properties.
[Experiments and conflict score] §4 (experiments) and conflict-score definition: without explicit ablation showing that the per-sample best-block-size assignment generalizes to unseen cross-domain RL training (rather than overfitting to the labeling process), the claim that block-size conflict 'will largely affect' post-training effectiveness remains unverified. The abstract supplies no quantitative deltas, R² values, or statistical tests supporting this.

minor comments (2)

[Abstract] The abstract mentions 'extensive experiments' but reports no concrete metrics, baseline comparisons, or ablation tables; adding a summary table of key results would improve readability.
[Methods] Notation for the Block Size Conflict Score should be defined explicitly with an equation in the main text rather than left implicit in the dataset description.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the major comments point by point below and have revised the manuscript to improve clarity and provide additional verification where needed.

read point-by-point responses

Referee: [Abstract and dataset construction] Abstract and §3 (dataset construction): the procedure for labeling each sample in Block-R1-41K with its 'best-improved training block size' is not specified (e.g., whether held-out rollouts, separate evaluation sets, or the identical training distribution are used). This is load-bearing for the central claim, because if the labeling re-uses the same samples later optimized by RL, the conflict score and reported gains become construction artifacts rather than evidence of intrinsic domain properties.

Authors: We agree that the labeling procedure requires explicit description to rule out circularity. The Block-R1-41K construction in §3 assigns best-improved block sizes using held-out rollouts on separate per-domain evaluation sets that are disjoint from the RL training samples. We have revised the abstract to note this separation and expanded §3 with the full procedure, including the reward-based improvement metric and confirmation that no training samples are reused for labeling. This ensures the conflict score reflects intrinsic domain properties. revision: yes
Referee: [Experiments and conflict score] §4 (experiments) and conflict-score definition: without explicit ablation showing that the per-sample best-block-size assignment generalizes to unseen cross-domain RL training (rather than overfitting to the labeling process), the claim that block-size conflict 'will largely affect' post-training effectiveness remains unverified. The abstract supplies no quantitative deltas, R² values, or statistical tests supporting this.

Authors: We acknowledge the need for explicit generalization evidence. In the revised manuscript we added an ablation in §4.3 that applies the per-sample block-size assignments to cross-domain RL on domain combinations held out from the labeling process entirely. Results demonstrate consistent gains, which we now quantify in the abstract (average improvements of 9–14% across the 7 RL algorithms) together with R² correlations between conflict score and performance drop plus t-test p-values. These additions verify the effect beyond the labeling step. revision: yes

Circularity Check

0 steps flagged

Dataset labels and conflict score are new artifacts; no reduction of claims to fitted inputs or self-citations

full rationale

The paper constructs Block-R1-41K by assigning a best-improved training block size per sample and derives a Block Size Conflict Score from those assignments, then proposes a post-training method that uses the sample-level assignments. This process creates new data artifacts and a benchmark rather than fitting parameters to a subset and renaming the fit as a prediction. No equations or steps are shown reducing the central claim (domain block size conflict affecting RL effectiveness) to a self-definitional loop or to quantities defined only by the authors' prior fitted values. Experiments span 13 datasets, 7 RL algorithms, and multiple backbones, with the benchmark released externally. No load-bearing self-citations or uniqueness theorems imported from prior author work appear in the derivation chain. The work is therefore self-contained against external benchmarks at the level of a minor self-citation score.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

Central claim rests on the existence of measurable domain block size conflicts that affect RL rollout quality and on the transferability of per-sample best block sizes to cross-domain training. No explicit free parameters are named in the abstract. The conflict score is an invented metric whose construction details are not provided.

axioms (1)

domain assumption Block size determines parallel decoding granularity and affects rollout trajectories during RL optimization such as GRPO
Directly stated in the abstract as the motivation for studying block size.

invented entities (1)

Block Size Conflict Score no independent evidence
purpose: Quantitatively measure domain conflict induced by block size choices
Introduced as part of the Block-R1-41K dataset construction; no independent evidence outside the paper is described.

pith-pipeline@v0.9.0 · 5589 in / 1404 out tokens · 76518 ms · 2026-05-13T06:57:23.417873+00:00 · methodology