pith. machine review for the scientific record

arxiv: 2604.06834 · v1 · submitted 2026-04-08 · 💻 cs.CL · cs.AI

Recognition: 1 theorem link

· Lean Theorem

On the Step Length Confounding in LLM Reasoning Data Selection

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 18:56 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords step length confounding · LLM reasoning data · naturalness-based selection · average log probability · chain-of-thought · data debiasing · first token probability

The pith

Naturalness-based selection for LLM reasoning data favors longer steps over higher quality due to first-token effects.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that a widely used method for choosing high-quality reasoning examples for LLM training, which ranks samples by their average log probability under a model, actually prefers data with longer individual reasoning steps. This preference stems from the fact that each reasoning step typically begins with a token the model assigns low probability to, and those low probabilities matter less in the overall average when the step contains more tokens. The authors label this distortion step length confounding and introduce two corrections: one that simply omits the first token's probability when averaging, and another that uses regression to remove its statistical influence. Experiments with four different LLMs and five reasoning benchmarks show that data chosen after these corrections produces stronger downstream reasoning performance.
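The dilution mechanism described above can be illustrated with a toy calculation (the per-token log probabilities below are invented for illustration, not taken from the paper):

```python
def avg_logprob(logprobs):
    """Average token log probability of one reasoning step."""
    return sum(logprobs) / len(logprobs)

# Hypothetical steps: each opens with a low-probability first token
# (log p = -5), followed by ordinary tokens (log p = -1). The two
# steps are identical in per-token quality after the first token.
short_step = [-5.0] + [-1.0] * 4    # 5 tokens
long_step  = [-5.0] + [-1.0] * 19   # 20 tokens

print(avg_logprob(short_step))  # -1.8
print(avg_logprob(long_step))   # -1.2
```

The longer step scores higher purely because the first token's penalty is spread over more tokens, which is the confounding effect the paper names.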

Core claim

The authors establish that the standard naturalness metric for selecting high-quality chain-of-thought data in large language models is biased toward longer reasoning steps. This step length confounding arises because low-probability tokens at the start of each step have their negative impact diluted when steps contain more tokens. By proposing ASLEC-DROP and ASLEC-CASL to neutralize this effect, they show improved data quality for fine-tuning reasoning models.

What carries the argument

Step length confounding in average log probability scoring, where low-probability first tokens of reasoning steps lose influence as step length increases.
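The arithmetic behind this claim can be written out directly. For a step of $L$ tokens with per-token log probabilities $\log p_t$, the naturalness score decomposes as

```latex
\bar{s} \;=\; \frac{1}{L}\sum_{t=1}^{L}\log p_t
\;=\; \underbrace{\frac{\log p_1}{L}}_{\text{first-token term}}
\;+\; \frac{1}{L}\sum_{t=2}^{L}\log p_t
```

The first token's (typically low) log probability enters with weight $1/L$, so its penalty shrinks as the step grows; for large $L$ the score is dominated by the remaining tokens regardless of how unlikely the step opener was.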

If this is right

  • Corrected selection produces training sets that improve LLM performance on complex reasoning tasks.
  • Average log probability alone is insufficient for quality assessment when reasoning data is organized in discrete steps.
  • The two correction techniques apply consistently across multiple base models and evaluation suites.
  • Causal regression offers a general way to remove token-position biases in probability-based data filters.
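The two corrections can be sketched in a few lines. `aslec_drop` follows the stated recipe of omitting the first token; `residualize` is NOT the paper's exact ASLEC-CASL procedure (its regression details are not given here) but a minimal stand-in that fits score against the first-token weight $1/L$ by ordinary least squares and keeps the residuals:

```python
def aslec_drop(step_logprobs):
    """ASLEC-DROP: average log probability with the first token excluded."""
    body = step_logprobs[1:]
    return sum(body) / len(body)

def residualize(scores, lengths):
    """Regression-debiasing sketch (a stand-in for ASLEC-CASL): fit
    score ~ a + b * (1/length) by least squares and return residuals,
    removing the first-token dilution trend from the ranking signal."""
    xs = [1.0 / n for n in lengths]
    mx = sum(xs) / len(xs)
    my = sum(scores) / len(scores)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, scores)) / \
        sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return [y - (a + b * x) for x, y in zip(xs, scores)]

# Three steps of equal per-token quality but different lengths.
steps = [[-5.0] + [-1.0] * (n - 1) for n in (5, 10, 20)]
print([aslec_drop(s) for s in steps])  # each scores -1.0: length bias gone
```

On this toy data the raw averages (-1.8, -1.4, -1.2) rank steps by length alone, while both corrections score the three steps identically.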

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar length-based distortions may affect probability averaging in other step-wise or hierarchical generation tasks.
  • Future synthetic data pipelines for reasoning models should include routine checks for first-token or position effects.
  • The finding suggests that apparent naturalness in long chains can mask lower per-step quality.

Load-bearing premise

That removing the step-length effect via first-token dropping or causal regression selects genuinely higher-quality reasoning data, rather than merely different data whose downstream benefit may not extend beyond the reported experiments.

What would settle it

Train LLMs on data selected by the original naturalness score and by the corrected methods, then compare accuracy on held-out reasoning benchmarks; if the corrected selections do not outperform the original, the quality claim fails.

Figures

Figures reproduced from arXiv: 2604.06834 by Bing Wang, Chen Shen, Jieping Ye, Jun Zhang, Kaiyuan Liu, Rui Miao, Shaotian Yan, Sinan Fan, Xiaosong Yuan, Ximing Li.

Figure 1
Figure 1. Step length distribution of data samples. [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 3
Figure 3. Representative cases illustrating token-level log probabilities for varying step lengths. [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗
Figure 4
Figure 4. Step length distributions for data selected versus unselected by the two proposed variant methods. [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗
Figure 5
Figure 5. Relationship between response-level log prob… [PITH_FULL_IMAGE:figures/full_fig_p011_5.png] view at source ↗
Figure 6
Figure 6. Average log probability of tokens at different… [PITH_FULL_IMAGE:figures/full_fig_p012_6.png] view at source ↗
Figure 8
Figure 8. Data selection bias and step length distribution… [PITH_FULL_IMAGE:figures/full_fig_p014_8.png] view at source ↗
Figure 7
Figure 7. Step length distributions for selected and unselected data, and relationship between step-level log probability and step length on AceReason-1.1-SFT. view at source ↗
Figure 9
Figure 9. Convergence analysis. [PITH_FULL_IMAGE:figures/full_fig_p015_9.png] view at source ↗
read the original abstract

Large reasoning models have recently demonstrated strong performance on complex tasks that require long chain-of-thought reasoning, through supervised fine-tuning on large-scale and high-quality datasets. To construct such datasets, existing pipelines generate long reasoning data from more capable Large Language Models (LLMs) and apply manually heuristic or naturalness-based selection methods to filter high-quality samples. Despite the proven effectiveness of naturalness-based data selection, which ranks data by the average log probability assigned by LLMs, our analysis shows that, when applied to LLM reasoning datasets, it systematically prefers samples with longer reasoning steps (i.e., more tokens per step) rather than higher-quality ones, a phenomenon we term step length confounding. Through quantitative analysis, we attribute this phenomenon to low-probability first tokens in reasoning steps; longer steps dilute their influence, thereby inflating the average log probabilities. To address this issue, we propose two variant methods: ASLEC-DROP, which drops first-token probabilities when computing average log probability, and ASLEC-CASL, which applies a causal debiasing regression to remove the first tokens' confounding effect. Experiments across four LLMs and five evaluation benchmarks demonstrate the effectiveness of our approach in mitigating the step length confounding problem.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper claims that naturalness-based data selection via average log-probability on LLM reasoning datasets exhibits 'step length confounding,' systematically favoring samples with longer reasoning steps because low-probability first tokens are diluted in longer sequences. It attributes this mechanistically to first-token effects, introduces ASLEC-DROP (dropping first-token probabilities) and ASLEC-CASL (causal regression debiasing), and reports that both variants yield better downstream performance than the original metric across four LLMs and five benchmarks.

Significance. If the debiasing methods reliably recover higher-quality reasoning traces rather than merely length-adjusted ones, the work would meaningfully advance data curation practices for supervised fine-tuning of reasoning models. The mechanistic attribution of the confounding effect is a clear, actionable insight that could reduce reliance on heuristic filters and improve the efficiency of constructing long-CoT datasets.

major comments (1)
  1. [Abstract] Abstract and experimental validation: the central claim that ASLEC-DROP and ASLEC-CASL select 'higher-quality' data (rather than simply different data whose length distribution has been altered) rests exclusively on downstream benchmark gains. No orthogonal quality signal—such as human coherence ratings, step-level correctness verification, or error analysis independent of surface statistics—is reported to confirm that the newly selected samples are superior reasoning traces.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback on validation. We address the major comment point by point below, acknowledging where revisions are needed to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Abstract] Abstract and experimental validation: the central claim that ASLEC-DROP and ASLEC-CASL select 'higher-quality' data (rather than simply different data whose length distribution has been altered) rests exclusively on downstream benchmark gains. No orthogonal quality signal—such as human coherence ratings, step-level correctness verification, or error analysis independent of surface statistics—is reported to confirm that the newly selected samples are superior reasoning traces.

    Authors: We agree that the manuscript's evidence for improved data selection rests primarily on downstream benchmark improvements across four LLMs and five tasks, which directly measure the practical utility of the selected reasoning traces for supervised fine-tuning. This metric is standard in data curation literature because it evaluates end-to-end impact on model reasoning performance rather than proxy signals. However, the referee is correct that we do not provide independent orthogonal validation such as human ratings or step-level error analysis. In the revision we will (1) qualify the abstract and introduction to emphasize that ASLEC variants mitigate step-length confounding and yield data with better downstream utility, without overclaiming absolute 'higher quality'; (2) add a new subsection with quantitative comparison of step-length distributions, first-token probability statistics, and average reasoning-step coherence proxies (e.g., token entropy) between original and debiased selections to demonstrate the mechanistic effect; and (3) include a brief limitations paragraph noting the absence of human or manual verification and suggesting it as future work. These changes will make the claims more precise while preserving the core experimental results. revision: partial

Circularity Check

0 steps flagged

No significant circularity in derivation of step-length confounding

full rationale

The paper derives the step-length confounding effect directly from the arithmetic definition of average log-probability (sum of token log-probs divided by sequence length), showing mathematically how low-probability first tokens are diluted in longer steps. The proposed ASLEC-DROP and ASLEC-CASL variants are explicit, non-circular corrections (dropping or regressing the first-token term) whose claimed benefit is tested against external downstream benchmarks rather than by re-fitting or redefining quality within the same metric. No self-definitional loops, fitted inputs renamed as predictions, or load-bearing self-citations appear in the core chain; the analysis remains self-contained against observable probability distributions and held-out task performance.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No explicit free parameters, axioms, or invented entities are stated in the abstract; the causal regression implicitly involves fitted coefficients whose details are not provided.

pith-pipeline@v0.9.0 · 5534 in / 1040 out tokens · 41143 ms · 2026-05-10T18:56:22.603257+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Prefix Teach, Suffix Fade: Local Teachability Collapse in Strong-to-Weak On-Policy Distillation

    cs.CL 2026-05 unverdicted novelty 6.0

    Local teachability collapse in trajectory suffixes makes uniform dense supervision suboptimal in strong-to-weak OPD; truncating at BIC-style change points on teacher margin improves performance.

Reference graph

Works this paper leans on

6 extracted references · 6 canonical work pages · cited by 1 Pith paper · 1 internal anchor

  1. [1]

Reasoning with exploration: An entropy perspective. CoRR, abs/2506.14758. Ganqu Cui, Yuchen Zhang, Jiacheng Chen, Lifan Yuan, Zhi Wang, Yuxin Zuo, Haozhan Li, Yuchen Fan, Huayu Chen, Weize Chen, Zhiyuan Liu, Hao Peng, Lei Bai, Wanli Ouyang, Yu Cheng, Bowen Zhou, and Ning Ding. 2025. The entropy mechanism of reinforcement learning for reasoning language...

  2. [2]

    The Signal is in the Steps: Local Scoring for Reasoning Data Selection

Distilling reasoning into student LLMs: Local naturalness for selecting teacher data. CoRR, abs/2510.03988. Zhewei Kang, Xuandong Zhao, and Dawn Song. 2025. Scalable best-of-n selection for large language models via self-certainty. CoRR, abs/2502.18581. Zhiqiang Kou, Junyang Chen, Xin-Qiang Cai, Xiaobo Xia, Ming-Kun Xie, Dong-Dong Wu, Biao Liu, Yuheng J...

  3. [3]

    Deconstructing long chain-of-thought: A structured reasoning optimization framework for long cot distillation

Let’s verify step by step. In The Twelfth International Conference on Learning Representations. Kaiyuan Liu, Shaotian Yan, Rui Miao, Bing Wang, Chen Shen, Jun Zhang, and Jieping Ye. 2026. Where did this sentence come from? Tracing provenance in LLM reasoning distillation. In International Conference on Learning Representations. Kevin Lu and Thinking Mac...

  4. [4]

    s1: Simple test-time scaling

Scaling data-constrained language models. In Annual Conference on Neural Information Processing Systems. Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel J. Candès, and Tatsunori Hashimoto. 2025. s1: Simple test-time scaling. CoRR, abs/2501.19393. David Rein, Betty Li H...

  5. [5]

The best instruction-tuning data are those that fit

Differential fine-tuning large language models towards better diverse reasoning abilities. In International Conference on Learning Representations. Xiaosong Yuan, Chen Shen, Shaotian Yan, Xiaofeng Zhang, Liang Xie, Wenxiao Wang, Renchu Guan, Ying Wang, and Jieping Ye. 2024. Instance-adaptive zero-shot chain-of-thought prompting. In Advances in Neural Info...

  6. [6]

• gpt-oss-120b6 (Agarwal et al., 2025) improves inference speed by combining compact attention layers with linear attention layers, while activating only 5B parameters

    is one of the first to employ reinforce- ment learning to enhance long CoT reason- ing in LLMs, providing evidence that models obtained through distillation can still exhibit robust reasoning abilities. • gpt-oss-120b6 (Agarwal et al., 2025) improves inference speed by combining compact atten- tion layers with linear attention layers, while activating onl...