pith. machine review for the scientific record. sign in

arxiv: 2604.02007 · v2 · submitted 2026-04-02 · 💻 cs.LG

Recognition: no theorem link

Apriel-1.5-OpenReasoner: RL Post-Training for General-Purpose and Efficient Reasoning

Alexandre Drouin, David Vazquez, Ehsan Kamalloo, Rafael Pardinas

Authors on Pith no claims yet

Pith reviewed 2026-05-13 21:50 UTC · model grok-4.3

classification 💻 cs.LG
keywords RL post-trainingreasoning modelsmulti-domain RLVRadaptive domain samplinglength penaltychain-of-thought efficiencyopen-weight LLMs
0
0 comments X

The pith

RL post-training with adaptive domain sampling and a difficulty-aware length penalty trains a 15B model to improve on reasoning benchmarks while cutting token usage by 30-50%.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates a fully reproducible RL post-training recipe applied to a 15B open-weight base model across five domains using public datasets. It introduces an adaptive sampling method that keeps domain ratios stable even when rollout lengths differ, along with a length penalty that lengthens traces for hard problems and shortens them for easy ones. Trained under a fixed 16K token output limit, the resulting model generalizes to 32K tokens at inference time and beats the base model on AIME 2025, GPQA, MMLU-Pro, and LiveCodeBench. It also produces substantially shorter reasoning traces than the base while matching other strong open models at lower token cost. This matters because long chain-of-thought outputs raise inference latency and expense, limiting practical use of reasoning models.

Core claim

Apriel-1.5-OpenReasoner is obtained by applying multi-domain RLVR post-training to Apriel-Base with an adaptive domain sampling mechanism that preserves target domain ratios despite heterogeneous rollout dynamics and a difficulty-aware extension of the standard length penalty that requires no additional training overhead. Under a strict 16K-token output budget the model generalizes to 32K tokens at inference, improves over the base on AIME 2025, GPQA, MMLU-Pro and LiveCodeBench, and generates 30-50% shorter reasoning traces while matching the performance of other strong open-weight models of similar size at lower token cost.

What carries the argument

Adaptive domain sampling mechanism that preserves target domain ratios despite heterogeneous rollout dynamics, combined with a difficulty-aware extension of the length penalty.

If this is right

  • The trained model improves accuracy on mathematics, science, and code benchmarks while using fewer tokens per response.
  • A 16K training budget suffices for generalization to 32K-token inference contexts.
  • Reasoning traces become 30-50% shorter on average without loss of performance.
  • The approach reaches the performance level of other strong open models of similar size at reduced inference token cost.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same sampling and penalty controls could be ported to larger base models to reduce their inference cost without redesigning the reward model.
  • Because the recipe is fully reproducible with public data, other groups can test whether the efficiency gains hold when the base model changes.
  • The difficulty-aware penalty might extend to non-verifiable domains such as open-ended writing if suitable proxy difficulty signals are available.

Load-bearing premise

The adaptive domain sampling continues to preserve intended domain ratios even when rollout lengths and difficulties vary widely across domains, and the difficulty-aware length penalty produces the desired trace-length behavior without extra training cost.

What would settle it

Retraining the base model with the adaptive sampling and length penalty but observing no gain on AIME 2025 or GPQA and no 30% reduction in average trace length would falsify the central claim.

read the original abstract

Building general-purpose reasoning models using reinforcement learning with verifiable rewards (RLVR) across diverse domains has been widely adopted by frontier open-weight models. However, their training recipes and domain mixtures are often not disclosed. Joint optimization across domains poses significant challenges: domains vary widely in rollout length, problem difficulty and sample efficiency. Further, models with long chain-of-thought traces increase inference cost and latency, making efficiency critical for practical deployment. We present Apriel-1.5-OpenReasoner, trained with a fully reproducible multi-domain RL post-training recipe on Apriel-Base, a 15B-parameter open-weight LLM, across five domains using public datasets: mathematics, code generation, instruction following, logical puzzles and function calling. We introduce an adaptive domain sampling mechanism that preserves target domain ratios despite heterogeneous rollout dynamics, and a difficulty-aware extension of the standard length penalty that, with no additional training overhead, encourages longer reasoning for difficult problems and shorter traces for easy ones. Trained with a strict 16K-token output budget, Apriel-1.5-OpenReasoner generalizes to 32K tokens at inference and improves over Apriel-Base on AIME 2025, GPQA, MMLU-Pro, and LiveCodeBench while producing 30-50% shorter reasoning traces. It matches strong open-weight models of similar size at lower token cost, thereby pushing the Pareto frontier of accuracy versus token budget.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 0 minor

Summary. The manuscript presents Apriel-1.5-OpenReasoner, a 15B-parameter open-weight LLM obtained by RLVR post-training of Apriel-Base on five public-domain datasets (mathematics, code generation, instruction following, logical puzzles, function calling). It introduces an adaptive domain sampling mechanism claimed to preserve target domain ratios under heterogeneous rollout lengths and a difficulty-aware length penalty that encourages longer traces on hard problems and shorter ones on easy problems with no extra overhead. Trained under a strict 16K-token output budget, the model generalizes to 32K tokens at inference, improves over the base on AIME 2025, GPQA, MMLU-Pro and LiveCodeBench, and yields 30-50% shorter reasoning traces while matching strong open models at lower token cost.

Significance. If the reported gains and efficiency improvements are reproducible, the work would be significant for open-weight reasoning models by supplying a fully disclosed multi-domain RL post-training recipe that directly tackles rollout-length heterogeneity and inference-cost concerns, thereby advancing the accuracy-versus-token-budget Pareto frontier for models of this scale.

major comments (3)
  1. [Adaptive domain sampling mechanism] The central claim that the adaptive domain sampling mechanism preserves target domain ratios despite heterogeneous rollout dynamics (mathematics vs. code generation, etc.) is load-bearing for the multi-domain generalization results, yet the manuscript provides no explicit update rule, length-normalization step, or reweighting formula; without these details it is impossible to verify that domains with longer traces are not systematically under-sampled relative to the intended mixture.
  2. [Difficulty-aware length penalty] The difficulty-aware length penalty is presented as a zero-overhead extension of the standard penalty that modulates trace length by problem difficulty, but no equation, coefficient schedule, or pseudocode is supplied; this omission prevents assessment of whether the mechanism actually produces the claimed 30-50% shorter traces without degrading accuracy on hard problems.
  3. [Experimental results] Benchmark improvements are reported without error bars, number of runs, or ablation studies isolating the contribution of adaptive sampling versus the length penalty; given that RLVR results are known to be sensitive to unreported hyperparameters, the quantitative claims on AIME 2025, GPQA, MMLU-Pro and LiveCodeBench cannot be verified from the current experimental section.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We will revise the manuscript to supply the explicit formulations, update rules, and pseudocode for the adaptive domain sampling and difficulty-aware length penalty, as well as expand the experimental section with error bars, multiple runs, and ablations. These additions will directly address the reproducibility concerns while preserving the core contributions.

read point-by-point responses
  1. Referee: [Adaptive domain sampling mechanism] The central claim that the adaptive domain sampling mechanism preserves target domain ratios despite heterogeneous rollout dynamics (mathematics vs. code generation, etc.) is load-bearing for the multi-domain generalization results, yet the manuscript provides no explicit update rule, length-normalization step, or reweighting formula; without these details it is impossible to verify that domains with longer traces are not systematically under-sampled relative to the intended mixture.

    Authors: We agree the details were omitted and apologize for the oversight. The adaptive domain sampling maintains target ratios by computing per-domain average rollout lengths within each training batch and reweighting the sampling probabilities accordingly: the effective sampling probability p_d for domain d is set to target_ratio_d * (global_avg_length / avg_length_d), followed by renormalization across domains. This length-normalized reweighting prevents longer-trace domains from being under-sampled. The revised manuscript will include the full update rule, normalization step, and pseudocode for the batch-wise procedure. revision: yes

  2. Referee: [Difficulty-aware length penalty] The difficulty-aware length penalty is presented as a zero-overhead extension of the standard penalty that modulates trace length by problem difficulty, but no equation, coefficient schedule, or pseudocode is supplied; this omission prevents assessment of whether the mechanism actually produces the claimed 30-50% shorter traces without degrading accuracy on hard problems.

    Authors: We acknowledge the missing formulation. The difficulty-aware penalty extends the standard length penalty as L_diff = lambda * max(0, length - target_length(d)), where target_length(d) is a difficulty-scaled baseline (longer for hard problems, shorter for easy ones) derived from base-model accuracy on a held-out difficulty proxy. Lambda is scheduled linearly with estimated difficulty. This incurs no extra overhead beyond the standard penalty. The revision will add the exact equation, coefficient schedule, and pseudocode. revision: yes

  3. Referee: [Experimental results] Benchmark improvements are reported without error bars, number of runs, or ablation studies isolating the contribution of adaptive sampling versus the length penalty; given that RLVR results are known to be sensitive to unreported hyperparameters, the quantitative claims on AIME 2025, GPQA, MMLU-Pro and LiveCodeBench cannot be verified from the current experimental section.

    Authors: We agree that additional statistical rigor is needed. In the revised version we will report all benchmark numbers as means over three independent training runs with standard deviation error bars. We will also include a new ablation table that isolates the base model, adaptive sampling only, length penalty only, and the combined method, thereby quantifying each component's contribution to the reported gains on AIME 2025, GPQA, MMLU-Pro, and LiveCodeBench. revision: yes

Circularity Check

0 steps flagged

No circularity; claims rest on external benchmarks and public data

full rationale

The paper presents an RL post-training recipe using public datasets across five domains. The adaptive domain sampling and difficulty-aware length penalty are introduced as engineering mechanisms to handle rollout heterogeneity; their success is measured by downstream accuracy and token-length improvements on held-out external benchmarks (AIME 2025, GPQA, MMLU-Pro, LiveCodeBench). No equation or result is shown to be equivalent to its own fitted inputs by construction, and no load-bearing premise reduces to a self-citation chain. The derivation chain is therefore self-contained against independent evaluation.

Axiom & Free-Parameter Ledger

2 free parameters · 0 axioms · 0 invented entities

Training relies on standard RL assumptions for verifiable rewards and rollout dynamics; the adaptive sampler and length penalty introduce a small number of tunable hyperparameters whose exact values are not disclosed in the abstract.

free parameters (2)
  • target domain ratios
    Used by the adaptive sampling mechanism to maintain intended proportions across domains with different rollout lengths.
  • difficulty-aware length penalty coefficients
    Control how much longer reasoning is encouraged for hard problems versus short traces for easy ones.

pith-pipeline@v0.9.0 · 5574 in / 1193 out tokens · 40562 ms · 2026-05-13T21:50:04.396696+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

3 extracted references · 3 canonical work pages · 2 internal anchors

  1. [1]

    GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models

    doi: 10.18653/v1/2024.acl-long.662. Mislav Balunović, Jasper Dekoninck, Ivo Petrov, Nikola Jovanović, and Martin Vechev. MathArena: Evaluating llms on uncontaminated math competitions, February 2025. URLhttps://matharena.ai/. Xingyu Chen, Jiahao Xu, Tian Liang, Zhiwei He, Jianhui Pang, Dian Yu, Linfeng Song, Qiuzhi Liu, Mengfei Zhou, Zhuosheng Zhang, et a...

  2. [2]

    doi: 10.18653/v1/2025.findings-acl.1274

    Association for Computational Linguistics. doi: 10.18653/v1/2025.findings-acl.1274. 10 Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arber, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset.Advances in Neural Information Processing Systems, 34:5074–5085, 2021. Jingcheng Hu, Yinmin ...

  3. [3]

    Olmo 3

    doi: 10.18653/v1/2025.emnlp-main.1025. Olmo Team, Allyson Ettinger, Amanda Bertsch, Bailey Kuehl, David Graham, David Heineman, Dirk Groeneveld, Faeze Brahman, Finbarr Timbers, Hamish Ivison, et al. Olmo 3.arXiv preprint arXiv:2512.13961, 2025. Shishir G Patil, Huanzhi Mao, Fanjia Yan, Charlie Cheng-Jie Ji, Vishnu Suresh, Ion Stoica, and Joseph E. Gonzale...