pith. machine review for the scientific record.

arxiv: 2605.04077 · v1 · submitted 2026-04-14 · 💻 cs.LG · cs.AI · cs.CL

Recognition: unknown

Balanced Aggregation: Understanding and Fixing Aggregation Bias in GRPO

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 14:49 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL
keywords GRPO · aggregation bias · reinforcement learning with verifiable rewards · policy gradient · LLM reasoning · token aggregation · sequence aggregation

The pith

Balanced Aggregation fixes GRPO's aggregation bias by averaging token gradients separately within each group's positive and negative subsets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper identifies two distinct biases that arise when aggregating token-level policy gradients inside each GRPO group: token aggregation ties gradient sign to response length, while sequence aggregation gives every sequence equal weight and thereby downweights longer answers. Balanced Aggregation corrects both by first computing separate token-level means inside the positive and negative subsets, then weighting those means by the number of sequences in each subset. Experiments with Qwen2.5-Math-7B and Qwen3-1.7B on DAPO-17k and Polaris data show that this change yields more stable training and higher scores on six reasoning and coding benchmarks than either standard rule. A reader should care because the aggregation step is a free design choice that directly shapes the optimization signal in reward-driven language-model training.

Core claim

Token aggregation introduces sign-length coupling while sequence aggregation implicitly downweights longer responses through sequence-level equal weighting; Balanced Aggregation removes both effects by computing token-level means separately within the positive and negative subsets and then combining them with sequence-count-based weights.

What carries the argument

Balanced Aggregation (BA), which computes token-level means separately within the positive and negative subsets and combines them with sequence-count-based weights.
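To make the contrast concrete, here is a minimal Python sketch of the three aggregation rules, assuming per-token policy-gradient loss terms (e.g. advantage-weighted log-probabilities or clipped surrogate terms) have already been computed for each sequence in the group. All names are illustrative assumptions, not the paper's code.

```python
# Minimal sketch: `losses` is a list of 1-D arrays, one per sequence in the
# group, holding precomputed per-token loss terms; `advantages` holds one
# group-normalized advantage per sequence (its sign marks the subset under
# binary rewards). Names are illustrative, not from the paper's code.
import numpy as np

def token_aggregation(losses):
    # Mean over all tokens in the group: longer responses contribute more
    # terms, coupling the aggregate's sign and magnitude to response length.
    return np.concatenate(losses).mean()

def sequence_aggregation(losses):
    # Mean of per-sequence means (standard GRPO): each sequence gets equal
    # weight, so every token of a long response is scaled down by 1/T_i.
    return np.mean([seq.mean() for seq in losses])

def balanced_aggregation(losses, advantages):
    # Token-level means taken separately over the positive and negative
    # subsets, then recombined with sequence-count weights k/G and (G-k)/G.
    G = len(losses)
    pos = [l for l, a in zip(losses, advantages) if a > 0]
    neg = [l for l, a in zip(losses, advantages) if a <= 0]
    k = len(pos)
    out = 0.0
    if pos:
        out += (k / G) * np.concatenate(pos).mean()
    if neg:
        out += ((G - k) / G) * np.concatenate(neg).mean()
    return out
```

One consequence worth noting: when every response in the group shares one reward sign, this BA reduces to plain token aggregation, since the lone subset receives weight G/G = 1.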

If this is right

  • Training stability improves across runs that differ in response-length distribution.
  • Final performance rises on reasoning and coding benchmarks relative to both token and sequence aggregation baselines.
  • The relative advantage of token versus sequence aggregation is governed by the amount of response-length variation and the length gap between positive and negative examples.
  • Aggregation rule becomes an explicit, tunable design dimension in GRPO-style RLVR.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same separation principle could be tested in other group-relative RL algorithms that currently use uniform aggregation.
  • Controlling or stratifying training data by positive-negative length gap would let practitioners predict when BA yields the largest lift.
  • Future RLVR papers should report length statistics for positive and negative groups to make aggregation effects comparable across studies.

Load-bearing premise

The observed performance gains are caused by removal of the identified aggregation biases rather than by incidental changes in gradient scale or effective learning rate.

What would settle it

An experiment that equalizes gradient norms across Balanced Aggregation and the two standard rules, while preserving the positive-negative separation, would be decisive: if bias removal is not the operative mechanism, BA should retain no advantage once update scale is matched.
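A minimal PyTorch sketch of such a norm-matched control, assuming the BA loss has already been backpropagated and a reference gradient norm has been measured under a baseline rule; the function name and the matching protocol are illustrative assumptions, not the paper's procedure.

```python
# Norm-matched control: rescale the current gradient so its global L2 norm
# matches a reference norm measured under a baseline aggregation rule. This
# keeps BA's update *direction* while removing any advantage that comes
# purely from a different effective step size.
import torch

def rescale_grad_to_reference(params, reference_norm, eps=1e-12):
    grads = [p.grad for p in params if p.grad is not None]
    current = torch.sqrt(sum((g ** 2).sum() for g in grads))
    for g in grads:
        g.mul_(reference_norm / (current + eps))
    return current  # pre-rescaling norm, useful for per-method logging
```

If BA's advantage persists under this matching, the scale-based explanation loses force; if it vanishes, the load-bearing premise above fails.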

Figures

Figures reproduced from arXiv: 2605.04077 by Bingrui Li, Ge Zhang, Jiameng Huang, Jiashuo Liu, Wenhao Huang, Xipeng Qiu, Yining Zheng, Yuhao Wu, Zhangyue Yin, Zhiyuan Zeng, Ziniu Li.

Figure 1: Comparison of peak and last-step accuracy across evaluation benchmarks. For each … [figure image: figures/full_fig_p008_1.png]
Figure 2: Training dynamics of the policy-gradient loss for Qwen2.5-Math-7B on DAPO-17k and … [figure image: figures/full_fig_p009_2.png]
Figure 3: Length-distribution statistics related to the relative behavior of token and sequence aggregation. [figure image: figures/full_fig_p010_3.png]
Original abstract

Reinforcement learning with verifiable rewards (RLVR) has become a central paradigm for improving reasoning and code generation in large language models, and GRPO-style training is widely adopted for its simplicity and effectiveness. However, an important design choice remains underexplored: how token-level policy gradient terms are aggregated within each sampled group. Standard GRPO uses sequence aggregation, while recent work has advocated token aggregation as a better alternative. We show that these two rules induce different optimization biases: token aggregation introduces sign-length coupling, while sequence aggregation implicitly downweights longer responses through sequence-level equal weighting. To address this tension, we propose Balanced Aggregation (BA), a simple drop-in replacement that computes token-level means separately within the positive and negative subsets and then combines them with sequence-count-based weights. Experiments with Qwen2.5-Math-7B and Qwen3-1.7B on DAPO-17k and Polaris, evaluated on six reasoning and coding benchmarks, show that BA consistently improves training stability and final performance over standard token and sequence aggregation. Our analysis further shows that the relative effectiveness of token and sequence aggregation is largely governed by response-length variation and the positive-negative length gap, highlighting aggregation as a critical design dimension in GRPO-style RLVR.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper identifies aggregation biases in GRPO-style RLVR: token aggregation induces sign-length coupling while sequence aggregation downweights longer responses via equal weighting. It proposes Balanced Aggregation (BA) as a drop-in fix that computes separate token-level means over positive and negative subsets then reweights by sequence counts. Experiments on Qwen2.5-Math-7B and Qwen3-1.7B trained on DAPO-17k and Polaris, evaluated across six reasoning and coding benchmarks, report that BA yields improved training stability and final performance relative to standard token and sequence aggregation.

Significance. If the gains can be attributed specifically to bias removal, the work would usefully spotlight an under-examined design choice in policy-gradient aggregation for verifiable-reward RL. The proposal is simple and the evaluation spans two model scales and two datasets, which strengthens the practical case. The analysis tying relative effectiveness to response-length statistics is a constructive contribution. However, without controls that isolate mechanism from scale, the significance remains provisional.

major comments (2)
  1. [Experiments] Experiments section: the direct comparisons of BA against token and sequence aggregation do not include gradient-norm matching, learning-rate rescaling, or scale-controlled ablations. Different aggregation operators alter the magnitude of the summed policy gradient, so the reported stability and benchmark gains cannot be confidently attributed to removal of sign-length coupling or length downweighting rather than incidental changes in effective update scale.
  2. [Results] Results: no error bars, statistical significance tests, or quantitative effect sizes are reported for the claimed consistent improvements in stability and performance. This weakens the ability to assess reliability of the central empirical claim.
minor comments (1)
  1. [Abstract] Abstract: the statement that BA 'consistently improves' would be more informative if accompanied by concrete metrics or relative gains.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which helps clarify the attribution of our empirical results. We address each major comment below and have revised the manuscript to incorporate additional controls and statistical reporting.

Point-by-point responses
  1. Referee: [Experiments] Experiments section: the direct comparisons of BA against token and sequence aggregation do not include gradient-norm matching, learning-rate rescaling, or scale-controlled ablations. Different aggregation operators alter the magnitude of the summed policy gradient, so the reported stability and benchmark gains cannot be confidently attributed to removal of sign-length coupling or length downweighting rather than incidental changes in effective update scale.

    Authors: We agree that isolating the effect of bias removal from incidental changes in gradient magnitude is important for causal attribution. In the revised manuscript, we have added scale-controlled ablations: we normalize the summed policy gradient norms to match across the three aggregation methods and rescale the learning rate to preserve comparable update magnitudes. These experiments, now included in the Experiments section, show that Balanced Aggregation retains its advantages in training stability and final benchmark performance. We have also added a brief analysis of per-method gradient norms to the paper. revision: yes

  2. Referee: [Results] Results: no error bars, statistical significance tests, or quantitative effect sizes are reported for the claimed consistent improvements in stability and performance. This weakens the ability to assess reliability of the central empirical claim.

    Authors: We acknowledge that the lack of variability measures and formal statistical tests limits the strength of the empirical claims. We have rerun the main experiments across multiple random seeds, added standard-deviation error bars to all benchmark tables and training curves, and included paired t-test p-values together with Cohen's d effect sizes for the reported improvements. These additions appear in the revised Results section and the associated figures. revision: yes

Circularity Check

0 steps flagged

No circularity: derivation and proposal are self-contained algorithmic definitions

full rationale

The paper derives the two identified biases directly from the explicit definitions of token aggregation (sign-length coupling via per-token means) and sequence aggregation (length downweighting via equal sequence weighting). BA is introduced as an independent operator that splits positive/negative subsets and reweights by sequence counts; this is not equivalent to either baseline by construction. No equations reduce a 'prediction' to a fitted parameter, no uniqueness theorem is imported, and no self-citation chain carries the central claim. Experiments provide external empirical validation on held-out benchmarks rather than internal consistency checks. The derivation chain therefore remains non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that the performance difference is attributable to the removal of length-related bias rather than to changes in gradient magnitude or effective batch size; no free parameters are introduced beyond the existing GRPO formulation.

axioms (1)
  • domain assumption: The sign of the advantage for a token is independent of response length at the desired optimum.
    Implicit in the claim that sign-length coupling is an unwanted bias.

pith-pipeline@v0.9.0 · 5564 in / 1371 out tokens · 37381 ms · 2026-05-10T14:49:54.735502+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

29 extracted references · 21 canonical work pages · 13 internal anchors

  1. [1]

    DeepSeek-AI, Aixin Liu, Aoxue Mei, Bangcai Lin, Bing Xue, Bingxuan Wang, Bingzheng Xu, Bochao Wu, Bowei Zhang, Chaofan Lin, Chen Dong, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenhao Xu, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Erhang Li, Fangqi Zhou, Fangyun Lin, Fucong Dai, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Hanwei Xu, Ha...

  2. [2]

    Polaris: A post-training recipe for scaling reinforcement learning on advanced reasoning models, 2025

    HKU NLP Group, ByteDance Seed, and Fudan University. Polaris: A post-training recipe for scaling reinforcement learning on advanced reasoning models, 2025. URL https://hkunlp.github.io/blog/2025/Polaris/

  3. [3]

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chong Ruan, Damai Dai, Deli Chen, Dongjie Ji, ...

  4. [4]

    Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems

    Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Leng Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, Jie Liu, Lei Qi, Zhiyuan Liu, and Maosong Sun. Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems. In Annual Meeting of the Association for Computational Linguistics, 2024

  5. [5]

    LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

    Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida I. Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code, 2024. URL https://arxiv.org/abs/2403.07974

  6. [6]

    The Art of Scaling Reinforcement Learning Compute for LLMs

    Devvrit Khatri, Lovish Madaan, Rishabh Tiwari, Rachit Bansal, Sai Surya Duvvuri, Manzil Zaheer, Inderjit S. Dhillon, David Brandfonbrener, and Rishabh Agarwal. The art of scaling reinforcement learning compute for llms, 2025. URL https://arxiv.org/abs/2510.13786

  7. [7]

    Solving Quantitative Reasoning Problems with Language Models

    Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Venkatesh Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, Yuhuai Wu, Behnam Neyshabur, Guy Gur-Ari, and Vedant Misra. Solving quantitative reasoning problems with language models, 2022. URL https://arxiv.org/abs/2206.14858

  8. [8]

    Let's Verify Step by Step

    Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step, 2023. URL https://arxiv.org/abs/2305.20050

  9. [9]

    When speed kills stability: Demystifying RL collapse from the training-inference mismatch, sep 2025

    Jiacai Liu, Yingru Li, Yuqian Fu, Jiawei Wang, Qian Liu, and Yu Shen. When speed kills stability: Demystifying RL collapse from the training-inference mismatch, sep 2025. URL https://richardli.xyz/rl-collapse

  10. [10]

    Understanding R1-Zero-Like Training: A Critical Perspective

    Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding r1-zero-like training: A critical perspective, 2025. URL https://arxiv.org/abs/2503.20783

  11. [11]

    Part I: Tricks or Traps? A Deep Dive into RL for LLM Reasoning, 2025

    Zihe Liu, Jiashun Liu, Yancheng He, Weixun Wang, Jiaheng Liu, Ling Pan, Xinyu Hu, Shaopan Xiong, Ju Huang, Jian Hu, Shengyi Huang, Johan Obando-Ceron, Siran Yang, Jiamang Wang, Wenbo Su, and Bo Zheng. Part i: Tricks or traps? a deep dive into rl for llm reasoning, 2025. URL https://arxiv.org/abs/2508.08221

  12. [12]

    Stabilizing MoE Reinforcement Learning by Aligning Training and Inference Routers

    Wenhan Ma, Hailin Zhang, Liang Zhao, Yifan Song, Yudong Wang, Zhifang Sui, and Fuli Luo. Stabilizing moe reinforcement learning by aligning training and inference routers, 2025. URL https://arxiv.org/abs/2510.11370

  13. [13]

    MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention

    MiniMax, Aili Chen, Aonian Li, Bangwei Gong, Binyang Jiang, Bo Fei, Bo Yang, Boji Shan, Changqing Yu, Chao Wang, Cheng Zhu, Chengjun Xiao, Chengyu Du, Chi Zhang, Chu Qiao, Chunhao Zhang, Chunhui Du, Congchao Guo, Da Chen, Deming Ding, Dianjun Sun, Dong Li, Enwei Jiao, Haigang Zhou, Haimo Zhang, Han Ding, Haohai Sun, Haoyu Feng, Huaiguang Cai, Haicha...

  14. [14]

    OpenAI, Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, Alex Iftimie, Alex Karpenko, Alex Tachard Passos, Alexander Neitz, Alexander Prokofiev, Alexander Wei, Allison Tam, Ally Bennett, Ananya Kumar, Andre Saraiva, Andrea Vallone, Andrew Duberstein, Andrew Kondri...

  15. [15]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms, 2017. URL https://arxiv.org/abs/1707.06347

  16. [16]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, Y. K. Li, Y. Wu, Daya Guo, Dejian Yang, and Ruoyu Zhang. Deepseekmath: Pushing the limits of mathematical reasoning in open language models, 2024. URL https://arxiv.org/abs/2402.03300

  17. [17]

    OpenCompass: A Universal Evaluation Platform for Foundation Models

    OpenCompass Team. Opencompass: A universal evaluation platform for foundation models,

  18. [18]

    URL https://arxiv.org/abs/2410.16256

  19. [19]

    ASPO: Asymmetric Importance Sampling Policy Optimization

    Jiakang Wang, Runze Liu, Lei Lin, Wenping Hu, Xiu Li, Fuzheng Zhang, Guorui Zhou, and Kun Gai. Aspo: Asymmetric importance sampling policy optimization, 2025. URL https://arxiv.org/abs/2510.06062

  20. [20]

    Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement

    An Yang, Beichen Zhang, Binyuan Hui, Bofei Gao, Bowen Yu, Chengpeng Li, Dayiheng Liu, Jianhong Tu, Jingren Zhou, Keming Lu, Mingfeng Xue, Runji Lin, Tianyu Liu, Xingzhang Ren, and Zhenru Zhang. Qwen2.5-math technical report: Toward mathematical expert model via self-improvement, 2024. URL https://arxiv.org/abs/2409.12122

  21. [21]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

  22. [22]

    Your efficient rl framework secretly brings you off-policy rl training, aug 2025

    Feng Yao, Liyuan Liu, Dinghuai Zhang, Chengyu Dong, Jingbo Shang, and Jianfeng Gao. Your efficient rl framework secretly brings you off-policy rl training, aug 2025. URL https://fengyao.notion.site/off-policy-rl

  23. [23]

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Yuxuan Song, Xiangpeng Wei, Hao Zhou, Jingjing Liu, W...

  24. [24]

    Scaling of search and learning: A roadmap to reproduce o1 from reinforcement learning perspective, 2024

    Zhiyuan Zeng, Qinyuan Cheng, Zhangyue Yin, Bo Wang, Shimin Li, Yunhua Zhou, Qipeng Guo, Xuanjing Huang, and Xipeng Qiu. Scaling of search and learning: A roadmap to reproduce o1 from reinforcement learning perspective, 2024. URL https://arxiv.org/abs/2412.14135

  25. [25]

    Geometric-Mean Policy Optimization

    Yuzhong Zhao, Yue Liu, Junpeng Liu, Jingye Chen, Xun Wu, Yaru Hao, Tengchao Lv, Shaohan Huang, Lei Cui, Qixiang Ye, Fang Wan, and Furu Wei. Geometric-mean policy optimization,

  26. [26]

    URL https://arxiv.org/abs/2507.20673

  27. [27]

    Group Sequence Policy Optimization

    Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, Jingren Zhou, and Junyang Lin. Group sequence policy optimization, 2025. URL https://arxiv.org/abs/2507.18071

  28. [28]

    RLoop: A Self-Improving Framework for Reinforcement Learning with Iterative Policy Initialization

    Zeng Zhiyuan, Jiashuo Liu, Zhangyue Yin, Ge Zhang, Wenhao Huang, and Xipeng Qiu. Rloop: A self-improving framework for reinforcement learning with iterative policy initialization,

  29. [29]

    Appendix A: Why Use Sequence-Count Weights in BA?

    URL https://arxiv.org/abs/2511.04285. Here we briefly justify the choice of weights $k/G$ and $(G-k)/G$ in BA. Recall that under the binary-reward GRPO setting, the normalized advantages are
    $$\hat{A}_i = \sqrt{\frac{G-k}{k}} \ \text{for } i \in S^{+}, \qquad \hat{A}_i = -\sqrt{\frac{k}{G-k}} \ \text{for } i \in S^{-}. \tag{24}$$
    Substituting these values into the BA objective gives $J_{\mathrm{BA}} = \frac{k}{G}\sqrt{\dots}$ …
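The advantage values in Eq. (24) follow from standardizing binary rewards within the group with the population standard deviation, as in standard GRPO normalization. A minimal numeric check, with illustrative values of G and k:

```python
# Sanity check of Eq. (24): with binary rewards and k positives out of G,
# the group-normalized advantage (r_i - mean) / std reduces to
# sqrt((G-k)/k) for positives and -sqrt(k/(G-k)) for negatives.
import math

G, k = 8, 3                         # illustrative group size, positive count
mean = k / G                        # mean of k ones and G-k zeros
std = math.sqrt(mean * (1 - mean))  # population standard deviation
assert math.isclose((1 - mean) / std, math.sqrt((G - k) / k))
assert math.isclose((0 - mean) / std, -math.sqrt(k / (G - k)))
```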