GroupDPO: Memory-Efficient Group-wise Direct Preference Optimization
Pith reviewed 2026-05-10 09:40 UTC · model grok-4.3
The pith
Decoupling samples during backpropagation preserves exact gradients in group-wise preference optimization while cutting peak memory use.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GroupDPO decouples the computation graph during backpropagation to cut peak memory while exactly preserving the gradients of the original group-coupled objective. This makes training with many responses per prompt feasible and yields better alignment performance than single-pair methods; an added NLL term on positive responses is essential for both the gains and training stability.
What carries the argument
The memory-efficient decoupling of samples during backpropagation, which separates forward and backward passes across responses while keeping the mathematical gradient identical to the joint objective.
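The abstract does not spell out the implementation, but a standard way to realize this kind of gradient-preserving decoupling is a two-pass scheme: a no-grad forward pass yields the scalar coefficients dL/d(logp_i), after which each response is re-forwarded and backpropagated separately with its detached coefficient. The PyTorch sketch below is a minimal reconstruction under that assumption; seq_logprob, group_loss, and decoupled_backward are illustrative names, and the softmax contrast in group_loss is a stand-in, not necessarily the paper's objective.

```python
import torch
import torch.nn as nn

def seq_logprob(model, tokens):
    # Sum of next-token log-probs for one response under the model.
    logits = model(tokens[:-1])
    return torch.log_softmax(logits, -1).gather(-1, tokens[1:, None]).sum()

def group_loss(logps):
    # Placeholder group-coupled objective: a softmax contrast treating
    # response 0 as the preferred one. Illustrative stand-in only.
    return -torch.log_softmax(logps, dim=0)[0]

def decoupled_backward(model, group):
    # Pass 1: no-grad forward; no activations are retained, so this
    # step's memory cost does not scale with the group size G.
    with torch.no_grad():
        logps = torch.stack([seq_logprob(model, t) for t in group])

    # Differentiate the scalar loss w.r.t. the log-probs only. This
    # tiny graph yields the exact coefficients c_i = dL/dlogp_i.
    leaf = logps.requires_grad_(True)
    loss = group_loss(leaf)
    coeffs = torch.autograd.grad(loss, leaf)[0]

    # Pass 2: one forward+backward per response, weighted by the
    # detached c_i. By the chain rule the accumulated gradient equals
    # that of the coupled loss, yet only one response's activations
    # are alive at any moment.
    for t, c in zip(group, coeffs):
        (c * seq_logprob(model, t)).backward()
    return loss.detach()
```

On a toy model this runs end to end:

```python
model = nn.Sequential(nn.Embedding(50, 16), nn.Linear(16, 50))
group = [torch.randint(50, (12,)) for _ in range(8)]  # G = 8 responses
decoupled_backward(model, group)  # gradients accumulate in model.parameters()
```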
If this is right
- Using multiple responses per prompt yields better performance than single positive-negative pairs in both offline and online alignment.
- Larger group sizes become feasible without exceeding memory limits.
- Including a negative log-likelihood term on positive responses improves performance and training stability.
Where Pith is reading between the lines
- The same decoupling trick could apply to other multi-sample objectives that currently suffer from coupled backpropagation.
- One could measure whether performance keeps rising as group size increases further or eventually plateaus.
- Datasets that supply dozens of candidate responses per prompt now become practical to use in full.
Load-bearing premise
The decoupling step during backpropagation preserves the exact gradients of the group-wise objective without any approximation error or bias.
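Nothing in the abstract shows why the preservation can be exact rather than approximate, but for any objective that couples the group only through scalar per-response log-probabilities, the chain rule already gives a per-sample decomposition. A hedged reconstruction follows, assuming the objective takes the functional form below; the softmax contrast at the end is one illustrative instance, not confirmed as the paper's loss.

```latex
% Assumed form: the group loss couples responses only through the
% scalar per-response log-probs \ell_i.
\[
  \mathcal{L}(\theta) = f\bigl(\ell_1(\theta), \dots, \ell_G(\theta)\bigr),
  \qquad \ell_i(\theta) = \log \pi_\theta(y_i \mid x).
\]
% Chain rule: the parameter gradient splits into G independent terms,
% each weighted by a scalar coefficient depending only on the \ell_j.
\[
  \nabla_\theta \mathcal{L} = \sum_{i=1}^{G} c_i \, \nabla_\theta \ell_i(\theta),
  \qquad c_i = \frac{\partial f}{\partial \ell_i}.
\]
% The c_i can be computed once from a no-grad forward pass; each term
% c_i \nabla_\theta \ell_i is then backpropagated separately, and the
% sum reproduces the coupled gradient exactly (no approximation).
% Example: for a softmax contrast with preferred index w,
\[
  f(\ell) = -\log \frac{e^{\ell_w}}{\sum_{j=1}^{G} e^{\ell_j}}
  \;\Longrightarrow\;
  c_i = \operatorname{softmax}(\ell)_i - \mathbf{1}[i = w].
\]
```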
What would settle it
Run a small-scale experiment comparing the gradients computed by the decoupled method against the full group-coupled loss on the same batch; if they differ by more than floating-point error, the preservation claim fails.
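Reusing seq_logprob, group_loss, and decoupled_backward from the sketch above, such a check fits in a few lines; torch.allclose with a small tolerance distinguishes floating-point reordering noise from a genuine change of objective:

```python
torch.manual_seed(0)
model = nn.Sequential(nn.Embedding(50, 16), nn.Linear(16, 50))
group = [torch.randint(50, (12,)) for _ in range(4)]

# Reference: ordinary coupled backward, all G graphs held in memory.
loss = group_loss(torch.stack([seq_logprob(model, t) for t in group]))
ref = torch.autograd.grad(loss, list(model.parameters()))

# Candidate: decoupled backward, one response at a time.
model.zero_grad()
decoupled_backward(model, group)

# Differences beyond accumulation-order noise would falsify the
# exact-preservation claim.
for r, p in zip(ref, model.parameters()):
    assert torch.allclose(r, p.grad, atol=1e-6), "gradient mismatch"
```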
Original abstract
Preference optimization is widely used to align Large Language Models (LLMs) with preference feedback. However, most existing methods train on a single positive-negative pair per prompt, discarding additional supervision available in preference datasets that typically contain multiple candidate responses. Motivated by this limitation, recent work explores group-wise preference optimization, which jointly contrasts multiple responses for the same prompt, but its empirical behavior and scalability remain underexplored due to the memory overhead of group-coupled objectives. In this work, we introduce a memory-efficient group-wise preference optimization algorithm that preserves gradients while decoupling samples during backpropagation, substantially reducing peak memory usage, which enables scalable training with larger group sizes. Across both offline and online alignment settings, we show that leveraging multiple responses consistently outperforms single-pair training. Furthermore, incorporating a negative log-likelihood (NLL) term on positive responses is critical for both performance gains and training stability.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces GroupDPO, a memory-efficient group-wise Direct Preference Optimization algorithm that decouples samples during backpropagation while claiming to exactly preserve gradients of the original group-coupled objective. This enables training with larger group sizes from multi-response preference datasets. Empirical results across offline and online alignment settings show that multi-response training outperforms single-pair baselines, with an added negative log-likelihood (NLL) term on positive responses being critical for both performance gains and training stability.
Significance. If the gradient-preservation claim holds without approximation error or bias, the work would be significant for scalable LLM alignment: it directly addresses the memory bottleneck of group-wise objectives, allowing richer use of preference data with multiple responses per prompt. The consistent gains from larger groups, together with the NLL ablation, provide practical guidance for preference-optimization pipelines.
Major comments (2)
- The central technical claim (gradient preservation under sample decoupling) is load-bearing for both the memory-efficiency and performance arguments, yet the abstract provides no derivation, pseudocode, or error analysis. A formal proof or explicit backpropagation expansion showing equivalence to the original group-coupled loss is required; without it, the reported gains could stem from an implicit change to the optimized objective rather than pure efficiency.
- The weakest assumption—that decoupling introduces no bias or approximation error—directly affects the interpretation of the offline/online results. An ablation comparing gradients (or loss values) of the coupled vs. decoupled implementations on a small group size would be needed to confirm exact preservation before claiming scalability benefits.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and for recognizing the potential significance of GroupDPO for scalable preference optimization. We address the two major comments below regarding the gradient preservation claim.
Point-by-point responses
- Referee: The central technical claim (gradient preservation under sample decoupling) is load-bearing for both the memory-efficiency and performance arguments, yet the abstract provides no derivation, pseudocode, or error analysis. A formal proof or explicit backpropagation expansion showing equivalence to the original group-coupled loss is required; without it, the reported gains could stem from an implicit change to the optimized objective rather than pure efficiency.
Authors: We agree that a formal justification of gradient equivalence is essential to support the claims. The current manuscript (Section 3) describes the decoupling approach—computing the full group-coupled loss in the forward pass while decoupling samples only during backpropagation—and states that this yields identical gradients. However, we acknowledge the absence of an explicit derivation or pseudocode in the abstract and main text. In the revised manuscript, we will add a formal proof in the appendix, including the backpropagation expansion for the group-wise loss (e.g., the relevant log-sum-exp terms) demonstrating mathematical equivalence to the coupled objective, along with pseudocode for the implementation. revision: yes
- Referee: The weakest assumption—that decoupling introduces no bias or approximation error—directly affects the interpretation of the offline/online results. An ablation comparing gradients (or loss values) of the coupled vs. decoupled implementations on a small group size would be needed to confirm exact preservation before claiming scalability benefits.
Authors: We concur that empirical confirmation of exact gradient preservation would strengthen the interpretation of the results. We will incorporate an ablation in the revised experiments section that directly compares gradient norms, loss values, and update directions between the original coupled implementation and the decoupled version on small group sizes (e.g., 2 and 4) using a controlled subset of data. This will verify numerical equivalence within floating-point tolerance and rule out any systematic bias. revision: yes
Circularity Check
No circularity: GroupDPO is an original algorithmic reformulation with independent gradient claim
Full rationale
The paper presents GroupDPO as a new backpropagation decoupling technique that claims exact gradient preservation for the group-wise objective. This is an algorithmic engineering step, not a mathematical derivation that reduces to prior results or to quantities fitted by construction. No equations or self-citations are deployed to make the preservation claim true by definition; the central assertion stands as a design choice whose correctness is left to verification (pseudocode or proof), not as a renaming or a self-referential fit. The empirical results on larger groups and the NLL term sit outside the derivation chain, so the method's claims can be checked against external benchmarks rather than against themselves.