Recognition: no theorem link
Random Is Hard to Beat: Active Selection in Online DPO with Modern LLMs
Pith reviewed 2026-05-13 19:45 UTC · model grok-4.3
The pith
Active preference learning yields negligible gains over random sampling in online DPO for modern LLMs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that in the regime of strong pre-trained priors, uncertainty-based active preference learning yields negligible improvements in proxy win-rates compared to random sampling in online DPO. Win-rates can rise even as general capability degrades on standard benchmarks, and active selection does not mitigate this capability collapse or reduce variance significantly better than random. The computational overhead of active selection is therefore difficult to justify against the cheap diversity of simple random samples.
What carries the argument
Uncertainty-based active selection versus random sampling inside online Direct Preference Optimization (DPO), measured by reward-model and LLM-as-a-judge proxy win-rates.
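For orientation, the online loop optimizes the standard DPO objective of Rafailov et al. [30]; the textbook form is sketched below rather than reproduced from the paper, with β the KL-penalty strength, π_θ the policy being trained, and π_ref the frozen reference policy.

```latex
% Standard DPO loss (Rafailov et al., 2023). (x, y_w, y_l) is a prompt with the
% preferred and dispreferred responses; sigma is the logistic function.
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}} \left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right)
    \right]
```

In online DPO the pairs (y_w, y_l) come from on-policy generations labeled by the preference proxy; this candidate pool is exactly what active selection or random sampling draws from.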
Load-bearing premise
The proxy metrics, reward models and LLM-as-a-judge, accurately track alignment gains and capability preservation, and are not themselves distorted by the capability collapse observed on standard benchmarks.
What would settle it
An experiment in which active selection produces substantially higher proxy win-rates than random sampling while also preserving or improving scores on standard capability benchmarks such as MMLU.
Original abstract
Modern LLMs inherit strong priors from web-scale pretraining, which can limit the headroom of post-training data-selection strategies. While Active Preference Learning (APL) seeks to optimize query efficiency in online Direct Preference Optimization (DPO), the inherent richness of on-policy candidate pools often renders simple Random sampling a surprisingly formidable baseline. We evaluate uncertainty-based APL against Random across harmlessness, helpfulness, and instruction-following settings, utilizing both reward models and LLM-as-a-judge proxies. We find that APL yields negligible improvements in proxy win-rates compared to Random. Crucially, we observe a dissociation where win-rate improves even as general capability -- measured by standard benchmarks -- degrades. APL fails to mitigate this capability collapse or reduce variance significantly better than random sampling. Our findings suggest that in the regime of strong pre-trained priors, the computational overhead of active selection is difficult to justify against the "cheap diversity" provided by simple random samples. Our code is available at https://github.com/BootsofLagrangian/random-vs-apl.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents an empirical comparison of uncertainty-based Active Preference Learning (APL) versus simple random sampling in the context of online Direct Preference Optimization (DPO) applied to modern LLMs. Across harmlessness, helpfulness, and instruction-following tasks, it uses reward-model and LLM-as-a-judge proxies to measure alignment via win-rates. The central claims are that APL yields only negligible improvements over random sampling, that proxy win-rates rise even as standard capability benchmarks degrade, and that APL does not meaningfully reduce variance or mitigate this capability collapse. The authors conclude that random sampling's 'cheap diversity' makes the overhead of active selection difficult to justify for models with strong pre-trained priors.
Significance. If the dissociation between proxy win-rates and capability metrics is robust, the work provides a useful cautionary result for the alignment community: sophisticated query-selection strategies may add little value once pre-training priors are already strong. It elevates random sampling as a competitive baseline and underscores the risk of capability degradation during preference optimization. The open-source code link is a positive contribution for reproducibility.
major comments (3)
- [Experiments] Experiments section: the manuscript reports no sample sizes, number of independent runs, or statistical tests (p-values, confidence intervals, or error bars) for the proxy win-rate comparisons. Without these, the claim that APL yields 'negligible improvements' over Random cannot be rigorously evaluated and remains vulnerable to sampling noise.
- [Results] Results (dissociation claim): the observation that win-rates improve while capability benchmarks degrade lacks controls or cross-validation (e.g., human labels, held-out judges, or capability-matched subsets) to rule out the possibility that the same proxies are biased toward shorter or more predictable outputs produced under capability collapse.
- [Method] Method: details on the uncertainty quantification used for APL (e.g., how epistemic uncertainty is estimated from the on-policy candidate pool or which specific acquisition function is applied) are insufficient to allow reproduction or direct comparison with other active-learning variants.
minor comments (2)
- [Abstract] The abstract and introduction could more explicitly state the number of tasks and models evaluated to give readers an immediate sense of experimental scope.
- [Figures] Figure captions should include the exact number of samples or runs underlying each bar or curve for clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which helps strengthen the statistical presentation, clarify potential limitations in the results, and improve reproducibility. We address each major comment below.
Point-by-point responses
-
Referee: [Experiments] Experiments section: the manuscript reports no sample sizes, number of independent runs, or statistical tests (p-values, confidence intervals, or error bars) for the proxy win-rate comparisons. Without these, the claim that APL yields 'negligible improvements' over Random cannot be rigorously evaluated and remains vulnerable to sampling noise.
Authors: We agree that additional statistical details are needed for rigorous evaluation. In the revised manuscript we will report the exact sample sizes used for each win-rate comparison (500 pairs per setting), the number of independent runs (3 random seeds), standard deviation error bars across runs, and 95% confidence intervals. We will also include a note on paired statistical tests confirming that observed differences between APL and Random are not significant. revision: yes
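As a concrete illustration of the promised analysis, the sketch below computes per-seed win-rates with a normal-approximation 95% confidence interval and a paired t-test across seeds; the data, variable names, and the specific choice of test are placeholders rather than the authors' pipeline.

```python
# Illustrative statistics for comparing APL vs. Random proxy win-rates.
# `wins[method]` holds, for each of 3 seeds, 0/1 outcomes over 500 prompts;
# the values below are synthetic placeholders, not results from the paper.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
wins = {
    "apl":    [rng.integers(0, 2, size=500) for _ in range(3)],
    "random": [rng.integers(0, 2, size=500) for _ in range(3)],
}

def summarize(wins_by_seed, label):
    """Win-rate per seed, with mean, std, and a normal-approximation 95% CI."""
    rates = np.array([np.mean(w) for w in wins_by_seed])
    mean, sd = rates.mean(), rates.std(ddof=1)
    half = 1.96 * sd / np.sqrt(len(rates))
    print(f"{label}: {mean:.3f} +/- {sd:.3f}  (95% CI [{mean - half:.3f}, {mean + half:.3f}])")
    return rates

apl_rates = summarize(wins["apl"], "APL")
rnd_rates = summarize(wins["random"], "Random")

# Paired t-test across seeds (each seed is evaluated on the same prompt set).
t, p = stats.ttest_rel(apl_rates, rnd_rates)
print(f"paired t-test over seeds: t = {t:.2f}, p = {p:.3f}")
```

With only three seeds the interval is wide, which is precisely why reporting it matters for a claim of negligible difference.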
-
Referee: [Results] Results (dissociation claim): the observation that win-rates improve while capability benchmarks degrade lacks controls or cross-validation (e.g., human labels, held-out judges, or capability-matched subsets) to rule out the possibility that the same proxies are biased toward shorter or more predictable outputs produced under capability collapse.
Authors: We acknowledge the concern about possible proxy bias. Our current evidence relies on two independent proxy families (reward models and LLM-as-a-judge) and the fact that capability degradation appears on standard, non-proxy benchmarks such as MMLU and GSM8K. In the revision we will add an explicit limitations paragraph discussing the risk that proxies may favor shorter outputs and note that full human validation or held-out judges was outside the scope of this study. We maintain that the dissociation is still informative but treat the lack of additional controls as a limitation. revision: partial
-
Referee: [Method] Method: details on the uncertainty quantification used for APL (e.g., how epistemic uncertainty is estimated from the on-policy candidate pool or which specific acquisition function is applied) are insufficient to allow reproduction or direct comparison with other active-learning variants.
Authors: We agree that the current description is too brief. The revised method section will specify that epistemic uncertainty is estimated via the variance of log-probabilities under the current policy on the on-policy candidate pool and that the acquisition function is standard uncertainty sampling (selecting the k pairs with highest variance). We will also include pseudocode for the full APL loop to enable direct reproduction and comparison. revision: yes
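To make the described selection criterion concrete, here is a minimal sketch of uncertainty sampling over an on-policy candidate pool, assuming a hypothetical helper `logprob(policy, prompt, response)` that returns a response's sequence log-probability under the current policy; it mirrors the description above and is not the authors' released code.

```python
# Minimal sketch of the selection step described in the response (not the authors' code).
import random
import statistics

def uncertainty_select(policy, pool, k, logprob):
    """Uncertainty sampling: keep the k prompts whose on-policy candidate
    responses have the highest variance of log-probabilities."""
    scored = []
    for prompt, candidates in pool:
        lps = [logprob(policy, prompt, y) for y in candidates]
        scored.append((statistics.pvariance(lps), prompt, candidates))
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(prompt, cands) for _, prompt, cands in scored[:k]]

def random_select(pool, k):
    """The baseline the paper favors: k prompts drawn uniformly from the same pool."""
    return random.sample(pool, k)

# Schematic online loop: generate candidates on-policy, pick a batch with either
# strategy, label the chosen pairs with the preference proxy, then run a DPO update.
```

The paper's comparison amounts to swapping `uncertainty_select` for `random_select` inside the same online DPO loop.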
Circularity Check
No significant circularity: empirical comparison against external benchmarks
Full rationale
The paper reports results from direct experimental comparisons of APL versus random sampling in online DPO, measuring proxy win-rates and capability benchmarks. No equations, derivations, or fitted parameters are presented that reduce claims to self-defined quantities or self-citations. All load-bearing statements rest on observed outcomes from runs against standard external proxies and benchmarks, with no internal redefinition or prediction-by-construction. This is a standard empirical study whose central dissociation claim is falsifiable outside the paper's own fitted values.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: proxy win-rates from reward models and LLM judges correlate with true human preferences and alignment quality.
- Domain assumption: standard capability benchmarks measure general ability independently of the alignment objectives being optimized.
Reference graph
Works this paper leans on
-
[1]
Deep batch active learning by diverse, uncertain gradient lower bounds
Jordan T Ash, Chicheng Zhang, Akshay Krishnamurthy, John Langford, and Alekh Agarwal. Deep batch active learning by diverse, uncertain gradient lower bounds. 2020
work page 2020
-
[2]
Uncertainty herding: One active learning method for all label budgets
Wonho Bae, Gabriel L Oliveira, and Danica J Sutherland. Uncertainty herding: One active learning method for all label budgets. 2025
work page 2025
-
[3]
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022
work page · arXiv · 2022
-
[4]
Piqa: Reasoning about physical commonsense in natural language
Yonatan Bisk, Rowan Zellers, Jianfeng Gao, Yejin Choi, et al. Piqa: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp. 7432--7439, 2020
work page 2020
-
[5]
Language models are few-shot learners
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33: 1877--1901, 2020
work page 2020
-
[6]
Alpagasus: Training a better alpaca with fewer data
Lichang Chen, Shiyang Li, Jun Yan, Hai Wang, Kalpa Gunaratna, Vikas Yadav, Zheng Kumar, Yang Liu, Devi Parikh, and Siyu Xu. Alpagasus: Training a better alpaca with fewer data. In International Conference on Learning Representations, 2024
work page 2024
-
[7]
Evaluating large language models trained on code
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...
work page 2021
-
[8]
Deep reinforcement learning from human preferences
Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems, 2017
work page 2017
-
[9]
Boolq: Exploring the surprising difficulty of natural yes/no questions
Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. Boolq: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of NAACL-HLT, pp. 2924--2936, 2019
work page 2019
-
[10]
Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge
Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018
work page · arXiv · 2018
-
[11]
Ultrafeedback: Boosting language models with high-quality feedback
Ganqu Cui, Lifan Yuan, Ning Ding, Guanming Yao, Wei Zhu, Yuan Ni, Guotong Xie, Zhiyuan Liu, and Maosong Sun. Ultrafeedback: Boosting language models with high-quality feedback, 2023
work page 2023
-
[12]
Safe RLHF: Safe Reinforcement Learning from Human Feedback
Josef Dai, Xuehai Pan, Ruiyang Sun, Jiaming Ji, Xinbo Xu, Mickel Liu, Yizhou Wang, and Yaodong Yang. Safe rlhf: Safe reinforcement learning from human feedback. arXiv preprint arXiv:2310.12773, 2023
work page · arXiv · 2023
-
[13]
Less is more: Improving llm alignment via preference data selection
Xun Deng, Han Zhong, Rui Ai, Fuli Feng, Zheng Wang, and Xiangnan He. Less is more: Improving llm alignment via preference data selection. arXiv preprint arXiv:2502.14560, 2025
-
[14]
Alpacafarm: A simulation framework for methods that learn from human feedback
Yann Dubois, Xuechen Li, Rohan Taori, Tianyi Zhang, Ishaan Gulrajani, Jimmy Ba, Carlos Guestrin, Percy Liang, and Tatsunori B Hashimoto. Alpacafarm: A simulation framework for methods that learn from human feedback. In Advances in Neural Information Processing Systems, 2023
work page 2023
-
[15]
Scaling laws for reward model overoptimization
Leo Gao, John Schulman, and Jacob Hilton. Scaling laws for reward model overoptimization. In International Conference on Machine Learning, 2023
work page 2023
-
[16]
The language model evaluation harness, 07 2024. https://zenodo.org/records/12608602
Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac'h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. The languag...
-
[17]
The Llama 3 Herd of Models
Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024
work page · arXiv · 2024
-
[18]
Direct language model alignment from online AI feedback
Shangmin Guo, Biao Zhang, Tianlin Liu, Tianqi Liu, Misha Khalman, Felipe Llinares, Alexandre Rame, Thomas Mesnard, Yao Zhao, Bilal Piot, et al. Direct language model alignment from online ai feedback. arXiv preprint arXiv:2402.04792, 2024
-
[19]
Deberta: Decoding-enhanced bert with disentangled attention
Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. Deberta: Decoding-enhanced bert with disentangled attention. In International Conference on Learning Representations, 2021
work page 2021
-
[20]
Training compute-optimal large language models
Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. In Proceedings of the 36th International Conference on Neural Information Processing Systems, 2022
work page 2022
-
[21]
Lora: Low-rank adaptation of large language models
Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022
work page 2022
-
[22]
Scaling Laws for Neural Language Models
Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020
work page · arXiv · 2020
-
[23]
Active learning for direct preference optimization
Branislav Kveton, Xintong Li, Julian McAuley, Ryan Rossi, Jingbo Shang, Junda Wu, and Tong Yu. Active learning for direct preference optimization. arXiv preprint arXiv:2503.01076, 2025
-
[24]
Skywork-Reward-V2: Scaling preference data curation via human-AI synergy
Chris Yuhao Liu, Liang Zeng, Yuzhen Xiao, Jujie He, Jiacai Liu, Chaojie Wang, Rui Yan, Wei Shen, Fuxiang Zhang, Jiacheng Xu, Yang Liu, and Yahui Zhou. Skywork-reward-v2: Scaling preference data curation via human-ai synergy. arXiv preprint arXiv:2507.01352, 2025
-
[25]
Can a suit of armor conduct electricity? a new dataset for open book question answering
Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? a new dataset for open book question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 2381--2391, 2018
work page 2018
-
[26]
Confronting reward model overoptimization with constrained RLHF
Ted Moskovitz, Aaditya K Singh, DJ Strouse, Tuomas Sandholm, Ruslan Salakhutdinov, Anca D Dragan, and Stephen McAleer. Confronting reward model overoptimization with constrained rlhf. arXiv preprint arXiv:2310.04373, 2023
-
[27]
Active preference learning for large language models
William Muldrew, Peter Hayes, Mingtian Zhang, and David Barber. Active preference learning for large language models. In International Conference on Machine Learning, 2024
work page 2024
-
[28]
Training language models to follow instructions with human feedback
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. In Advances in neural information processing systems, 2022
work page 2022
-
[29]
Qwen, :, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li,...
work page · arXiv · 2025
-
[30]
Direct preference optimization: Your language model is secretly a reward model
Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36: 53728--53741, 2023
work page 2023
-
[31]
Gpqa: A graduate-level google-proof q&a benchmark
David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R Bowman. Gpqa: A graduate-level google-proof q&a benchmark. In COLM, 2024
work page 2024
-
[32]
Winogrande: An adversarial winograd schema challenge at scale
Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: An adversarial winograd schema challenge at scale. Communications of the ACM, 64(9): 99--106, 2021
work page 2021
-
[33]
Active hidden Markov models for information extraction
Tobias Scheffer, Christian Decomain, and Stefan Wrobel. Active hidden Markov models for information extraction. In ISIDA, 2001
work page 2001
-
[34]
Active learning for convolutional neural networks: A core-set approach
Ozan Sener and Silvio Savarese. Active learning for convolutional neural networks: A core-set approach. In ICLR, 2018
work page 2018
-
[35]
Active learning literature survey
Burr Settles. Active learning literature survey. Technical report, University of Wisconsin-Madison Department of Computer Sciences, 2009
work page 2009
-
[36]
Gemma: Open Models Based on Gemini Research and Technology
Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, et al. Gemma: Open models based on gemini research and technology. arXiv preprint arXiv:2403.08295, 2024
work page · arXiv · 2024
-
[37]
A new active labeling method for deep learning
Dan Wang and Yi Shang. A new active labeling method for deep learning. In IJCNN, 2014
work page 2014
-
[38]
Bpo: Staying close to the behavior llm creates better online llm alignment
Wenda Xu, Jiachen Li, William Yang Wang, and Lei Li. Bpo: Staying close to the behavior llm creates better online llm alignment. In Empirical Methods in Natural Language Processing, 2024
work page 2024
-
[39]
Qwen3 Technical Report
An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025
work page · arXiv · 2025
-
[40]
Active learning through a covering lens
Ofer Yehuda, Avihu Dekel, Guy Hacohen, and Daphna Weinshall. Active learning through a covering lens. In NeurIPS, 2022
work page 2022
-
[41]
Hellaswag: Can a machine really finish your sentence?
Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 4791--4800, 2019
work page 2019
-
[42]
Lima: Less is more for alignment
Chunting Zhou, Pengfei Liu, Puxin Xu, Srini Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma, Avia Efrat, Ping Yu, Lili Yu, et al. Lima: Less is more for alignment. In Advances in Neural Information Processing Systems, 2023a
work page 2023
-
[43]
Instruction-Following Evaluation for Large Language Models
Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. Instruction-following evaluation for large language models. arXiv preprint arXiv:2311.07911, 2023b
work page · arXiv · 2023