Recognition: no theorem link
Random Is Hard to Beat: Active Selection in Online DPO with Modern LLMs
Pith reviewed 2026-05-13 19:45 UTC · model grok-4.3
The pith
Active preference learning yields negligible gains over random sampling in online DPO for modern LLMs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that in the regime of strong pre-trained priors, uncertainty-based active preference learning yields negligible improvements in proxy win-rates compared to random sampling in online DPO. Win-rates can rise even as general capability degrades on standard benchmarks, and active selection does not mitigate this capability collapse or reduce variance significantly better than random. The computational overhead of active selection is therefore difficult to justify against the cheap diversity of simple random samples.
What carries the argument
Uncertainty-based active selection versus random sampling inside online Direct Preference Optimization (DPO), measured by reward-model and LLM-as-a-judge proxy win-rates.
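For orientation, the online loop optimizes the standard DPO objective of Rafailov et al. [30]; the textbook form is sketched below rather than reproduced from the paper, with β the KL-penalty strength, π_θ the policy being trained, and π_ref the frozen reference policy.

```latex
% Standard DPO loss (Rafailov et al., 2023). (x, y_w, y_l) is a prompt with the
% preferred and dispreferred responses; sigma is the logistic function.
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}} \left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right)
    \right]
```

In online DPO the pairs (y_w, y_l) come from on-policy generations labeled by the preference proxy; this candidate pool is exactly what active selection or random sampling draws from.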
Load-bearing premise
The proxy metrics, reward models and LLM-as-a-judge, accurately track alignment gains and capability preservation, and are not themselves distorted by the capability collapse observed on standard benchmarks.
What would settle it
An experiment in which active selection produces substantially higher proxy win-rates than random sampling while also preserving or improving scores on standard capability benchmarks such as MMLU.
Original abstract
Modern LLMs inherit strong priors from web-scale pretraining, which can limit the headroom of post-training data-selection strategies. While Active Preference Learning (APL) seeks to optimize query efficiency in online Direct Preference Optimization (DPO), the inherent richness of on-policy candidate pools often renders simple Random sampling a surprisingly formidable baseline. We evaluate uncertainty-based APL against Random across harmlessness, helpfulness, and instruction-following settings, utilizing both reward models and LLM-as-a-judge proxies. We find that APL yields negligible improvements in proxy win-rates compared to Random. Crucially, we observe a dissociation where win-rate improves even as general capability -- measured by standard benchmarks -- degrades. APL fails to mitigate this capability collapse or reduce variance significantly better than random sampling. Our findings suggest that in the regime of strong pre-trained priors, the computational overhead of active selection is difficult to justify against the "cheap diversity" provided by simple random samples. Our code is available at https://github.com/BootsofLagrangian/random-vs-apl.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents an empirical comparison of uncertainty-based Active Preference Learning (APL) versus simple random sampling in the context of online Direct Preference Optimization (DPO) applied to modern LLMs. Across harmlessness, helpfulness, and instruction-following tasks, it uses reward-model and LLM-as-a-judge proxies to measure alignment via win-rates. The central claims are that APL yields only negligible improvements over random sampling, that proxy win-rates rise even as standard capability benchmarks degrade, and that APL does not meaningfully reduce variance or mitigate this capability collapse. The authors conclude that random sampling's 'cheap diversity' makes the overhead of active selection difficult to justify for models with strong pre-trained priors.
Significance. If the dissociation between proxy win-rates and capability metrics is robust, the work provides a useful cautionary result for the alignment community: sophisticated query-selection strategies may add little value once pre-training priors are already strong. It elevates random sampling as a competitive baseline and underscores the risk of capability degradation during preference optimization. The open-source code link is a positive contribution for reproducibility.
major comments (3)
- [Experiments] Experiments section: the manuscript reports no sample sizes, number of independent runs, or statistical tests (p-values, confidence intervals, or error bars) for the proxy win-rate comparisons. Without these, the claim that APL yields 'negligible improvements' over Random cannot be rigorously evaluated and remains vulnerable to sampling noise.
- [Results] Results (dissociation claim): the observation that win-rates improve while capability benchmarks degrade lacks controls or cross-validation (e.g., human labels, held-out judges, or capability-matched subsets) to rule out the possibility that the same proxies are biased toward shorter or more predictable outputs produced under capability collapse.
- [Method] Method: details on the uncertainty quantification used for APL (e.g., how epistemic uncertainty is estimated from the on-policy candidate pool or which specific acquisition function is applied) are insufficient to allow reproduction or direct comparison with other active-learning variants.
minor comments (2)
- [Abstract] The abstract and introduction could more explicitly state the number of tasks and models evaluated to give readers an immediate sense of experimental scope.
- [Figures] Figure captions should include the exact number of samples or runs underlying each bar or curve for clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which helps strengthen the statistical presentation, clarify potential limitations in the results, and improve reproducibility. We address each major comment below.
Point-by-point responses
-
Referee: [Experiments] Experiments section: the manuscript reports no sample sizes, number of independent runs, or statistical tests (p-values, confidence intervals, or error bars) for the proxy win-rate comparisons. Without these, the claim that APL yields 'negligible improvements' over Random cannot be rigorously evaluated and remains vulnerable to sampling noise.
Authors: We agree that additional statistical details are needed for rigorous evaluation. In the revised manuscript we will report the exact sample sizes used for each win-rate comparison (500 pairs per setting), the number of independent runs (3 random seeds), standard deviation error bars across runs, and 95% confidence intervals. We will also include a note on paired statistical tests confirming that observed differences between APL and Random are not significant. revision: yes
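As a concrete illustration of the promised analysis, the sketch below computes per-seed win-rates with a normal-approximation 95% confidence interval and a paired t-test across seeds; the data, variable names, and the specific choice of test are placeholders rather than the authors' pipeline.

```python
# Illustrative statistics for comparing APL vs. Random proxy win-rates.
# `wins[method]` holds, for each of 3 seeds, 0/1 outcomes over 500 prompts;
# the values below are synthetic placeholders, not results from the paper.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
wins = {
    "apl":    [rng.integers(0, 2, size=500) for _ in range(3)],
    "random": [rng.integers(0, 2, size=500) for _ in range(3)],
}

def summarize(wins_by_seed, label):
    """Win-rate per seed, with mean, std, and a normal-approximation 95% CI."""
    rates = np.array([np.mean(w) for w in wins_by_seed])
    mean, sd = rates.mean(), rates.std(ddof=1)
    half = 1.96 * sd / np.sqrt(len(rates))
    print(f"{label}: {mean:.3f} +/- {sd:.3f}  (95% CI [{mean - half:.3f}, {mean + half:.3f}])")
    return rates

apl_rates = summarize(wins["apl"], "APL")
rnd_rates = summarize(wins["random"], "Random")

# Paired t-test across seeds (each seed is evaluated on the same prompt set).
t, p = stats.ttest_rel(apl_rates, rnd_rates)
print(f"paired t-test over seeds: t = {t:.2f}, p = {p:.3f}")
```

With only three seeds the interval is wide, which is precisely why reporting it matters for a claim of negligible difference.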
-
Referee: [Results] Results (dissociation claim): the observation that win-rates improve while capability benchmarks degrade lacks controls or cross-validation (e.g., human labels, held-out judges, or capability-matched subsets) to rule out the possibility that the same proxies are biased toward shorter or more predictable outputs produced under capability collapse.
Authors: We acknowledge the concern about possible proxy bias. Our current evidence relies on two independent proxy families (reward models and LLM-as-a-judge) and the fact that capability degradation appears on standard, non-proxy benchmarks such as MMLU and GSM8K. In the revision we will add an explicit limitations paragraph discussing the risk that proxies may favor shorter outputs and note that full human validation or held-out judges was outside the scope of this study. We maintain that the dissociation is still informative but treat the lack of additional controls as a limitation. revision: partial
-
Referee: [Method] Method: details on the uncertainty quantification used for APL (e.g., how epistemic uncertainty is estimated from the on-policy candidate pool or which specific acquisition function is applied) are insufficient to allow reproduction or direct comparison with other active-learning variants.
Authors: We agree that the current description is too brief. The revised method section will specify that epistemic uncertainty is estimated via the variance of log-probabilities under the current policy on the on-policy candidate pool and that the acquisition function is standard uncertainty sampling (selecting the k pairs with highest variance). We will also include pseudocode for the full APL loop to enable direct reproduction and comparison. revision: yes
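To make the described selection criterion concrete, here is a minimal sketch of uncertainty sampling over an on-policy candidate pool, assuming a hypothetical helper `logprob(policy, prompt, response)` that returns a response's sequence log-probability under the current policy; it mirrors the description above and is not the authors' released code.

```python
# Minimal sketch of the selection step described in the response (not the authors' code).
import random
import statistics

def uncertainty_select(policy, pool, k, logprob):
    """Uncertainty sampling: keep the k prompts whose on-policy candidate
    responses have the highest variance of log-probabilities."""
    scored = []
    for prompt, candidates in pool:
        lps = [logprob(policy, prompt, y) for y in candidates]
        scored.append((statistics.pvariance(lps), prompt, candidates))
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(prompt, cands) for _, prompt, cands in scored[:k]]

def random_select(pool, k):
    """The baseline the paper favors: k prompts drawn uniformly from the same pool."""
    return random.sample(pool, k)

# Schematic online loop: generate candidates on-policy, pick a batch with either
# strategy, label the chosen pairs with the preference proxy, then run a DPO update.
```

The paper's comparison amounts to swapping `uncertainty_select` for `random_select` inside the same online DPO loop.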
Circularity Check
No significant circularity: empirical comparison against external benchmarks
Full rationale
The paper reports results from direct experimental comparisons of APL versus random sampling in online DPO, measuring proxy win-rates and capability benchmarks. No equations, derivations, or fitted parameters are presented that reduce claims to self-defined quantities or self-citations. All load-bearing statements rest on observed outcomes from runs against standard external proxies and benchmarks, with no internal redefinition or prediction-by-construction. This is a standard empirical study whose central dissociation claim is falsifiable outside the paper's own fitted values.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: proxy win-rates from reward models and LLM judges correlate with true human preferences and alignment quality.
- Domain assumption: standard capability benchmarks measure general ability independently of the alignment objectives being optimized.
Reference graph
Works this paper leans on
-
[1]
Deep batch active learning by diverse, uncertain gradient lower bounds
Jordan T Ash, Chicheng Zhang, Akshay Krishnamurthy, John Langford, and Alekh Agarwal. Deep batch active learning by diverse, uncertain gradient lower bounds. 2020
work page 2020
-
[2]
Uncertainty herding: One active learning method for all label budgets
Wonho Bae, Gabriel L Oliveira, and Danica J Sutherland. Uncertainty herding: One active learning method for all label budgets. 2025
work page 2025
-
[3]
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022
work page · arXiv · 2022
-
[4]
Piqa: Reasoning about physical commonsense in natural language
Yonatan Bisk, Rowan Zellers, Jianfeng Gao, Yejin Choi, et al. Piqa: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp. 7432--7439, 2020
work page 2020
-
[5]
Language models are few-shot learners
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33: 1877--1901, 2020
work page 2020
-
[6]
Alpagasus: Training a better alpaca with fewer data
Lichang Chen, Shiyang Li, Jun Yan, Hai Wang, Kalpa Gunaratna, Vikas Yadav, Zheng Kumar, Yang Liu, Devi Parikh, and Siyu Xu. Alpagasus: Training a better alpaca with fewer data. In International Conference on Learning Representations, 2024
work page 2024
-
[7]
Evaluating large language models trained on code
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...
work page 2021
-
[8]
Deep reinforcement learning from human preferences
Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems, 2017
work page 2017
-
[9]
Boolq: Exploring the surprising difficulty of natural yes/no questions
Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. Boolq: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of NAACL-HLT, pp. 2924--2936, 2019
work page 2019
-
[10]
Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge
Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018
work page · arXiv · 2018
-
[11]
Ultrafeedback: Boosting language models with high-quality feedback
Ganqu Cui, Lifan Yuan, Ning Ding, Guanming Yao, Wei Zhu, Yuan Ni, Guotong Xie, Zhiyuan Liu, and Maosong Sun. Ultrafeedback: Boosting language models with high-quality feedback, 2023
work page 2023
-
[12]
Safe RLHF: Safe Reinforcement Learning from Human Feedback
Josef Dai, Xuehai Pan, Ruiyang Sun, Jiaming Ji, Xinbo Xu, Mickel Liu, Yizhou Wang, and Yaodong Yang. Safe rlhf: Safe reinforcement learning from human feedback. arXiv preprint arXiv:2310.12773, 2023
work page · arXiv · 2023
-
[13]
Less is more: Improving llm alignment via preference data selection
Xun Deng, Han Zhong, Rui Ai, Fuli Feng, Zheng Wang, and Xiangnan He. Less is more: Improving llm alignment via preference data selection. arXiv preprint arXiv:2502.14560, 2025
-
[14]
Alpacafarm: A simulation framework for methods that learn from human feedback
Yann Dubois, Xuechen Li, Rohan Taori, Tianyi Zhang, Ishaan Gulrajani, Jimmy Ba, Carlos Guestrin, Percy Liang, and Tatsunori B Hashimoto. Alpacafarm: A simulation framework for methods that learn from human feedback. In Advances in Neural Information Processing Systems, 2023
work page 2023
-
[15]
Scaling laws for reward model overoptimization
Leo Gao, John Schulman, and Jacob Hilton. Scaling laws for reward model overoptimization. In International Conference on Machine Learning, 2023
work page 2023
-
[16]
The language model evaluation harness, 07 2024. https://zenodo.org/records/12608602
Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac'h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. The languag...
-
[17]
The Llama 3 Herd of Models
Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024
work page · arXiv · 2024
-
[18]
Direct language model alignment from online AI feedback
Shangmin Guo, Biao Zhang, Tianlin Liu, Tianqi Liu, Misha Khalman, Felipe Llinares, Alexandre Rame, Thomas Mesnard, Yao Zhao, Bilal Piot, et al. Direct language model alignment from online ai feedback. arXiv preprint arXiv:2402.04792, 2024
-
[19]
Deberta: Decoding-enhanced bert with disentangled attention
Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. Deberta: Decoding-enhanced bert with disentangled attention. In International Conference on Learning Representations, 2021
work page 2021
-
[20]
Training compute-optimal large language models
Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. In Proceedings of the 36th International Conference on Neural Information Processing Systems, 2022
work page 2022
-
[21]
Lora: Low-rank adaptation of large language models
Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022
work page 2022
-
[22]
Scaling Laws for Neural Language Models
Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020
work page · arXiv · 2020
-
[23]
Active learning for direct preference optimization
Branislav Kveton, Xintong Li, Julian McAuley, Ryan Rossi, Jingbo Shang, Junda Wu, and Tong Yu. Active learning for direct preference optimization. arXiv preprint arXiv:2503.01076, 2025
-
[24]
Skywork-Reward-V2: Scaling preference data curation via human-AI synergy
Chris Yuhao Liu, Liang Zeng, Yuzhen Xiao, Jujie He, Jiacai Liu, Chaojie Wang, Rui Yan, Wei Shen, Fuxiang Zhang, Jiacheng Xu, Yang Liu, and Yahui Zhou. Skywork-reward-v2: Scaling preference data curation via human-ai synergy. arXiv preprint arXiv:2507.01352, 2025
-
[25]
Can a suit of armor conduct electricity? a new dataset for open book question answering
Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? a new dataset for open book question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 2381--2391, 2018
work page 2018
-
[26]
Confronting reward model overoptimization with constrained RLHF
Ted Moskovitz, Aaditya K Singh, DJ Strouse, Tuomas Sandholm, Ruslan Salakhutdinov, Anca D Dragan, and Stephen McAleer. Confronting reward model overoptimization with constrained rlhf. arXiv preprint arXiv:2310.04373, 2023
-
[27]
Active preference learning for large language models
William Muldrew, Peter Hayes, Mingtian Zhang, and David Barber. Active preference learning for large language models. In International Conference on Machine Learning, 2024
work page 2024
-
[28]
Training language models to follow instructions with human feedback
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. In Advances in neural information processing systems, 2022
work page 2022
-
[29]
Qwen, :, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li,...
work page · arXiv · 2025
-
[30]
Direct preference optimization: Your language model is secretly a reward model
Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36: 53728--53741, 2023
work page 2023
-
[31]
Gpqa: A graduate-level google-proof q&a benchmark
David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R Bowman. Gpqa: A graduate-level google-proof q&a benchmark. In COLM, 2024
work page 2024
-
[32]
Winogrande: An adversarial winograd schema challenge at scale
Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: An adversarial winograd schema challenge at scale. Communications of the ACM, 64(9): 99--106, 2021
work page 2021
-
[33]
Active hidden Markov models for information extraction
Tobias Scheffer, Christian Decomain, and Stefan Wrobel. Active hidden Markov models for information extraction. In ISIDA, 2001
work page 2001
-
[34]
Active learning for convolutional neural networks: A core-set approach
Ozan Sener and Silvio Savarese. Active learning for convolutional neural networks: A core-set approach. In ICLR, 2018
work page 2018
-
[35]
Active learning literature survey
Burr Settles. Active learning literature survey. Technical report, University of Wisconsin-Madison Department of Computer Sciences, 2009
work page 2009
-
[36]
Gemma: Open Models Based on Gemini Research and Technology
Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, et al. Gemma: Open models based on gemini research and technology. arXiv preprint arXiv:2403.08295, 2024
work page · arXiv · 2024
-
[37]
A new active labeling method for deep learning
Dan Wang and Yi Shang. A new active labeling method for deep learning. In IJCNN, 2014
work page 2014
-
[38]
Bpo: Staying close to the behavior llm creates better online llm alignment
Wenda Xu, Jiachen Li, William Yang Wang, and Lei Li. Bpo: Staying close to the behavior llm creates better online llm alignment. In Empirical Methods in Natural Language Processing, 2024
work page 2024
-
[39]
Qwen3 Technical Report
An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025
work page · arXiv · 2025
-
[40]
Active learning through a covering lens
Ofer Yehuda, Avihu Dekel, Guy Hacohen, and Daphna Weinshall. Active learning through a covering lens. In NeurIPS, 2022
work page 2022
-
[41]
Hellaswag: Can a machine really finish your sentence?
Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 4791--4800, 2019
work page 2019
-
[42]
Lima: Less is more for alignment
Chunting Zhou, Pengfei Liu, Puxin Xu, Srini Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma, Avia Efrat, Ping Yu, Lili Yu, et al. Lima: Less is more for alignment. In Advances in Neural Information Processing Systems, 2023a
work page 2023
-
[43]
Instruction-Following Evaluation for Large Language Models
Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. Instruction-following evaluation for large language models. arXiv preprint arXiv:2311.07911, 2023b
work page · arXiv · 2023