pith. machine review for the scientific record.

arxiv: 2604.02766 · v1 · submitted 2026-04-03 · 💻 cs.LG · cs.AI

Recognition: no theorem link

Random Is Hard to Beat: Active Selection in online DPO with Modern LLMs

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 19:45 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords active preference learning · direct preference optimization · random sampling · LLM alignment · capability degradation · online DPO · proxy win-rates

The pith

Active preference learning adds negligible gains over random sampling in online DPO for modern LLMs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Modern LLMs carry strong priors from web-scale pretraining that limit how much careful data selection can improve post-training results. The paper compares uncertainty-based active preference learning against simple random sampling when selecting on-policy data for online direct preference optimization. Across harmlessness, helpfulness, and instruction-following tasks, and using both reward models and LLM-as-a-judge evaluators, active methods show almost no improvement in proxy win-rates. Alignment gains appear alongside drops in general capabilities measured by standard benchmarks, and active selection fails to reduce variance or prevent this degradation any better than random. The findings indicate that the extra computation of active selection is hard to justify when random sampling already supplies sufficient variety.

Core claim

The central claim is that in the regime of strong pre-trained priors, uncertainty-based active preference learning yields negligible improvements in proxy win-rates compared to random sampling in online DPO. Win-rates can rise even as general capability degrades on standard benchmarks, and active selection does not mitigate this capability collapse or reduce variance significantly better than random. The computational overhead of active selection is therefore difficult to justify against the cheap diversity of simple random samples.

What carries the argument

Uncertainty-based active selection versus random sampling inside online Direct Preference Optimization (DPO), measured by reward-model and LLM-as-a-judge proxy win-rates.
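
A minimal sketch of the comparison being run, assuming a toy candidate pool, fake sequence log-probabilities, and a simple margin-based uncertainty proxy; the loss is the standard DPO objective of Rafailov et al. [30], but nothing else here reproduces the paper's pipeline.

```python
# Sketch: one online DPO update with a pluggable selection rule.
# The pool, log-probs, and uncertainty proxy are toy assumptions.
import torch
import torch.nn.functional as F

def dpo_loss(logp_c, logp_r, ref_c, ref_r, beta=0.1):
    """Standard DPO objective on sequence log-probs (policy vs. frozen reference)."""
    margin = beta * ((logp_c - ref_c) - (logp_r - ref_r))
    return -F.logsigmoid(margin).mean()

def select_random(pool_size, k):
    return torch.randperm(pool_size)[:k]

def select_uncertain(scores, k):
    # Uncertainty sampling: keep the k candidate pairs with the highest score.
    return torch.topk(scores, k).indices

pool = 64  # toy on-policy candidate pool of preference pairs
logp_c, logp_r = torch.randn(pool), torch.randn(pool)
ref_c, ref_r = torch.randn(pool), torch.randn(pool)
uncertainty = -(logp_c - logp_r).abs()  # smaller |margin| = less decided = higher score

for name, idx in [("random", select_random(pool, 8)),
                  ("apl", select_uncertain(uncertainty, 8))]:
    print(name, float(dpo_loss(logp_c[idx], logp_r[idx], ref_c[idx], ref_r[idx])))
```

Restated in these terms, the paper's finding is that swapping select_uncertain for select_random barely moves downstream win-rates at realistic scale.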

Load-bearing premise

The proxy metrics of reward models and LLM-as-a-judge accurately track both alignment gains and capability preservation without being distorted by the same capability collapse.

What would settle it

An experiment in which active selection produces substantially higher proxy win-rates than random sampling while also preserving or improving scores on standard capability benchmarks such as MMLU.

Figures

Figures reproduced from arXiv: 2604.02766 by Giyeong Oh, Jaehyun Park, Junghyun Lee, Junhyug Noh, Wonho Bae, Youngjae Yu.

Figure 1: Harmlessness alignment stability (Pareto frontier).
Figure 2: Qwen3-1.7B across datasets and judges. DeBERTa: APL underperforms RANDOM despite comparable or higher proxy win-rates. Skywork: no statistically significant difference between APL and RANDOM. GPT-5-mini: APL performs worse than RANDOM under the same budget.
Figure 3: Judge scaling (GPT-5 family). We perform online DPO with Qwen2.5-7B using the GPT-5 family as both annotator and evaluator on Ultrafeedback. Moreover, APL incurs approximately 20.2× wall-clock overhead per query–update cycle compared to RANDOM (Appendix C.4), making even marginal gains difficult to justify in practice.
read the original abstract

Modern LLMs inherit strong priors from web-scale pretraining, which can limit the headroom of post-training data-selection strategies. While Active Preference Learning (APL) seeks to optimize query efficiency in online Direct Preference Optimization (DPO), the inherent richness of on-policy candidate pools often renders simple Random sampling a surprisingly formidable baseline. We evaluate uncertainty-based APL against Random across harmlessness, helpfulness, and instruction-following settings, utilizing both reward models and LLM-as-a-judge proxies. We find that APL yields negligible improvements in proxy win-rates compared to Random. Crucially, we observe a dissociation where win-rate improves even as general capability -- measured by standard benchmarks -- degrades. APL fails to mitigate this capability collapse or reduce variance significantly better than random sampling. Our findings suggest that in the regime of strong pre-trained priors, the computational overhead of active selection is difficult to justify against the "cheap diversity" provided by simple random samples. Our code is available at https://github.com/BootsofLagrangian/random-vs-apl.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper presents an empirical comparison of uncertainty-based Active Preference Learning (APL) versus simple random sampling in the context of online Direct Preference Optimization (DPO) applied to modern LLMs. Across harmlessness, helpfulness, and instruction-following tasks, it uses reward-model and LLM-as-a-judge proxies to measure alignment via win-rates. The central claims are that APL yields only negligible improvements over random sampling, that proxy win-rates rise even as standard capability benchmarks degrade, and that APL does not meaningfully reduce variance or mitigate this capability collapse. The authors conclude that random sampling's 'cheap diversity' makes the overhead of active selection difficult to justify for models with strong pre-trained priors.

Significance. If the dissociation between proxy win-rates and capability metrics is robust, the work provides a useful cautionary result for the alignment community: sophisticated query-selection strategies may add little value once pre-training priors are already strong. It elevates random sampling as a competitive baseline and underscores the risk of capability degradation during preference optimization. The open-source code link is a positive contribution for reproducibility.

major comments (3)
  1. [Experiments] Experiments section: the manuscript reports no sample sizes, number of independent runs, or statistical tests (p-values, confidence intervals, or error bars) for the proxy win-rate comparisons. Without these, the claim that APL yields 'negligible improvements' over Random cannot be rigorously evaluated and remains vulnerable to sampling noise.
  2. [Results] Results (dissociation claim): the observation that win-rates improve while capability benchmarks degrade lacks controls or cross-validation (e.g., human labels, held-out judges, or capability-matched subsets) to rule out the possibility that the same proxies are biased toward shorter or more predictable outputs produced under capability collapse.
  3. [Method] Method: details on the uncertainty quantification used for APL (e.g., how epistemic uncertainty is estimated from the on-policy candidate pool or which specific acquisition function is applied) are insufficient to allow reproduction or direct comparison with other active-learning variants.
minor comments (2)
  1. [Abstract] The abstract and introduction could more explicitly state the number of tasks and models evaluated to give readers an immediate sense of experimental scope.
  2. [Figures] Figure captions should include the exact number of samples or runs underlying each bar or curve for clarity.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback, which helps strengthen the statistical presentation, clarify potential limitations in the results, and improve reproducibility. We address each major comment below.

read point-by-point responses
  1. Referee: [Experiments] Experiments section: the manuscript reports no sample sizes, number of independent runs, or statistical tests (p-values, confidence intervals, or error bars) for the proxy win-rate comparisons. Without these, the claim that APL yields 'negligible improvements' over Random cannot be rigorously evaluated and remains vulnerable to sampling noise.

    Authors: We agree that additional statistical details are needed for rigorous evaluation. In the revised manuscript we will report the exact sample sizes used for each win-rate comparison (500 pairs per setting), the number of independent runs (3 random seeds), standard deviation error bars across runs, and 95% confidence intervals. We will also include a note on paired statistical tests confirming that observed differences between APL and Random are not statistically significant; one such test is sketched after this list. revision: yes

  2. Referee: [Results] Results (dissociation claim): the observation that win-rates improve while capability benchmarks degrade lacks controls or cross-validation (e.g., human labels, held-out judges, or capability-matched subsets) to rule out the possibility that the same proxies are biased toward shorter or more predictable outputs produced under capability collapse.

    Authors: We acknowledge the concern about possible proxy bias. Our current evidence relies on two independent proxy families (reward models and LLM-as-a-judge) and the fact that capability degradation appears on standard, non-proxy benchmarks such as MMLU and GSM8K. In the revision we will add an explicit limitations paragraph discussing the risk that proxies may favor shorter outputs, and note that full human validation and held-out judges were outside the scope of this study. We maintain that the dissociation is still informative but treat the lack of additional controls as a limitation. revision: partial

  3. Referee: [Method] Method: details on the uncertainty quantification used for APL (e.g., how epistemic uncertainty is estimated from the on-policy candidate pool or which specific acquisition function is applied) are insufficient to allow reproduction or direct comparison with other active-learning variants.

    Authors: We agree that the current description is too brief. The revised method section will specify that epistemic uncertainty is estimated via the variance of log-probabilities under the current policy on the on-policy candidate pool and that the acquisition function is standard uncertainty sampling (selecting the k pairs with highest variance). We will also include pseudocode for the full APL loop to enable direct reproduction and comparison. revision: yes
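
A sketch of the acquisition rule described in response 3, assuming m sampled completions per candidate prompt; m, the pool size, and the tensor shapes are illustrative, not taken from the paper.

```python
# Uncertainty sampling as described in the rebuttal: per-candidate epistemic
# uncertainty is the variance of sequence log-probabilities under the current
# policy; the k highest-variance candidates are selected.
import torch

def acquire(logps: torch.Tensor, k: int) -> torch.Tensor:
    """logps: [pool, m] log-probs of m on-policy samples per candidate prompt."""
    return torch.topk(logps.var(dim=1), k).indices

pool_logps = torch.randn(64, 4)  # toy pool: 64 candidates, m=4 samples each
print(acquire(pool_logps, k=8))
```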
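
For response 1, a hedged sketch of one paired test that could back the "not statistically significant" note: a paired bootstrap over per-prompt win indicators. Only the 500-pair sample size comes from the rebuttal; the outcome probabilities are placeholders.

```python
# Paired bootstrap on per-prompt win indicators (1 = judged better than baseline).
# Placeholder data; only n=500 matches the sample size quoted in the rebuttal.
import numpy as np

rng = np.random.default_rng(0)
n = 500
wins_apl = rng.binomial(1, 0.52, n)   # hypothetical APL outcomes
wins_rand = rng.binomial(1, 0.51, n)  # hypothetical Random outcomes

diffs = wins_apl - wins_rand
boot = np.array([rng.choice(diffs, n, replace=True).mean() for _ in range(10_000)])
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"win-rate gap {diffs.mean():+.3f}, 95% CI [{lo:+.3f}, {hi:+.3f}]")
# A confidence interval straddling zero supports the 'negligible gains' reading.
```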

Circularity Check

0 steps flagged

No significant circularity: empirical comparison against external benchmarks

full rationale

The paper reports results from direct experimental comparisons of APL versus random sampling in online DPO, measuring proxy win-rates and capability benchmarks. No equations, derivations, or fitted parameters are presented that reduce claims to self-defined quantities or self-citations. All load-bearing statements rest on observed outcomes from runs against standard external proxies and benchmarks, with no internal redefinition or prediction-by-construction. This is a standard empirical study whose central dissociation claim is falsifiable outside the paper's own fitted values.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the validity of proxy metrics and standard assumptions about benchmark independence; no new free parameters or invented entities are introduced.

axioms (2)
  • domain assumption: Proxy win-rates from reward models and LLM judges correlate with true human preferences and alignment quality
    Invoked when reporting improvements and when claiming negligible gains for APL.
  • domain assumption: Standard capability benchmarks measure general ability independently of the alignment objectives being optimized
    Used to support the dissociation claim between win-rate and capability.

pith-pipeline@v0.9.0 · 5499 in / 1336 out tokens · 44254 ms · 2026-05-13T19:45:14.053379+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

43 extracted references · 43 canonical work pages · 9 internal anchors

  1. [1]

    Deep batch active learning by diverse, uncertain gradient lower bounds

    Jordan T Ash, Chicheng Zhang, Akshay Krishnamurthy, John Langford, and Alekh Agarwal. Deep batch active learning by diverse, uncertain gradient lower bounds. 2020

  2. [2]

    Uncertainty herding: One active learning method for all label budgets

    Wonho Bae, Gabriel L Oliveira, and Danica J Sutherland. Uncertainty herding: One active learning method for all label budgets. 2025

  3. [3]

    Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

    Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022

  4. [4]

    Piqa: Reasoning about physical commonsense in natural language

    Yonatan Bisk, Rowan Zellers, Jianfeng Gao, Yejin Choi, et al. Piqa: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pp. 7432--7439, 2020

  5. [5]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877--1901, 2020

  6. [6]

    Alpagasus: Training a better alpaca with fewer data

    Lichang Chen, Shiyang Li, Jun Yan, Hai Wang, Kalpa Gunaratna, Vikas Yadav, Zheng Kumar, Yang Liu, Devi Parikh, and Siyu Xu. Alpagasus: Training a better alpaca with fewer data. In International Conference on Learning Representations, 2024

  7. [7]

    Evaluating large language models trained on code

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...

  8. [8]

    Deep reinforcement learning from human preferences

    Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems, 2017

  9. [9]

    Boolq: Exploring the surprising difficulty of natural yes/no questions

    Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. Boolq: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of NAACL-HLT, pp. 2924--2936, 2019

  10. [10]

    Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

    Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018

  11. [11]

    Ultrafeedback: Boosting language models with high-quality feedback

    Ganqu Cui, Lifan Yuan, Ning Ding, Guanming Yao, Wei Zhu, Yuan Ni, Guotong Xie, Zhiyuan Liu, and Maosong Sun. Ultrafeedback: Boosting language models with high-quality feedback, 2023

  12. [12]

    Safe RLHF: Safe Reinforcement Learning from Human Feedback

    Josef Dai, Xuehai Pan, Ruiyang Sun, Jiaming Ji, Xinbo Xu, Mickel Liu, Yizhou Wang, and Yaodong Yang. Safe rlhf: Safe reinforcement learning from human feedback. arXiv preprint arXiv:2310.12773, 2023

  13. [13]

    Less is more: Improving llm alignment via preference data selection

    Xun Deng, Han Zhong, Rui Ai, Fuli Feng, Zheng Wang, and Xiangnan He. Less is more: Improving llm alignment via preference data selection. arXiv preprint arXiv:2502.14560, 2025

  14. [14]

    Alpacafarm: A simulation framework for methods that learn from human feedback

    Yann Dubois, Xuechen Li, Rohan Taori, Tianyi Zhang, Ishaan Gulrajani, Jimmy Ba, Carlos Guestrin, Percy Liang, and Tatsunori B Hashimoto. Alpacafarm: A simulation framework for methods that learn from human feedback. In Advances in Neural Information Processing Systems, 2023

  15. [15]

    Scaling laws for reward model overoptimization

    Leo Gao, John Schulman, and Jacob Hilton. Scaling laws for reward model overoptimization. In International Conference on Machine Learning, 2023

  16. [16]

    The language model evaluation harness

    Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac'h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. The languag...

  17. [17]

    The Llama 3 Herd of Models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

  18. [18]

    Direct language model alignment from online AI feedback

    Shangmin Guo, Biao Zhang, Tianlin Liu, Tianqi Liu, Misha Khalman, Felipe Llinares, Alexandre Rame, Thomas Mesnard, Yao Zhao, Bilal Piot, et al. Direct language model alignment from online ai feedback. arXiv preprint arXiv:2402.04792, 2024

  19. [19]

    Deberta: Decoding-enhanced bert with disentangled attention

    Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. Deberta: Decoding-enhanced bert with disentangled attention. In International Conference on Learning Representations, 2021

  20. [20]

    Training compute-optimal large language models

    Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. In Proceedings of the 36th International Conference on Neural Information Processing Systems, 2022

  21. [21]

    Lora: Low-rank adaptation of large language models

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022

  22. [22]

    Scaling Laws for Neural Language Models

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020

  23. [23]

    Active learning for direct preference optimization

    Branislav Kveton, Xintong Li, Julian McAuley, Ryan Rossi, Jingbo Shang, Junda Wu, and Tong Yu. Active learning for direct preference optimization. arXiv preprint arXiv:2503.01076, 2025

  24. [24]

    Skywork-reward-v2: Scaling preference data curation via human-AI synergy

    Chris Yuhao Liu, Liang Zeng, Yuzhen Xiao, Jujie He, Jiacai Liu, Chaojie Wang, Rui Yan, Wei Shen, Fuxiang Zhang, Jiacheng Xu, Yang Liu, and Yahui Zhou. Skywork-reward-v2: Scaling preference data curation via human-ai synergy. arXiv preprint arXiv:2507.01352, 2025

  25. [25]

    Can a suit of armor conduct electricity? a new dataset for open book question answering

    Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? a new dataset for open book question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 2381--2391, 2018

  26. [26]

    Confronting reward model overoptimization with constrained RLHF

    Ted Moskovitz, Aaditya K Singh, DJ Strouse, Tuomas Sandholm, Ruslan Salakhutdinov, Anca D Dragan, and Stephen McAleer. Confronting reward model overoptimization with constrained rlhf. arXiv preprint arXiv:2310.04373, 2023

  27. [27]

    Active preference learning for large language models

    William Muldrew, Peter Hayes, Mingtian Zhang, and David Barber. Active preference learning for large language models. In International Conference on Machine Learning, 2024

  28. [28]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. In Advances in neural information processing systems, 2022

  29. [29]

    Qwen2.5 Technical Report

    Qwen Team: An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li,...

  30. [30]

    Direct preference optimization: Your language model is secretly a reward model

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in neural information processing systems, 36:53728--53741, 2023

  31. [31]

    Gpqa: A graduate-level google-proof q&a benchmark

    David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R Bowman. Gpqa: A graduate-level google-proof q&a benchmark. In COLM, 2024

  32. [32]

    Winogrande: An adversarial winograd schema challenge at scale

    Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: An adversarial winograd schema challenge at scale. Communications of the ACM, 64(9):99--106, 2021

  33. [33]

    Active hidden Markov models for information extraction

    Tobias Scheffer, Christian Decomain, and Stefan Wrobel. Active hidden Markov models for information extraction. In ISIDA, 2001

  34. [34]

    Active learning for convolutional neural networks: A core-set approach

    Ozan Sener and Silvio Savarese. Active learning for convolutional neural networks: A core-set approach. In ICLR, 2018

  35. [35]

    Active learning literature survey

    Burr Settles. Active learning literature survey. Technical report, University of Wisconsin-Madison Department of Computer Sciences, 2009

  36. [36]

    Gemma: Open Models Based on Gemini Research and Technology

    Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, et al. Gemma: Open models based on gemini research and technology. arXiv preprint arXiv:2403.08295, 2024

  37. [37]

    A new active labeling method for deep learning

    Dan Wang and Yi Shang. A new active labeling method for deep learning. In IJCNN, 2014

  38. [38]

    Bpo: Staying close to the behavior llm creates better online llm alignment

    Wenda Xu, Jiachen Li, William Yang Wang, and Lei Li. Bpo: Staying close to the behavior llm creates better online llm alignment. In Empirical Methods in Natural Language Processing, 2024

  39. [39]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

  40. [40]

    Active learning through a covering lens

    Ofer Yehuda, Avihu Dekel, Guy Hacohen, and Daphna Weinshall. Active learning through a covering lens. In NeurIPS, 2022

  41. [41]

    Hellaswag: Can a machine really finish your sentence?

    Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 4791--4800, 2019

  42. [42]

    Lima: Less is more for alignment

    Chunting Zhou, Pengfei Liu, Puxin Xu, Srini Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma, Avia Efrat, Ping Yu, Lili Yu, et al. Lima: Less is more for alignment. In Advances in Neural Information Processing Systems, 2023a

  43. [43]

    Instruction-Following Evaluation for Large Language Models

    Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. Instruction-following evaluation for large language models. arXiv preprint arXiv:2311.07911, 2023b
